shuv50 / Cancer-Prediction-Genomes-AzureML


cancer types in data processing #40

Open asaki1986 opened 5 months ago

asaki1986 commented 5 months ago

Hi,

I've practiced the model training following the tutorial you provided.

I downloaded the gdcCancer.bb file from the link in the tutorial and then converted it to a BED file using bigBedToBed.

However, some variants in the file are assigned to more than one cancer type, which differs from the information in the tutorial (TCGA-OV,TCGA-LUAD,TCGA-BRCA,TCGA-CES).

Since each variant needs a single cancer type for the later label encoding, could you please advise whether I should remove those lines, or split each of them into several lines that carry the same mutation with one cancer type each?
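To make the two options concrete, here is a minimal sketch in plain Python. It assumes each variant has been read into a dict and that the cancer types sit in a single comma-separated field; the field name `project` and the example coordinates are hypothetical, not taken from the actual gdcCancer BED layout.

```python
# Assumption: cancer types are stored as one comma-separated string in a
# field called "project" (hypothetical name; adjust to the real column).
SEP = ","

def cancer_types(row, field="project"):
    """Return the list of cancer types named in one variant row."""
    return [t.strip() for t in row[field].split(SEP) if t.strip()]

def drop_multi_type(rows, field="project"):
    """Option 1: keep only variants annotated with exactly one cancer type."""
    return [row for row in rows if len(cancer_types(row, field)) == 1]

def split_multi_type(rows, field="project"):
    """Option 2: emit one copy of the variant per cancer type."""
    out = []
    for row in rows:
        for t in cancer_types(row, field):
            out.append({**row, field: t})
    return out

# Made-up example rows, for illustration only.
rows = [
    {"chrom": "chr1", "start": 100, "end": 101, "project": "TCGA-OV"},
    {"chrom": "chr2", "start": 200, "end": 201, "project": "TCGA-LUAD,TCGA-BRCA"},
]
dropped = drop_multi_type(rows)   # only the chr1 variant remains
split = split_multi_type(rows)    # chr2 variant appears once per cancer type
```

One thing to keep in mind with the second option: the split rows have identical features but different labels, so no classifier can get all of them right.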

Best, Junfeng

asaki1986 commented 5 months ago

I've now tried both approaches: splitting the variants with multiple cancer types into separate lines, and removing those variants entirely.

The accuracy of the Random Forest Classifier is 0.2653 when those variants are retained and 0.3411 when they are removed, both of which are considerably lower than the value in the tutorial.

Also, the latest gdcCancer data contains 33 cancer types. I tried both all 33 and the 20 listed in the tutorial, and the prediction accuracy did not change much.
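For reference, the restriction to a subset of cancer types followed by label encoding can be sketched as below. This is only an illustration under assumptions: the field name `project` is hypothetical, the rows are made up, and `encode_labels` is a small stand-in that assigns integers in sorted class order rather than the tutorial's actual encoder.

```python
def filter_types(rows, keep, field="project"):
    """Keep only variants whose single cancer type is in the `keep` set."""
    return [r for r in rows if r[field] in keep]

def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes

# Made-up rows with one cancer type each (hypothetical field name).
rows = [
    {"chrom": "chr1", "start": 100, "project": "TCGA-OV"},
    {"chrom": "chr2", "start": 200, "project": "TCGA-LUAD"},
    {"chrom": "chr3", "start": 300, "project": "TCGA-BRCA"},
    {"chrom": "chr4", "start": 400, "project": "TCGA-OV"},
]

keep = {"TCGA-OV", "TCGA-BRCA"}            # stand-in for the tutorial's list
subset = filter_types(rows, keep)
codes, classes = encode_labels([r["project"] for r in subset])
```

Swapping `keep` between the full set of types and the shorter list is how the 33-type and 20-type runs differ in this sketch.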

I would greatly appreciate any suggestions.

Best, Junfeng