mskspi / PathCNN

Interpretable convolutional neural networks on multi-omics data predict long-term survival in glioblastoma

Some questions about the LGG LUAD KIRC experiment #5

Open Y-Claw opened 1 year ago

Y-Claw commented 1 year ago

Are the mRNA and CNV data used in your LGG, LUAD, and KIRC experiments log-transformed or normalized? Also, are the network structure and hyperparameters the same as in the GBM experiment?

mskspi commented 1 year ago

We used normalized mRNA data and actual copy numbers for CNV, all downloaded from cBioPortal. The network architecture is the same for all cancers, but the model was trained separately for each cancer.

Y-Claw commented 1 year ago

Could you please tell me whether the normalization method used is linear normalization, z-score, or some other normalization algorithm? The choice may cause slight fluctuations in the results.

mskspi commented 1 year ago

We used z-transformed gene expression data.

Y-Claw commented 1 year ago

Thank you very much. Just to be sure, did you use the raw data for methylation in the experiment? And is the z-transformed mRNA the set of mRNA expression z-scores relative to diploid samples, or a z-transform of the original mRNA expression?

mskspi commented 1 year ago

Sorry. I checked the data we used again. In this study, we used data_mrna_affymetrix_microarray. I had confused this with our other studies, where we analyzed z-transformed gene expression data. For methylation, we used the raw data.

Y-Claw commented 1 year ago

For GBM, there is plenty of microarray data, but for LGG, LUAD, and KIRC the microarray data that can be downloaded directly from http://www.cbioportal.org/ is not enough. May I ask whether you downloaded the data directly from www.cbioportal.org or obtained it by other means?

mskspi commented 1 year ago

For GBM, we used microarray gene expression data because it has many more samples than the RNA-Seq data. I think GBM is the only cancer in TCGA with fewer RNA-Seq samples than microarray samples. For other cancers, we used RNA-Seq gene expression data. Yes, we downloaded all data from cBioPortal. Since our work, cBioPortal has added more data, for example several types of RNA-Seq expression data with different normalization methods. I will check the file names for the other cancers to find out which data we used.

mskspi commented 1 year ago

For other cancers, we used "data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt" for gene expression data.
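For anyone comparing this file with a plain z-transform of the expression matrix: the file name indicates z-scores computed against diploid samples as the reference population. The sketch below contrasts the two calculations under stated assumptions. The function names, the toy matrix, the boolean diploid mask, and the use of the sample standard deviation (`ddof=1`) are illustrative choices and are not taken from cBioPortal's pipeline or from this repository.

```python
import numpy as np

def zscore_all_samples(expr):
    """Plain z-score per gene (row), using every sample as the reference."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, ddof=1, keepdims=True)
    return (expr - mean) / std

def zscore_ref_diploid(expr, is_diploid):
    """Z-score per gene, using only samples flagged diploid as the reference."""
    ref = np.where(is_diploid, expr, np.nan)  # hide non-diploid samples
    mean = np.nanmean(ref, axis=1, keepdims=True)
    std = np.nanstd(ref, axis=1, ddof=1, keepdims=True)
    return (expr - mean) / std

# Toy genes x samples matrix (made-up values).
expr = np.array([[1.0, 2.0, 3.0, 10.0],
                 [5.0, 5.5, 6.0, 6.5]])
# Hypothetical mask: the last sample is not diploid for the first gene.
is_diploid = np.array([[True, True, True, False],
                       [True, True, True, True]])

z_all = zscore_all_samples(expr)
z_dip = zscore_ref_diploid(expr, is_diploid)
```

The two results agree for a gene where every sample is diploid and diverge otherwise, which is one reason a locally computed z-transform may not match the downloaded file.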

Y-Claw commented 1 year ago

Thank you. I'll try again.

Y-Claw commented 1 year ago

Even after aligning the raw data, I still couldn't reproduce the results reported in the paper on the other datasets. There are probably still some data processing details that make a difference, but I don't want to keep bothering you. Could you provide the code for your experiments on the other datasets? I'll check the remaining questions myself.

mskspi commented 1 year ago

Did you make pathway images that need to be used in our model?

Y-Claw commented 1 year ago

Do you mean that the other three datasets use different pathways than GBM? Do the pathways have to be re-selected for each cancer?

mskspi commented 1 year ago

The input data to our model are pathway images generated using PCA on three types of omics data. What I meant was: did you make the input pathway images?
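As a rough illustration of what "pathway images generated using PCA" could look like, here is a minimal NumPy sketch. It is not the repository's MATLAB code. The number of components per omics type, the row/column layout of the image, the assumption that all three matrices share the same gene columns, and every name below (`pca_scores`, `build_pathway_images`, the toy genes and pathways) are assumptions made for the example.

```python
import numpy as np

def pca_scores(x, n_components):
    """Project samples (rows of x) onto the top principal components."""
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def build_pathway_images(omics, pathways, gene_index, n_components=2):
    """Return an array of shape (n_samples, n_pathways, n_omics * n_components).

    Each sample gets one "image": a row per pathway, and for every omics
    type a block of PCA scores computed on that pathway's genes only.
    """
    n_samples = next(iter(omics.values())).shape[0]
    images = np.zeros((n_samples, len(pathways), len(omics) * n_components))
    for p, genes in enumerate(pathways.values()):
        cols = [gene_index[g] for g in genes if g in gene_index]
        for o, matrix in enumerate(omics.values()):
            scores = pca_scores(matrix[:, cols], n_components)
            start = o * n_components
            images[:, p, start:start + scores.shape[1]] = scores
    return images

# Toy data: 6 samples, 5 genes, three omics matrices with random values.
rng = np.random.default_rng(0)
genes = ["G1", "G2", "G3", "G4", "G5"]
gene_index = {g: i for i, g in enumerate(genes)}
omics = {name: rng.normal(size=(6, len(genes)))
         for name in ("mrna", "cnv", "methylation")}
pathways = {"PATHWAY_A": ["G1", "G2", "G3"],
            "PATHWAY_B": ["G3", "G4", "G5"]}

images = build_pathway_images(omics, pathways, gene_index)
print(images.shape)  # (6, 2, 6)
```

In this layout each sample becomes a small 2-D array that a CNN can consume; the ordering of the pathway rows is left arbitrary here.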

Y-Claw commented 1 year ago

Yes, I use pathway images generated by PCA on the three omics data types as input.

mskspi commented 1 year ago

Files for lgg were added. Please note that the threshold for other cancers to define long-term survival is 36 months.

Y-Claw commented 1 year ago

Thank you very much. The data you provided is indeed better than what I have been able to process. With your data, I can reach an AUC of 86.5 after adjusting the hyperparameters, but this still differs somewhat from what is reported in the paper. While modifying the code, I noticed that the risk threshold should be set to 36 months, and I changed the class weight. Does this mean that the remaining difference in performance is only due to hyperparameters? I'll try more hyperparameters in the next few days.

mskspi commented 1 year ago

Did you remove the LGG cases who were still alive at a last follow-up of <= 3 years? Since the analysis is based on cross-validation, there is some variability in AUC at each iteration. It is also possible that the optimized parameters are different.
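The exclusion rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this thread (36-month threshold; cases alive with follow-up <= 36 months are dropped); the function name, the 1/0 label encoding, and the toy cases are assumptions and are not taken from the repository's code.

```python
def label_long_term_survival(months, deceased, threshold=36.0):
    """Return 1 (long-term), 0 (short-term), or None (excluded).

    months   -- overall survival or last follow-up time in months
    deceased -- True if the patient died, False if alive at last follow-up
    """
    if months > threshold:
        return 1      # survived past the threshold, whatever happened later
    if deceased:
        return 0      # died at or before the threshold
    return None       # alive but followed for <= threshold: outcome unknown

# Toy cases: (months, deceased)
cases = [(50.0, False), (40.0, True), (20.0, True), (12.0, False), (36.0, False)]
labels = [label_long_term_survival(m, d) for m, d in cases]
kept = [(case, lab) for case, lab in zip(cases, labels) if lab is not None]
print(labels)  # [1, 1, 0, None, None]
```

Dropping the censored short follow-up cases changes both the sample size and the class balance, so it is worth confirming the resulting counts before comparing AUC values.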

Y-Claw commented 1 year ago

Yes. To confirm the effect, I made some modifications to your code. When the risk is changed to 36 months, your code removes the cases who survived with a last follow-up of <= 3 years. As for the variability in AUC, the AUC of a single iteration is not very convincing. I ran the same experiment as yours, but with only 5 rounds so that I could test hyperparameters faster.

Y-Claw commented 1 year ago

Do you still have a record of the best hyperparameters on lgg?

mskspi commented 1 year ago

In LGG, 512 nodes were used in a fully connected layer along with age as a clinical variable.

Y-Claw commented 1 year ago

May I ask how you handle NA values? Do you eliminate the whole column, fill it with zeros, or fill it with normal values?

Y-Claw commented 1 year ago

After hyperparameter adjustment, I can reproduce the results in the paper using your data, but the data I processed myself is really different from the data you provided. For the LGG data I downloaded from here, I used sklearn's default PCA and ran it on each pathway individually. Could you please provide a copy of the LGG data processing code? I'll check the details myself.

Y-Claw commented 1 year ago

I'm sorry to bother you, but could you please let me know whether you still used the KEGG gene set file c2.cp.kegg.v5.2.symbols.gmt in the LGG experiment? Also, was the methylation data downloaded from cBioPortal, specifically the data_methylation_hm450.txt file from lgg_tcga (http://www.cbioportal.org/study/summary?id=luad_tcga)? I applied PCA to this data with Python's scikit-learn library, but the results are slightly different from the methylation data you uploaded. I processed the other omics data in the same way, and they also differ somewhat from the data you uploaded. Could you please tell me about any differences in your settings, or provide a data processing script (even if it cannot be run directly) so that I can align the data processing workflow? Thank you very much.
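One general point when comparing PCA outputs from two implementations: principal component scores are only defined up to a sign flip per component, so two correct results can differ in sign. Whether that explains the differences described here is not confirmed anywhere in this thread. The sketch below, with a hypothetical helper `align_signs` and made-up score matrices, aligns the signs before measuring the remaining difference.

```python
import numpy as np

def align_signs(scores, reference):
    """Flip each column of `scores` so it correlates positively with `reference`."""
    signs = np.sign(np.sum(scores * reference, axis=0))
    signs[signs == 0] = 1.0  # leave orthogonal columns unchanged
    return scores * signs

# Made-up samples x components score matrices.
reference = np.array([[1.0, -2.0],
                      [0.5, 1.0],
                      [-1.5, 1.0]])
flipped = reference * np.array([1.0, -1.0])  # second component has opposite sign

aligned = align_signs(flipped, reference)
max_abs_diff = np.max(np.abs(aligned - reference))
```

If a noticeable difference remains after sign alignment, the cause lies elsewhere, for example in the input file, the gene set, or preprocessing steps such as centering, scaling, or missing-value handling.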

mskspi commented 1 year ago

Yes, we used c2.cp.kegg.v5.2.symbols.gmt. All TCGA data were downloaded from cBioPortal. In the data folder, we created a folder (LGG) where you can find two MATLAB scripts.

  1. data_merge_small_pathway.m: perform PCA analysis on multi-omics data
  2. reorder_data.m: reorder PCA results based on the correlation of pathways