Y-Claw opened this issue 1 year ago
We used normalized mRNA data and actual copy numbers for CNV, all downloaded from cBioPortal. The network architecture is the same for all cancers, but it was trained separately for each cancer.
Could you please tell me whether the specific normalization method used was linear normalization, z-score, or another normalization algorithm, since this may cause slight fluctuations in the results?
We used z-transformed gene expression data.
Thank you very much. Just to be sure, did you use the raw data for methylation in the experiment? And are the z-transformed mRNA values the mRNA expression z-scores relative to diploid samples, or z-scores computed from the original mRNA expression?
Sorry, I checked the data we used again. In this study, we used data_mrna_affymetrix_microarray. I confused it with our other studies, in which we analyzed z-transformed gene expression data. For methylation, we used the raw data.
For GBM there is plenty of microarray data, but for LGG, LUAD, and KIRC the microarray data available directly from http://www.cbioportal.org/ is not sufficient. May I ask whether you downloaded the data directly from www.cbioportal.org or obtained it by other means?
For GBM, we used microarray gene expression data because it has many more samples than the RNA-Seq data; I think GBM is the only cancer in TCGA with fewer RNA-Seq samples than microarray samples. For the other cancers, we used RNA-Seq gene expression data. Yes, we downloaded all data from cBioPortal. Since our work, cBioPortal has added more data, for example several types of RNA-Seq expression data with different normalization methods. I will check the file names for the other cancers to find out which data we used.
For other cancers, we used "data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt" for gene expression data.
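For reference, a minimal sketch of how z-scores relative to diploid samples can be reproduced in Python (the matrix layout below, genes in rows and samples in columns with annotation columns already dropped, is an assumption about the cBioPortal files, not our exact code):

```python
import pandas as pd

def zscore_vs_diploid(expr: pd.DataFrame, cna: pd.DataFrame) -> pd.DataFrame:
    """Z-score each gene's expression against its diploid samples.

    expr : log-scale expression matrix, genes in rows, samples in columns
    cna  : discrete copy-number calls on the same genes/samples (0 = diploid)
    """
    genes = expr.index.intersection(cna.index)
    samples = expr.columns.intersection(cna.columns)
    expr, cna = expr.loc[genes, samples], cna.loc[genes, samples]

    diploid = cna.eq(0)                    # per-gene mask of diploid samples
    mu = expr.where(diploid).mean(axis=1)  # per-gene mean over diploid samples
    sigma = expr.where(diploid).std(axis=1)
    return expr.sub(mu, axis=0).div(sigma, axis=0)
```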
Thank you. I'll try again.
Even after aligning the raw data, I still could not reproduce the results reported in the paper on the other datasets. There may still be some data-processing details that make a difference, but I don't want to keep bothering you. Could you provide the code for your experiments on the other datasets? I'll check the remaining questions myself.
Did you generate the pathway images that are needed as input to our model?
Do you mean that the other three datasets use different pathways than GBM, i.e., that the pathways have to be re-selected for each cancer?
The input to our model consists of pathway images generated by applying PCA to the three omics data types. I meant: did you generate the input pathway images?
Yes, I use pathway images generated by applying PCA to the three omics data types as input.
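For concreteness, the kind of processing I have in mind is roughly the following (scikit-learn defaults; the number of components kept, the NA handling, and the dictionary layout are assumptions on my side, which may be exactly where the difference comes from):

```python
import numpy as np
from sklearn.decomposition import PCA

N_PC = 5  # assumed number of principal components kept per pathway and omics type

def pathway_images(omics, pathways, n_pc=N_PC):
    """Build one (pathways x omics*PCs) image per sample.

    omics    : dict {omics_name: DataFrame}, genes in rows, samples in columns;
               all matrices are assumed to share the same sample columns in order
    pathways : dict {pathway_name: [gene symbols]}, e.g. parsed from the .gmt file
    Returns an array of shape (n_samples, n_pathways, n_omics * n_pc).
    """
    rows = []
    for genes in pathways.values():
        per_omics = []
        for mat in omics.values():
            x = mat.loc[mat.index.intersection(genes)].T.fillna(0.0)  # samples x genes
            k = min(n_pc, x.shape[0], x.shape[1])
            pcs = PCA(n_components=k).fit_transform(x.values)
            if k < n_pc:                       # zero-pad small pathways
                pcs = np.pad(pcs, ((0, 0), (0, n_pc - k)))
            per_omics.append(pcs)
        rows.append(np.concatenate(per_omics, axis=1))  # samples x (omics * PCs)
    return np.stack(rows, axis=1)                       # samples x pathways x features
```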
Files for LGG have been added. Please note that for the other cancers the threshold used to define long-term survival is 36 months.
Thank you very much. The data you provided is indeed better than what I have been able to produce myself. With your data, I can reach an AUC of 86.5 after adjusting the hyperparameters, but there are still some differences from what is reported in the paper. While making these modifications, I noticed that the risk threshold should be set to 36 months, and I changed the class weights. Does this mean that the remaining difference in performance is only due to hyperparameters? I'll try more hyperparameter settings in the next few days.
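For the class weights, I used something along the lines of scikit-learn's balanced heuristic (this is my own choice, not necessarily what you did):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y: binary long-term-survival labels (1 = survived > 36 months); toy values here
y = np.array([0, 0, 0, 0, 1, 1])
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: weights[0], 1: weights[1]}   # {0: 0.75, 1: 1.5} for this toy y
# then: model.fit(..., class_weight=class_weight)
```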
Did you remove the cases that were still alive with a last follow-up <= 3 years in LGG? Since the analysis is based on cross-validation, there is some variability in the AUC at each iteration. It is also possible that the optimized hyperparameters are different.
Yes, to confirm the effect I made some modifications to your code. When the risk threshold is changed to 36 months, your code removes the cases that were still alive with a last follow-up <= 3 years. As for the variability in AUC, the AUC of a single iteration is not very convincing; I ran the same experiment as yours, but with only 5 rounds so that I could test hyperparameters faster.
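For reference, the 36-month filtering I applied looks roughly like this (the column names follow the usual cBioPortal clinical files; this is my reading of your code, not a copy of it):

```python
import pandas as pd

THRESHOLD_MONTHS = 36  # long-term survival cutoff for the non-GBM cancers

def label_long_term(clinical: pd.DataFrame) -> pd.DataFrame:
    """Label long-term survivors and drop uninformative censored cases.

    Assumes the cBioPortal clinical columns 'OS_MONTHS' and 'OS_STATUS'
    (values like '0:LIVING' / '1:DECEASED'). Patients still alive at last
    follow-up with OS_MONTHS <= 36 are removed, since their long-term
    status is unknown.
    """
    alive = clinical["OS_STATUS"].astype(str).str.contains("LIVING")
    keep = ~(alive & (clinical["OS_MONTHS"] <= THRESHOLD_MONTHS))
    out = clinical.loc[keep].copy()
    out["long_term"] = (out["OS_MONTHS"] > THRESHOLD_MONTHS).astype(int)
    return out
```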
Do you still have a record of the best hyperparameters on LGG?
In LGG, 512 nodes were used in a fully connected layer along with age as a clinical variable.
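Schematically, the relevant part of the model looks like the sketch below; the convolutional layers and image size here are only illustrative, and only the 512-node dense layer and the age input reflect the LGG setting mentioned above:

```python
from tensorflow import keras
from tensorflow.keras import layers

N_PATHWAYS, N_FEATURES = 146, 15   # illustrative image size (pathways x omics*PCs)

img_in = keras.Input(shape=(N_PATHWAYS, N_FEATURES, 1), name="pathway_image")
age_in = keras.Input(shape=(1,), name="age")                # clinical variable

x = layers.Conv2D(32, (3, 3), activation="relu")(img_in)    # conv block is illustrative
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Concatenate()([x, age_in])                        # append age to the flattened features
x = layers.Dense(512, activation="relu")(x)                  # 512-node fully connected layer
out = layers.Dense(1, activation="sigmoid")(x)               # long- vs short-term survival

model = keras.Model([img_in, age_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
```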
May I ask how you handle NA values? Do you eliminate the whole column, fill with zeros, or fill with normal values?
After hyperparameter tuning I can train on your data and reach the results in the paper, but the data I process myself is still quite different from the data you provided. For the LGG data I downloaded from here, I used scikit-learn's default PCA, applied to each pathway individually. Could you please provide a copy of the LGG data-processing code? I'll check the details myself.
I'm sorry to bother you, but could you please let me know whether you still used the KEGG database version c2.cp.kegg.v5.2.symbols.gmt in the LGG experiment? Also, is the methylation data downloaded from cBioPortal, specifically the data_methylation_hm450.txt file of lgg_tcga (http://www.cbioportal.org/study/summary?id=luad_tcga)? I tried applying PCA to the aforementioned data with Python's scikit-learn library, but the results I obtained are slightly different from the methylation data you uploaded. I processed the other omics data in a similar manner, and there are also some differences compared to the data you uploaded. Could you please tell me about any differences in your settings, or provide a data-processing script (even if it cannot be directly executed) so that I can align my data-processing workflow? Thank you very much.
Yes, we used c2.cp.kegg.v5.2.symbols.gmt. All TCGA data were downloaded from cBioPortal. In the data folder, an LGG subfolder was created where you can find two MATLAB scripts.
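If it helps, the .gmt file can be parsed in Python as follows (this is a generic reader for the MSigDB .gmt format, not a translation of the MATLAB scripts):

```python
def read_gmt(path="c2.cp.kegg.v5.2.symbols.gmt"):
    """Parse an MSigDB .gmt file into {pathway_name: [gene symbols]}.

    Each line is: name <tab> description/URL <tab> gene1 <tab> gene2 ...
    """
    pathways = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                pathways[fields[0]] = fields[2:]
    return pathways
```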
Are the mRNA and CNV data used in your LGG, LUAD, and KIRC experiments log-transformed or normalized? And may I ask whether the network structure and hyperparameters are the same as in the GBM experiment?