waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

Question about CNA data #52

Closed: bblodfon closed this issue 2 years ago

bblodfon commented 2 years ago

Hi,

I would like to better understand what exactly the CNA values represent and how simplifyTCGA() transforms them for a specific TCGA study. Is there documentation about this somewhere?

For example, check the following two matrices:

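library(curatedTCGAData)
library(MultiAssayExperiment)  # for assay() and [,, subsetting on MultiAssayExperiment objects

# Download all PAAD assays, then simplify them with TCGAutils
cancer_data = curatedTCGAData(diseaseCode = 'PAAD', assays = '*', version = '2.0.1', dry.run = FALSE)
cancer_data_simplified = TCGAutils::simplifyTCGA(cancer_data)

# Transposed (samples in rows) CNA SNP matrices before and after simplification
cna_snp_mat1 = t(assay(cancer_data[,,"PAAD_CNASNP-20160128"]))
cna_snp_mat2 = t(assay(cancer_data_simplified[,,"PAAD_CNASNP-20160128_simplified"]))

LiNk-NY commented 2 years ago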

Hi John (@bblodfon), these are Segment_Mean values, and they are reduced with a weightedmean function. I've updated the documentation with details: https://github.com/waldronlab/TCGAutils/commit/dd538820f8e3a83d6023b69b6e61fd1b3960e6a5
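
To make the reduction a bit more concrete, here is a minimal base-R sketch of a width-weighted mean. It assumes that each segment's Segment_Mean is weighted by the number of bases it shares with the target range (for example a gene). The helper name weighted_segment_mean and the toy coordinates are made up for illustration, and the actual weightedmean function in TCGAutils may differ in its details.

```r
# Hypothetical helper (not part of TCGAutils): width-weighted mean of
# segment values over a single query range with inclusive coordinates.
weighted_segment_mean <- function(seg_start, seg_end, seg_mean, q_start, q_end) {
  # Number of bases each segment shares with the query range (0 if disjoint)
  overlap <- pmax(0, pmin(seg_end, q_end) - pmax(seg_start, q_start) + 1)
  # No overlapping segment at all: nothing to summarize
  if (sum(overlap) == 0) return(NA_real_)
  sum(seg_mean * overlap) / sum(overlap)
}

# Toy data: a range spanning 101-200 that is covered half by a segment
# with Segment_Mean 0 and half by a segment with Segment_Mean 1.
result <- weighted_segment_mean(
  seg_start = c(1, 151),
  seg_end   = c(150, 300),
  seg_mean  = c(0, 1),
  q_start   = 101,
  q_end     = 200
)
print(result)  # 0.5
```

Under this assumption, a range covered by several segments gets a single value dominated by whichever segment covers most of it, which would explain why the simplified matrix has one value per range and sample.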

I couldn't quickly find the documentation for the Broad Firehose pipeline, but I saw that https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/ has the following:

"The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values."
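
As a quick illustration of that definition, the sketch below applies log2(copy-number / 2) to a few made-up copy numbers and then inverts it. It only reflects the GDC text quoted above; whether the Firehose-derived values in curatedTCGAData follow exactly the same definition is an assumption here.

```r
# Made-up absolute copy numbers: one-copy loss, diploid, gain, amplification
copy_number <- c(1, 2, 3, 4)

# Segment mean as defined in the quoted GDC text
segment_mean <- log2(copy_number / 2)
print(round(segment_mean, 3))

# Inverting the transform recovers the copy number
recovered <- 2 * 2^segment_mean
print(recovered)
```

The diploid entry maps to 0, the loss to a negative value, and the gains to positive values, matching the description in the quote.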