ucscXena / wrangle

data_wrangling

Derived datasets for each sample: 1. fraction of genome altered by copy number change 2. number of mutations #1

Open jingchunzhu opened 7 years ago

jingchunzhu commented 7 years ago

Build derived datasets for each sample:

  1. Fraction of genome altered by copy number change
  2. Number of mutations

Hi,

Would you guys be able to create a track for 1: fraction of genome altered by copy number change and 2: number of mutations, for each TCGA cohort or for pan-cancer? It used to be available in cBioPortal. The number of mutations per sample is still available, but the fraction of the genome altered by copy number is no longer available. Someone from MSKCC is working on getting that live again. Or is there a way to generate this data by downloading it from Xena and calculating it myself?

Thanks,

jingchunzhu commented 7 years ago
  1. Total mutation count (mutation burden): It is only important to know how many mutations are present. The specific mutations are not important.

  2. Fraction of genome altered by copy number (0-1): cBioPortal has calculated it as follows: The fraction of copy number altered genome = length of segments with log2 CNA value larger than 0.2 divided by the length of all segments measured. This is basically a measurement of genomic instability.
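
The two definitions above can be sketched in Python as follows. This is a minimal sketch, not cBioPortal's or Xena's actual code: the tuple layouts `(sample, chrom, start, end, log2_value)` and `(sample, gene)`, the function names, and the example values are all assumptions made for illustration. The cutoff is applied exactly as worded above (value larger than 0.2).

```python
from collections import Counter, defaultdict

def mutation_counts(mutations):
    """Count mutation records per sample; which mutation it is does not matter."""
    return Counter(sample for sample, *_ in mutations)

def fraction_genome_altered(segments, cutoff=0.2):
    """Per sample: length of segments with value > cutoff / length of all segments."""
    altered = defaultdict(int)
    total = defaultdict(int)
    for sample, chrom, start, end, value in segments:
        length = end - start
        total[sample] += length
        if value > cutoff:
            altered[sample] += length
    # Skip samples with no measured length to avoid dividing by zero.
    return {s: altered[s] / total[s] for s in total if total[s] > 0}

# Made-up example data (hypothetical values, not real TCGA segments).
segments = [
    ("S1", "chr1", 0, 100, 0.5),
    ("S1", "chr1", 100, 400, 0.0),
    ("S2", "chr1", 0, 200, -0.8),
    ("S2", "chr2", 0, 200, 0.3),
]
mutations = [("S1", "TP53"), ("S1", "PTEN"), ("S2", "EGFR")]

fga = fraction_genome_altered(segments)
counts = mutation_counts(mutations)
```

With the definition as worded, the S2 segment at -0.8 does not count as altered, because only values above the cutoff are included.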

Question: is there any background on the cutoff of 0.2?

jingchunzhu commented 7 years ago

In GBM, classifying PTEN with the 0.2 cutoff gives 84% of samples with a PTEN deletion. Is this about right?

http://dev.xenabrowser.net/heatmap/?bookmark=a05f9847421717d27d5e6fa60a67e79b

http://dev.xenabrowser.net/heatmap/?bookmark=723e4cd313b2380869a255f5dde62171

jingchunzhu commented 7 years ago

“In a diploid genome, a single-copy gain in a perfectly pure, homogeneous sample has a copy ratio of 3/2. In log2 scale, this is log2(3/2) = 0.585, and a single-copy loss is log2(1/2) = -1.0.” However, most tumors are heterogeneous (clonal tumor populations) and contain some normal stroma. Therefore, the sample’s purity and heterogeneity need to be considered so that alterations are not missed, which means using a lower threshold. I have also seen a lot of cancer-focused publications using 0.2 as a threshold. I am guessing 0.2 is used for these reasons.
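
The arithmetic in the quote, and the effect of normal contamination, can be checked with a short sketch. The mixing formula below is a common simplification and an assumption here, not something stated in this thread: tumor cells at fraction `purity` carry `tumor_copies`, the remaining cells carry 2 copies, and the reference is diploid. The function name is hypothetical.

```python
import math

def expected_log2_ratio(tumor_copies, purity=1.0):
    """Expected log2 copy ratio for a tumor/normal mixture (assumed simple model)."""
    # Average copy number across tumor cells and diploid normal cells.
    mean_copies = purity * tumor_copies + (1.0 - purity) * 2
    # math.log2 raises ValueError if mean_copies is 0
    # (homozygous deletion in a perfectly pure sample).
    return math.log2(mean_copies / 2)

pure_gain = expected_log2_ratio(3)               # log2(3/2), about 0.585
pure_loss = expected_log2_ratio(1)               # log2(1/2) = -1.0
gain_at_30 = expected_log2_ratio(3, purity=0.3)  # log2(1.15), about 0.20
```

Under this assumed model, a single-copy gain present in about 30% of the cells lands close to 0.2, which is one way to read a threshold of that size.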

The frequency of a PTEN deletion (one or both alleles) in GBM is 89% (514/577).

duxiuju commented 6 years ago

Dear jingchunzhu, I would like to ask whether 'log2 CNA value larger than 0.2' means a value larger than +0.2 or an absolute value larger than 0.2. If it only means a value larger than +0.2, copy number loss is neglected, right?
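
The difference between the two readings can be made concrete with a small sketch. It does not settle which one cBioPortal uses; the segment values are made up, and the function name and the `use_abs` flag are hypothetical.

```python
def altered_fraction(segments, cutoff=0.2, use_abs=False):
    """segments: iterable of (start, end, log2_value) for one sample."""
    total = 0
    altered = 0
    for start, end, value in segments:
        length = end - start
        total += length
        # Signed reading keeps gains only; absolute reading keeps gains and losses.
        score = abs(value) if use_abs else value
        if score > cutoff:
            altered += length
    return altered / total if total else float("nan")

# One gained segment, one lost segment, one neutral segment (made-up values).
one_sample = [(0, 100, 0.6), (100, 200, -1.0), (200, 400, 0.0)]

gains_only = altered_fraction(one_sample)                      # 100 / 400
gains_and_losses = altered_fraction(one_sample, use_abs=True)  # 200 / 400
```

In this example the signed reading reports 0.25 and the absolute reading reports 0.5, so the lost segment is dropped entirely under the signed reading.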