ucscXena / wrangle

data_wrangling

generate segmented CNV segment mean distribution #3

Open jingchunzhu opened 6 years ago

jingchunzhu commented 6 years ago

TCGA CNV data comes as CNV segments. Each segment has a value, called the segment mean. These values are continuous. However, biologists often describe copy number as gain or loss, which are discrete states. I am interested in the distribution of segment means for the TCGA pan-cancer cohort, as well as for each individual TCGA cohort. The distribution will help me determine the cutoffs for calling copy number gain or loss.

In a diploid genome, a single-copy gain in a perfectly pure, homogeneous sample has a copy ratio of 3/2. On the log2 scale, this is log2(3/2) = 0.585, and a single-copy loss is log2(1/2) = -1.0. However, most tumors are heterogeneous (multiple clonal tumor populations) and contain some normal stroma. Therefore, the sample's purity needs to be considered so that alterations are not missed.
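
To make the purity effect concrete, here is a minimal sketch of the expected log2 ratio for a tumor/normal mixture. It assumes a simple two-component model (one tumor clone plus diploid normal cells); the function name `expected_log2_ratio` and its parameters are made up for illustration and are not taken from the TCGA pipeline.

```python
import math


def expected_log2_ratio(tumor_copies, purity, normal_copies=2):
    """Expected log2 copy ratio for a mixture of tumor and normal cells.

    Assumption (not from the issue): a single tumor clone mixed with
    diploid normal cells, with `purity` the tumor cell fraction.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    # Average copy number across the mixed cell population.
    mixed_copies = purity * tumor_copies + (1.0 - purity) * normal_copies
    if mixed_copies <= 0:
        # Homozygous deletion in a 100% pure sample: ratio is zero.
        return float("-inf")
    return math.log2(mixed_copies / normal_copies)


# Pure sample: reproduces the values quoted above.
print(round(expected_log2_ratio(3, 1.0), 3))  # single-copy gain
print(round(expected_log2_ratio(1, 1.0), 3))  # single-copy loss

# 50% purity: the same events give much smaller shifts.
print(round(expected_log2_ratio(3, 0.5), 3))
print(round(expected_log2_ratio(1, 0.5), 3))
```

Under this model, at 50% purity a single-copy gain shrinks to about 0.32 and a single-copy loss to about -0.42, which is why a cutoff chosen for pure samples would miss alterations in impure ones.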

  1. Generate the copy number segment mean distribution without adjusting for purity.
  2. Generate the purity-adjusted copy number segment mean distribution.
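
The two steps could be sketched roughly as follows. This is only an illustration under assumptions: purity is taken as a known per-sample fraction, the adjustment inverts a simple tumor plus diploid-normal mixture model, and the helper names (`adjust_for_purity`, `histogram`), the `floor` value, and the bin width are all hypothetical. It is not the linked script's method.

```python
import math
from collections import Counter


def adjust_for_purity(seg_mean, purity, floor=-5.0):
    """Estimate the tumor-only log2 ratio from an observed segment mean.

    Assumes observed_ratio = purity * tumor_ratio + (1 - purity),
    i.e. one tumor clone mixed with diploid normal cells.
    `floor` is an arbitrary cap for segments that imply zero copies.
    """
    observed_ratio = 2.0 ** seg_mean
    tumor_ratio = (observed_ratio - (1.0 - purity)) / purity
    if tumor_ratio <= 0:
        return floor
    return max(floor, math.log2(tumor_ratio))


def histogram(values, bin_width=0.1):
    """Count values per bin; keys are the lower edge of each bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {round(b * bin_width, 10): n for b, n in sorted(counts.items())}


# Toy input: (segment mean, sample purity) pairs, invented for illustration.
segments = [(0.32, 0.5), (-0.42, 0.5), (0.02, 0.8), (0.58, 1.0)]

# Step 1: distribution of raw segment means.
raw_hist = histogram([mean for mean, _ in segments])

# Step 2: distribution of purity-adjusted segment means.
adj_hist = histogram([adjust_for_purity(mean, p) for mean, p in segments])

print(raw_hist)
print(adj_hist)
```

One design question this leaves open is whether each segment should count once or be weighted by its length; the sketch counts each segment once.
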
jingchunzhu commented 6 years ago

https://github.com/CarpeVida/bme160/blob/master/GenomicsInstitute/CNVHistogram2.py

CarpeVida commented 6 years ago

Here are the updated graphs. I'm still seeing a weird pattern in KIRC, with multimodal distributions in many graphs.

[Attached graphs: uvm_cnv_graph, ucs_cnv_graph, ucec_cnv_graph, thym_cnv_graph, thca_cnv_graph, tgct_cnv_graph, stad_cnv_graph, skcm_cnv_graph, sarc_cnv_graph, read_cnv_graph, prad_cnv_graph, pcpg_cnv_graph, pancan_cnv_graph, ov_cnv_graph, meso_cnv_graph, lusc_cnv_graph, lung_cnv_graph, luad_cnv_graph, lihc_cnv_graph, lgg_cnv_graph, laml_cnv_graph, kirp_cnv_graph, kirc_cnv_graph, kich_cnv_graph, hnsc_cnv_graph, gbmlgg_cnv_graph, gbm_cnv_graph, esca_cnv_graph, dlbc_cnv_graph, coadread_cnv_graph, coad_cnv_graph, chol_cnv_graph, cesc_cnv_graph, brca_cnv_graph, blca_cnv_graph, acc_cnv_graph]