TCGA的拷贝数变异 - Githubissues

TCGA的拷贝数变异 by 生信星球

今天是生信星球陪你的第853天

大神一句话，菜鸟跑半年。我不是大神，但我可以缩短你走弯路的半年~

就像歌儿唱的那样，如果你不知道该往哪儿走，就留在这学点生信好不好~

这里有豆豆和花花的学习历程，从新手到进阶，生信路上有你有我!

1.下载数据

TCGA数据的下载方式五花八门，选择最方便的xena咯，对于每个癌症，上面都有提供两个cnv文件，以CHOL为例：

https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CHOL.cnv.tsv.gz

https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CHOL.masked_cnv.tsv.gz

官网上的解释是：

Masked copy number segments are generated using the same method except that a filtering step is performed that removes the Y chromosome and probe sets that were previously indicated to be associated with frequent germline copy-number variation.

也就是去掉了与生殖相关的位点,就用它。

a = read.delim("TCGA-CHOL.cnv.tsv.gz")
b = read.delim("TCGA-CHOL.masked_cnv.tsv.gz")
nrow(a)
## [1] 20535
nrow(b)
## [1] 7585

看一下数据：

head(b)
##             sample Chrom     Start       End   value
## 1 TCGA-3X-AAVB-01A     1   3301765 119471954 -0.0096
## 2 TCGA-3X-AAVB-01A     1 119472129 119472300 -1.1028
## 3 TCGA-3X-AAVB-01A     1 119472845 171825498 -0.0081
## 4 TCGA-3X-AAVB-01A     1 171825768 171826101 -1.2851
## 5 TCGA-3X-AAVB-01A     1 171828591 188521368 -0.0065
## 6 TCGA-3X-AAVB-01A     1 188524154 191426270 -0.0513

里面的最后一列的segment mean,可以根据它的值大小来确定是发生了扩增还是缺失。(>0.2扩增，<0.2缺失)。

关于segment mean的解释The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

会看到这里面只有探针位置和样本编号，却没有基因名称。这个可以用R包进行注释

2.注释到基因

用到神奇Y叔的ChIPseeker包。

pos = b
pos$Chrom = paste0("chr",pos$Chrom)
library(stringr)
require(ChIPseeker)
library(org.Hs.eg.db)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
peak <- GRanges(sample = pos[,1],
                Segment_Mean = pos[,5],
                seqnames=Rle(pos[,2]),
                ranges=IRanges(pos[,3], pos[,4]), 
                strand=rep(c("*"), nrow(pos)))
peak
## GRanges object with 7585 ranges and 2 metadata columns:
##          seqnames              ranges strand |           sample Segment_Mean
##             <Rle>           <IRanges>  <Rle> |      <character>    <numeric>
##      [1]     chr1   3301765-119471954      * | TCGA-3X-AAVB-01A      -0.0096
##      [2]     chr1 119472129-119472300      * | TCGA-3X-AAVB-01A      -1.1028
##      [3]     chr1 119472845-171825498      * | TCGA-3X-AAVB-01A      -0.0081
##      [4]     chr1 171825768-171826101      * | TCGA-3X-AAVB-01A      -1.2851
##      [5]     chr1 171828591-188521368      * | TCGA-3X-AAVB-01A      -0.0065
##      ...      ...                 ...    ... .              ...          ...
##   [7581]     chrX    3236359-24887548      * | TCGA-W5-AA2W-01A       0.0117
##   [7582]     chrX   24888212-32929525      * | TCGA-W5-AA2W-01A       0.0497
##   [7583]     chrX   32930587-44396671      * | TCGA-W5-AA2W-01A       0.0121
##   [7584]     chrX   44402046-85844381      * | TCGA-W5-AA2W-01A      -0.0322
##   [7585]     chrX  85845006-155677414      * | TCGA-W5-AA2W-01A       0.0068
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths
txdb=TxDb.Hsapiens.UCSC.hg38.knownGene
peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000),
                         TxDb=txdb, annoDb="org.Hs.eg.db")
## >> preparing features information...      2021-09-11 03:39:08 PM 
## >> identifying nearest features...        2021-09-11 03:39:09 PM 
## >> calculating distance from peak to TSS...   2021-09-11 03:39:14 PM 
## >> assigning genomic annotation...        2021-09-11 03:39:14 PM 
## >> adding gene annotation...          2021-09-11 03:40:14 PM 
## >> assigning chromosome lengths           2021-09-11 03:40:14 PM 
## >> done...                    2021-09-11 03:40:14 PM
pos_anno=as.data.frame(peakAnno)
pos_anno[1:6,c("sample","Segment_Mean","SYMBOL")]
##             sample Segment_Mean    SYMBOL
## 1 TCGA-3X-AAVB-01A      -0.0096    MIR942
## 2 TCGA-3X-AAVB-01A      -1.1028    HSD3B1
## 3 TCGA-3X-AAVB-01A      -0.0081   PDZK1P1
## 4 TCGA-3X-AAVB-01A      -1.2851      DNM3
## 5 TCGA-3X-AAVB-01A      -0.0065 LINC01699
## 6 TCGA-3X-AAVB-01A      -0.0513 LINC01351

这个就是注释的结果，可以方便的查找每个样本里的拷贝数变异情况。

参考： https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/ https://mp.weixin.qq.com/s/lEPlgjbmLxn9Z1Vsjw0RFw https://mp.weixin.qq.com/s/WMe_sLIf6sNxqGiYKqk0nA 如果因为代码看不懂，而跟不上正文的节奏，可以来找我，系统学习。以下课程都是循环开课。下一期的时间，点进去咨询微信咯

生信零基础入门学习小组

生信入门班（四周线上直播课）

数据挖掘班（医生/医学生首选）

ixxmu / mp_duty

TCGA的拷贝数变异 #1268

TCGA的拷贝数变异 by 生信星球

1.下载数据

2.注释到基因