Closed ixxmu closed 3 years ago
今天是生信星球陪你的第853天
大神一句话,菜鸟跑半年。我不是大神,但我可以缩短你走弯路的半年~
就像歌儿唱的那样,如果你不知道该往哪儿走,就留在这学点生信好不好~
这里有豆豆和花花的学习历程,从新手到进阶,生信路上有你有我!
TCGA数据的下载方式五花八门,选择最方便的xena咯,对于每个癌症,上面都有提供两个cnv文件,以CHOL为例:
https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CHOL.cnv.tsv.gz
https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-CHOL.masked_cnv.tsv.gz
官网上的解释是:
Masked copy number segments are generated using the same method except that a filtering step is performed that removes the Y chromosome and probe sets that were previously indicated to be associated with frequent germline copy-number variation.
也就是去掉了与生殖相关的位点,就用它。
a = read.delim("TCGA-CHOL.cnv.tsv.gz")
b = read.delim("TCGA-CHOL.masked_cnv.tsv.gz")
nrow(a)
## [1] 20535
nrow(b)
## [1] 7585
看一下数据:
head(b)
## sample Chrom Start End value
## 1 TCGA-3X-AAVB-01A 1 3301765 119471954 -0.0096
## 2 TCGA-3X-AAVB-01A 1 119472129 119472300 -1.1028
## 3 TCGA-3X-AAVB-01A 1 119472845 171825498 -0.0081
## 4 TCGA-3X-AAVB-01A 1 171825768 171826101 -1.2851
## 5 TCGA-3X-AAVB-01A 1 171828591 188521368 -0.0065
## 6 TCGA-3X-AAVB-01A 1 188524154 191426270 -0.0513
里面的最后一列的segment mean,可以根据它的值大小来确定是发生了扩增还是缺失。(>0.2扩增,<0.2缺失)。
关于segment mean的解释The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
会看到这里面只有探针位置和样本编号,却没有基因名称。这个可以用R包进行注释
用到神奇Y叔的ChIPseeker包。
pos = b
pos$Chrom = paste0("chr",pos$Chrom)
library(stringr)
require(ChIPseeker)
library(org.Hs.eg.db)
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
peak <- GRanges(sample = pos[,1],
Segment_Mean = pos[,5],
seqnames=Rle(pos[,2]),
ranges=IRanges(pos[,3], pos[,4]),
strand=rep(c("*"), nrow(pos)))
peak
## GRanges object with 7585 ranges and 2 metadata columns:
## seqnames ranges strand | sample Segment_Mean
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 3301765-119471954 * | TCGA-3X-AAVB-01A -0.0096
## [2] chr1 119472129-119472300 * | TCGA-3X-AAVB-01A -1.1028
## [3] chr1 119472845-171825498 * | TCGA-3X-AAVB-01A -0.0081
## [4] chr1 171825768-171826101 * | TCGA-3X-AAVB-01A -1.2851
## [5] chr1 171828591-188521368 * | TCGA-3X-AAVB-01A -0.0065
## ... ... ... ... . ... ...
## [7581] chrX 3236359-24887548 * | TCGA-W5-AA2W-01A 0.0117
## [7582] chrX 24888212-32929525 * | TCGA-W5-AA2W-01A 0.0497
## [7583] chrX 32930587-44396671 * | TCGA-W5-AA2W-01A 0.0121
## [7584] chrX 44402046-85844381 * | TCGA-W5-AA2W-01A -0.0322
## [7585] chrX 85845006-155677414 * | TCGA-W5-AA2W-01A 0.0068
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
txdb=TxDb.Hsapiens.UCSC.hg38.knownGene
peakAnno <- annotatePeak(peak, tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Hs.eg.db")
## >> preparing features information... 2021-09-11 03:39:08 PM
## >> identifying nearest features... 2021-09-11 03:39:09 PM
## >> calculating distance from peak to TSS... 2021-09-11 03:39:14 PM
## >> assigning genomic annotation... 2021-09-11 03:39:14 PM
## >> adding gene annotation... 2021-09-11 03:40:14 PM
## >> assigning chromosome lengths 2021-09-11 03:40:14 PM
## >> done... 2021-09-11 03:40:14 PM
pos_anno=as.data.frame(peakAnno)
pos_anno[1:6,c("sample","Segment_Mean","SYMBOL")]
## sample Segment_Mean SYMBOL
## 1 TCGA-3X-AAVB-01A -0.0096 MIR942
## 2 TCGA-3X-AAVB-01A -1.1028 HSD3B1
## 3 TCGA-3X-AAVB-01A -0.0081 PDZK1P1
## 4 TCGA-3X-AAVB-01A -1.2851 DNM3
## 5 TCGA-3X-AAVB-01A -0.0065 LINC01699
## 6 TCGA-3X-AAVB-01A -0.0513 LINC01351
这个就是注释的结果,可以方便的查找每个样本里的拷贝数变异情况。
参考: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/ https://mp.weixin.qq.com/s/lEPlgjbmLxn9Z1Vsjw0RFw https://mp.weixin.qq.com/s/WMe_sLIf6sNxqGiYKqk0nA 如果因为代码看不懂,而跟不上正文的节奏,可以来找我,系统学习。以下课程都是循环开课。下一期的时间,点进去咨询微信咯
copy 注视
https://mp.weixin.qq.com/s/Td-0GVoMzTlDIk1h3iKnpQ