Closed ixxmu closed 3 years ago
懒人中翻到了TCGAbiolinks
今天是生信星球陪你的第788天
大神一句话,菜鸟跑半年。我不是大神,但我可以缩短你走弯路的半年~
就像歌儿唱的那样,如果你不知道该往哪儿走,就留在这学点生信好不好~
这里有豆豆和花花的学习历程,从新手到进阶,生信路上有你有我!
豆豆写于2021.2.25
下午偶然翻到TCGAbiolinks
的说明书,看到几个相关的workshop,感觉很不错,推荐给大家
直达workshop【更新于2019-08-30】:
https://rpubs.com/tiagochst/TCGAworkshop
这个workshop的目的是:
学会从Genomic Data Commons (GDC)下载TCGA数据
如何将数据导入并分析
TCGA的数据主要存放在两个不同的数据库:
The legacy(译为:”遗留“) archive:
https://portal.gdc.cancer.gov/legacy-archive/search/f,主要存放早期的一些数据。官方的介绍是:legacy data is not actively maintained, processed, or harmonized by the GDC
Harmonized database : https://portal.gdc.cancer.gov/,也就是大家最常见的网站。它是使用了统一的流程,并使用了hg38进行处理得到的结果,数据比较新
有一篇文章也介绍了二者的不同:
Gao, Galen F., et al. “Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons’ Data.” Cell systems 9.1 (2019): 24-34. (https://doi.org/10.1016/j.cels.2019.06.006)
主要结论就是:
TCGA早期的大部分样本是比对到hg19(还有少部分比对到hg18),叫做legacy omics data
后来各种软件流程逐渐成熟,GDC重新进行了一次操作,比对到了hg38,叫做harmonized omics data
hg19和hg38版本得到的数据差别不大:The GRCh37 (hg19) and GRCh38 (hg38) TCGA data versions are highly concordant
Open: includes high level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements.
Controlled: includes individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements
每个样本都有一个独特的名字,叫做:barcode
图片来自:https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/
当然,每一部分还有更为详细的说明:https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables
在这里可以看到 https://portal.gdc.cancer.gov/repository
Project:比如不同的大型项目,如TCGA、TARGET
Data category:比如Transcriptome Profiling, DNA methylation, Clinical
Data type:比如Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value
Experimental strategy:比如miRNA-Seq, RNA-Seq
Workflow Type
Platform
Access type
总而言之,它们的层级关系是:
A project => has data on several categories => each category contains several data types => each type might have been produced with different workflows, experimental strategy and platforms
SummarizedExperiment
是非常常用的,它的示意图是(来自http://bioconductor.org/packages/SummarizedExperiment/):
You can access samples metadata with
colData
function, genomics data withassays
and genomics metadata withrowRanges
.
将从Harmonized GDC database (hg38 version)下载数据
主要包括以下几个方面:
Access TCGA clinical data
Search, download and make an R object from TCGA data
RNA-seq data
DNA methylation
Download samples metadata
Download and visualize mutations
Download and visualize copy number alterations
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap") # from bioconductor
if (!requireNamespace("TCGAbiolinks", quietly = TRUE))
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") # from github
library(TCGAbiolinks)
library(MultiAssayExperiment)
library(maftools)
library(dplyr)
library(ComplexHeatmap)
GDC的临床信息有这几个来源:
XML files: original source of the data。包含更多的信息,比如radiation, drugs information, follow-ups, biospecimen, etc。下面的index数据就是XML的一个子集而已
indexed clinical: a refined clinical data that is created using the XML files. 它包含的信息是不断更新的,比如一个病人活着的时候被收集了临床信息,但是后来继续调查时不幸因病去世,这时index数据就会显示死亡;而XML中就会比较冗余(首先是记录之前还活着,然后又记录了后来死亡)
BCR Biotab: tsv files parsed from XML files
示例来自:https://portal.gdc.cancer.gov/cases/127f738d-afa1-42f6-a9a0-f33806667b84
clinical <- GDCquery_clinic("TCGA-COAD")
head(clinical)
就以前面看到的病人TCGA-AA-3562为例:
clinical %>%
dplyr::filter(submitter_id == "TCGA-AA-3562") %>%
t %>%
as.data.frame
除了上面的index信息,还可以获取基于XML的Biotab文件(tab分隔),不过需要TCGAbiolinks版本高于2.13.5
query <- GDCquery(project = "TCGA-ACC",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
> names(clinical.BCRtab.all)
[1] "clinical_radiation_acc"
[2] "clinical_omf_v4.0_acc"
[3] "clinical_patient_acc"
[4] "clinical_follow_up_v4.0_acc"
[5] "clinical_follow_up_v4.0_nte_acc"
[6] "clinical_nte_acc"
[7] "clinical_drug_acc"
# 提取其中药物治疗信息
clinical.BCRtab.all$clinical_drug_acc %>%
head %>%
as.data.frame
之前说过,GDC后来将TCGA数据重新分析了一遍,采用了统一的流程,对于RNA-Seq就是
(图片来自:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/)
然后有几种不同的RNA-Seq结果类型:
在GDCquery
中,可以这么设置参数,尤其是workflow.type
有多种选择:
data.category: “Transcriptome Profiling”
data.type: “Gene Expression Quantification”
workflow.type: “HTSeq - Counts”, “HTSeq - FPKM”, “HTSeq - FPKM-UQ”
query.exp.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)
使用GDCquery_Maf
:download MAF aligned against hg38
# Pipelines 可选参数: muse, varscan2, somaticsniper, mutect
maf <- GDCquery_Maf("COAD", pipelines = "muse")
maf %>% head %>% as.data.frame
# 可视化
# using maftools for data summary
maftools.input <- maf %>% read.maf
# Check summary
plotmafSummary(maf = maftools.input,
rmOutlier = TRUE,
addStat = 'median',
dashboard = TRUE)
oncoplot(maf = maftools.input,
top = 10,
removeNonMutated = TRUE)
# classifies Single Nucleotide Variants into Transitions and Transversions
titv = titv(maf = maftools.input,
plot = FALSE,
useSyn = TRUE)
plotTiTv(res = titv)
# extract sample summary
getSampleSummary(maftools.input)
pipeline在:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
Numeric focal-level Copy Number Variation (CNV) values were generated with “Masked Copy Number Segment” files from tumor aliquots using GISTIC2 on a project level.
只保留了蛋白编码基因,并且设置的CNV values noise cutoff 为0.3
Genes with focal CNV values smaller than -0.3 are categorized as a “loss” (-1)
Genes with focal CNV values larger than 0.3 are categorized as a “gain” (+1)
Genes with focal CNV values between and including -0.3 and 0.3 are categorized as “neutral” (0)
query <- GDCquery(project = "TCGA-GBM",
data.category = "Copy Number Variation",
data.type = "Gene Level Copy Number Scores",
access = "open")
GDCdownload(query)
scores <- GDCprepare(query)
scores[1:5,1:5]
可以进行可视化
scores.matrix <- scores %>%
dplyr::select(-c(1:3)) %>% # Removes metadata from the first 3 columns
as.matrix
rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)
# gain in more than 100 samples
gain.more.than.hundred.samples <- which(rowSums(scores.matrix == 1) > 100)
# loss in more than 100 samples
loss.more.than.hundred.samples <- which(rowSums(scores.matrix == -1) > 100)
lines.selected <- c(gain.more.than.hundred.samples,loss.more.than.hundred.samples)
Heatmap(scores.matrix[lines.selected,],
show_column_names = FALSE,
show_row_names = TRUE,
row_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))
pipeline在:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/
数值在0-1之间,0 表示unmethylated , 1表示 fully methylated
query_met.hg38 <- GDCquery(project = "TCGA-GBM",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 27",
barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)
data.hg38
## class: RangedSummarizedExperiment
## dim: 27578 2
## metadata(1): data_release
## assays(1): ''
## rownames(27578): cg00000292 cg00002426 ... cg27662877 cg27665659
## rowData names(7): Composite.Element.REF Gene_Symbol ...
## CGI_Coordinate Feature_Type
## colnames(2): TCGA-02-0116-01A-01D-0199-05
## TCGA-14-3477-01A-01D-0915-05
## colData names(113): sample patient ...
## subtype_Telomere.length.estimate.in.blood.normal..Kb.
## subtype_Telomere.length.estimate.in.tumor..Kb.
# 看看探针信息
library(SummarizedExperiment)
data.hg38 %>% rowRanges %>% as.data.frame
然后样本注释信息
data.hg38 %>% colData %>% as.data.frame
DNA甲基化探针信号值信息
data.hg38 %>% assay %>% head %>% as.data.frame
看看top10变化最大的探针
# plot 10 most variable probes
data.hg38 %>%
assay %>%
rowVars %>%
order(decreasing = TRUE) %>%
head(10) -> idx
pal_methylation <- colorRampPalette(c("#000436",
"#021EA9",
"#1632FB",
"#6E34FC",
"#C732D5",
"#FD619D",
"#FF9965",
"#FFD32B",
"#FFFC5A"))(100)
Heatmap(assay(data.hg38)[idx,],
show_column_names = TRUE,
show_row_names = TRUE,
name = "Methylation Beta-value",
row_names_gp = gpar(fontsize = 8),
column_names_gp = gpar(fontsize = 8),
col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))
比如:ATAC-seq Workshop:
http://rpubs.com/tiagochst/atac_seq_workshop
Workshop for multi-omics analysis with ELMER:https://rpubs.com/tiagochst/ELMER_workshop
点击底部的“阅读原文”,获得更好的阅读体验哦😻
初学生信,很荣幸带你迈出第一步
🤓生信星球 🌎~ 一个不拽术语、通俗易懂的生信知识平台
https://mp.weixin.qq.com/s/uEHzI8t_0PWhNRlOhQZYyQ