Closed ixxmu closed 1 year ago
简 介
癌症基因组图谱项目 (The Cancer Genome Atlas, TCGA) 已对 30 多种不同的癌症进行了测序,每种癌症类型的样本量超过 200 个。TCGA的数据收录在GDC (Genomic data commons) 库中,由体细胞变异组成的结果数据以 MAF 文件格式存储。
The process for modifying a protected MAF into a somatic MAF
https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/
引用
Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch. PMID: 30341162
论文
https://genome.cshlp.org/content/28/11/1747.full.pdf+html
论文 fig 1. maftools概述
表头描述可用的模块,即:可视化、设置操作、变异注释 和 分析。maftools包中为每个模块提供了一个小说明和相应的函数 (粗体、斜体)。典型的工作流:从MAF对象创建开始,要么读取MAF文件,要么将现有的注释转换为MAF对象,进而将其作为输入传递给所需的函数 (箭头)。
论文 fig 2. maftools可视化模块生成的关键图
(A) 显示ESCA (Esophageal carcinoma, 食管癌) 队列的肿瘤体细胞突变景观图 (Oncoplot displaying the somatic landscape)。基因根据其突变频率 (Mutation frequency) 排序,样本根据疾病组织学 (Histology) 排序,如注释栏(此图的底部)所示。侧栏图:显示了 MutSigCV (接受来自多个样本的WGS/WES数据,以及有关点突变、小插入/缺失和覆盖率的信息,并识别出比偶然预期的突变频率更高的基因) 估计的 log10 变换的Q值。
(A,B)分别在ESCC和EAC (食管鳞癌和食管腺癌) 中鉴定的突变特征 (Mutational signature)。Y轴表示:96个三核苷酸基序 (6*4*4=96:每1个三联体中,6种Ti/Tv中的每1种已确定,剩余2个核苷酸可随机组合) 所展示的整体特征 (Exposure of 96 trinucleotide motifs to overall signature)。图的标题显示了与验证的 COSMIC 特征和 cosine 相似值,以及所建议的原因 (Proposed etiology) 的最佳匹配。
突变特征 (Mutational Signatures) 及APOBEC等富集
突变特征 (Mutational Signatures),可参考 Nature (2013, doi:10.1038/nature12477) 中的定义。所有的癌症都是由体细胞突变引起的 (All cancers are caused by somatic mutations),目前对于该生物学过程的理解有限。而来自癌症基因组的体细胞突变目录具有已经起作用的突变过程的特征 (The catalogue of somatic mutations from a cancer genome bears the signatures of the mutational processes that have been operative)。即:细胞在成长过程中基因组不断受到内源性和外源性DNA损伤的威胁,不断发生变化,最终发生一些突变的积累;每一个突变过程都会留下一个不同的基因组标记,称为"Mutational Signatures"。通过分析来自7,042个癌症 (Primary cancers of 30 different classes; 507 from WGS and 6,535 from WES)的4,938,362个突变,并提取了20多种不同的突变特征。一些特征存在于许多癌症类型中,特别是胞苷脱氨酶APOBEC家族的特征,而其它局限于单一的癌症类型。某些特征与癌症诊断时患者的年龄、已知的诱变暴露或DNA维持缺陷有关,但许多特征是未知的。除了这些全基因组突变特征外,在许多癌症类型中还发现了位于小基因组区域“kataegis”的超突变。突变特征的研究,揭示了癌症发展背后的突变过程的多样性,对了解癌症病因、预防和治疗具有潜在意义。
COSMIC (Catalogue Of Somatic Mutations In Cancer, 癌症体细胞突变目录) 数据库记录了与人类癌症相关的驱动基因,搜集和记录了大量文献中的数百万肿瘤样本的编码区变异,以及促进癌症的所有体细胞突变的非编码区变异、基因融合、拷贝数变异和耐药基因变异。COSMIC采用人工审核的方式扩充数据库,实时追踪文献的最新信息,将与癌症相关的重要基因优先放入数据库中。COSMIC-3D可研究三维蛋白结构中的突变,及其对蛋白功能和药物的影响。
TGFBR2 (TGF-Beta Receptor Type-2) 基因编码的蛋白质是一种跨膜蛋白,具有蛋白激酶结构域,与TGF-β受体1型 (TGF-beta receptor type-1) 形成异二聚体复合物,并结合TGF-β。这种受体/配体复合物可使蛋白磷酸化,然后进入细胞核并调节与细胞增殖、细胞周期停滞、伤口愈合、免疫抑制和肿瘤发生相关的基因的转录。
论文 fig 4. 队列比较 和 蛋白结构域的富集分析
(A) EAC和ESCC之间的差异突变基因 (Differentially mutated genes between EAC and ESCC) 显示为森林图 (Forest plot)。条形表示优势比 (Odds ratio) 的95%置信区间。右侧的表格包括:在被P值突出显示的基因的突变中,各自在EAC和ESCC中的样本数量。P值表示阈值的重要性:(∗∗∗) P<0.001;(∗∗) P<0.01;费舍尔精确检验。
基因突变的互斥性与共现性
按照协同和互斥的作用模式,可将突变分为以下两类:1. Co-occurencing mutations;2. Mutually exclusive mutations。互斥的驱动基因往往共享相同的通路 (Pathway) 或 不同进展途径 (Different progression pathways,如不同肿瘤类型) 中的基因。
肿瘤中已存在的基因突变可能影响 (或在功能上协同于) 其它基因的突变。与互斥模式相反,一些驱动基因之间也可能存在协同模式,两个协同的驱动突变往往同时或相继发生,共同促进肿瘤的发展。
(C,D)分别在EAC和ESCC中频繁突变的 pfam 蛋白结构域。突出显示了前10个域。气泡大小与包含突出显示的结构域的基因数量成比例。Ion_ trans 结构域在EAC中大量突变。
(A) AML中互斥和共存的 (Exclusive and co-occurring) 基因对显示为三角矩阵。绿色表示共现 (co-occurring) 趋势,而粉色表示排它性 (Exclusiveness) 趋势。
(B) 通过CoMEt精确检验,发现AML中存在显著改变的途径,涉及NPM1、RUNX1和TP53基因,以相互排斥的方式突变(P<0.001)。
(C) TCGA AML队列中由 oncodrive 鉴定的疾病相关驱动基因(FDR<0.1)。括号内突出显示了紧密间隔突变簇 (Closely spaced mutational clusters) 的数量。
(D) MutSigCV 鉴定的ESCA中显著突变基因与泛癌驱动基因的泛癌比较。TGFBR2和ZNF750仅在食管癌中发生突变,而其它驱动因素,如TP53和CDKN2A,在整个泛癌队列中发生突变。
(E,F)条形图分别显示了基因与临床特征、FAB(French–American–British)分类和年龄组之间的相关性(P<0.05,Fisher精确检验)。条形图用突变样本与总样本的比率进行注释。误差条显示二项比率的95% CI。Y轴表示与表型相关的样本分数 (Fraction of samples associated with the phenotype)。
rm(list = ls())
# 本机R的版本
# [64-bit] C:\Program Files\R\R-4.0.4
# 本机R包的位置
.libPaths()
# "C:/Users/HP/Documents/R/win-library/4.0" "C:/Program Files/R/R-4.0.4/library"
# 设置R包的 install.packages 的下载镜像
site = "https://mirrors.tuna.tsinghua.edu.cn/CRAN"
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager", repos = site)
current_packages = rownames(installed.packages())
needed_packages = c("maftools", "BSgenome.Hsapiens.UCSC.hg38", "BSgenome.Hsapiens.UCSC.hg19")
for (i in needed_packages) {
if (!i %in% current_packages)
BiocManager::install(i, update = F, site_repository=site)
current_packages = rownames(installed.packages())
}
needed_packages2 <- c("mclust", "NMF", "barplot3d", "knitr")
for (j in needed_packages2) {
if (!j %in% current_packages)
install.packages(j)
current_packages = rownames(installed.packages())
}
library(maftools)
library(mclust)
library(NMF)
library(pheatmap)
library(barplot3d)
library(openxlsx)
帮 助
# R
??maftools
R包的帮助
maftools包的示例代码
输入文件
1. 对于 VCF 文件或一些表格,可简单地使用 vcf2maf 程序 (链接见文末):注释 VCF,排序或设置转录本的优先级,并生成 MAF 文件。
2. GATK的更新版本,如v4.0版,可使用 gakt funcotator (链接见文末) 生成 MAF 文件。
3. 如果使用 ANNOVAR (链接见文末) 进行变异注释,maftools 有一个方便的函数 annovarToMaf ,用于将表格形式的 annovar 输出文件,转换为 MAF。
必须的字段:Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type 和 Tumor_Sample_Barcode
推荐的可选字段:包含 VAF(Variant Allele Frequency)和 氨基酸变化 信息的非 MAF 特定字段 (non MAF specific fields)。
MAF 文件的完整文件规范,可以在 NCI GDC 文档页面上找到 (链接见文末)。
本文的对maftools的介绍,使用了 TCGA的 LAML 队列 (链接见文末),用其MAF 文件作为示例,展示了 maftools 的使用及应用。
maftools : Summarize, Analyze and Visualize MAF Files
Anand Mayakonda 2022-04-26
https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html
https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial
ANNOVAR
https://annovar.openbioinformatics.org/en/latest/
示例数据:TCGA的 LAML 队列
rm(list = ls())
# 本机R的版本
# [64-bit] C:\Program Files\R\R-4.0.4
# 本机R包的位置
.libPaths()
## [1] "C:/Users/HP/Documents/R/win-library/4.0"
## [2] "C:/Program Files/R/R-4.0.4/library"
# "C:/Users/HP/Documents/R/win-library/4.0" "C:/Program Files/R/R-4.0.4/library"# 设置R包的 install.packages 的下载镜像site = "https://mirrors.tuna.tsinghua.edu.cn/CRAN"
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = site)current_packages = rownames(installed.packages())
needed_packages <- c("maftools", "BSgenome.Hsapiens.UCSC.hg38", "BSgenome.Hsapiens.UCSC.hg19")
for (i in needed_packages) { if (!i %in% current_packages) BiocManager::install(i, update = F, site_repository=site) current_packages = rownames(installed.packages())}
needed_packages2 <- c("mclust", "NMF", "barplot3d", "knitr")
for (j in needed_packages2) { if (!j %in% current_packages) install.packages(j) current_packages = rownames(installed.packages())}
library(maftools)
library(mclust)
library(NMF)
library(pheatmap)
library(barplot3d)
??maftools
设置:maftools包自带的、示例数据的位置,即TCGA LAML MAF文件。本机在:
* C:\Users\HP\Documents\R\win-library\4.0\maftools\extdata\tcga_laml.maf.gz
* 共2208行,表头、字段包含:
* Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
* End_position Strand Variant_Classification Variant_Type Reference_Allele
* Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
* Protein_Change i_TumorVAF_WU i_transcript_name
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
设置:患者生存信息、组织学 (Histology)分类等的临床信息。该文件的可选的 (Optional)
* 共200个样本:
* Tumor_Sample_Barcode FAB_classification days_to_last_followup Overall_Survival_Status
* TCGA-AB-2802 M4 365 1
* TCGA-AB-2803 M3 792 1
* TCGA-AB-2804 M3 2557 0
... ...
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')
读入MAF文件(及临床信息)
laml.maf
## [1] "C:/Users/HP/Documents/R/win-library/4.0/maftools/extdata/tcga_laml.maf.gz"
laml.clin
## [1] "C:/Users/HP/Documents/R/win-library/4.0/maftools/extdata/tcga_laml_annot.tsv"
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475
## -Summarizing
## -Processing clinical data
## -Finished in 1.900s elapsed (0.360s cpu)
Silent variants:包含 同义突变/Synonymous
沉默突变即同义突变,Silent mutation;同义突变的,Synonymous
syn_vars = laml_dt[Variant_Classification %in% "Silent"]
非同义突变,non-Synonymous mutation
nsyn_vars = laml_dt[Variant_Classification %in% "Missense_Mutation"]
# str(laml)slotNames(laml)
## [1] "data" "variants.per.sample"
## [3] "variant.type.summary" "variant.classification.summary"
## [5] "gene.summary" "summary"
## [7] "maf.silent" "clinical.data"
注意:
* read.maf函数默认将 Silent variants 拆分出来了,放在 laml@maf.silent 中,而非 laml@data 中
table(laml@data$Variant_Type)
##
## DEL INS SNP
## 64 141 1527
table(laml@data$Variant_Classification)
##
## Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 52 91 10 42
## Missense_Mutation Nonsense_Mutation Splice_Site
## 1342 103 92
laml@maf.silent
laml
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 193 NA NA
# Shows sample summrygetSampleSummary(laml)
# Shows gene summarygetGeneSummary(laml)
# shows clinical data associated with samplesgetClinicalData(laml)
# Shows all fields in MAFgetFields(laml)
## [1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
## [4] "NCBI_Build" "Chromosome" "Start_Position"
## [7] "End_Position" "Strand" "Variant_Classification"
## [10] "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode" "Protein_Change"
## [16] "i_TumorVAF_WU" "i_transcript_name"
将MAF文件的总结结果写到文件中
# Writes maf summary to an output file with basename lamlwrite.mafSummary(maf = laml, basename = 'laml')
属于设置操作 (Set operation):根据用户自定义,对MAF对象取子集
取部分样本
# Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933' (Printing just 5 rows for display convenience)subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'), mafObj = FALSE)[1:5]
# Same as above but return output as an MAF object (Default behaviour)subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'))
## An object of class MAF
## ID summary Mean Median
## 1: NCBI_Build 37 NA NA
## 2: Center genome.wustl.edu NA NA
## 3: Samples 2 NA NA
## 4: nGenes 34 NA NA
## 5: Frame_Shift_Ins 5 2.5 2.5
## 6: In_Frame_Ins 1 0.5 0.5
## 7: Missense_Mutation 26 13.0 13.0
## 8: Nonsense_Mutation 2 1.0 1.0
## 9: Splice_Site 1 0.5 0.5
## 10: total 35 17.5 17.5
部分基因,部分变异类型
# Select all Splice_Site mutations from DNMT3A and NPM1subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE,query = "Variant_Classification == 'Splice_Site'")
是否生成新的 MAF class
laml.sub.1 = subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = TRUE,query = "Variant_Classification == 'Splice_Site'")
部分基因,部分变异类型,部分字段
# Same as above but include only 'i_transcript_name' column in the outputsubsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE, query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')
根据某个临床信息筛选maf子集:clinQuery参数
# Select all samples with FAB clasification M4 in clinical data laml_m4 = subsetMaf(maf = laml, clinQuery = "FAB_classification %in% 'M4'")
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
显示样本队列的体细胞突变景观图 (Oncoplot displaying the somatic landscape of sample cohort)
* 基因根据其突变频率 (Mutation frequency) 排序。
* 样本(可选)根据疾病组织学 (Histology) 排序。提现到注释栏(图的底部)所示。
* 侧栏图(可选):可显示了 MutSigCV 估计的 log10 变换的Q值。
* MutSigCV接受来自多个样本的WGS/WES数据,以及有关点突变、小插入/缺失和覆盖率的信息,并识别出比人们偶然预期的突变频率更高的基因
# oncoplot for top ten mutated genes.oncoplot(maf = laml, top = 10)
转换和颠换图 (Transition and transversion plot, Ti/Tv)显示了样本队列中SNV的分布
* 分为 6 个转换和颠换事件
* 堆积条形图(底部)显示了MAF文件中每个样本的突变谱分布 (Distribution of mutation spectra)
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)# plot titv summaryplotTiTv(res = laml.titv)
棒棒糖图 (Lollipop plot) 显示了样本队列中某个基因的 突变在蛋白上的分布,以及蛋白结构域。
* 图中可标记了*反复出现的热点突变* (Recurrent hotspots)
* 体细胞突变率 (Somatic mutation rate) 和转录本名称分别由图的标题和副标题表示
* 该函数假设了蛋白质变化信息存储在'Protein_Change'列下
* 如果需要,可使用参数 AACol 重新定义
1个基因经常包含多个不同的转录本:
* 可使用 refSeqID 或 proteinID 参数自定义某个特定的转录本名称
* # HGNC refseq.ID protein.ID aa.length
* # 1: DNMT3A NM_175629 NP_783328 912
* # 2: DNMT3A NM_022552 NP_072046 912
* # 3: DNMT3A NM_153759 NP_715640 723
* # Using longer transcript NM_175629 for now.
当某个突变的氨基酸位置,未出现在上述转录本的全长中,程序提示:
* Removed 3 mutations for which AA position was not available
# lollipop plot for one gene, which is one of the most frequent mutated gene in tumorlollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE)
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629 NP_783328 912
## 2: DNMT3A NM_022552 NP_072046 912
## 3: DNMT3A NM_153759 NP_715640 723
是否将蛋白结构域名称作为 Legend 展示
* 'labelPos'参数用于指定想要标记的突变
lollipopPlot(maf = laml, gene = 'DNMT3A', showDomainLabel = FALSE, labelPos = 882)
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629 NP_783328 912
## 2: DNMT3A NM_022552 NP_072046 912
## 3: DNMT3A NM_153759 NP_715640 723
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")brca = read.maf(maf = brca, verbose = FALSE)table(brca@data$Tumor_Sample_Barcode)
##
## TCGA-A8-A08B
## 31
显示某单个样本的、高突变基因组区域的降雨图/雨点图 (Rainfall plot)
* 每个点都是根据SNV的类别进行颜色编码
* 通过变点法 (Change-point method) 识别的超高突变基因组片段,可用黑色箭头突出显示
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.6)
## Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1: 8 98129348 98133560 7 702.0000 4212
## 2: 8 98398549 98403536 9 623.3750 4987
## 3: 8 98453076 98456466 9 423.7500 3390
## 4: 8 124090377 124096810 22 306.3333 6433
## 5: 12 97436055 97439705 7 608.3333 3650
## 6: 17 29332072 29336153 8 583.0000 4081
## Tumor_Sample_Barcode C>G C>T
## 1: TCGA-A8-A08B 4 3
## 2: TCGA-A8-A08B 1 8
## 3: TCGA-A8-A08B 1 8
## 4: TCGA-A8-A08B 1 21
## 5: TCGA-A8-A08B 4 3
## 6: TCGA-A8-A08B 4 4
* 将输入MAF中的突变负荷与来自MC3项目的所有33个TCGA队列进行比较
* 肿瘤突变负荷 (Tumor mutational burden, TMB)是指在患者的癌细胞中发现的基因突变数量
* 通常:高TMB的癌症更可能对免疫治疗或靶向治疗等治疗类型,有反应;而低TMB的肿瘤对这些治疗类型的反应不大
laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')
# 对突变负荷 (Mutation burden) 的差异,进行:配对t检验 (Pairwise t-test) # Warning message: # In FUN(X[[i]], ...) : Removed 1 samples with zero mutations.
基因的 VAF (Variant Allele Frequency, 等位基因变异频率) 分布
* 箱线图绘制中,抖动(jitter)中的每个点(dot),都是该基因的 1种变异
* VAF 是检测到的、与特定DNA变异匹配的测序Reads的百分比,分母是该位点的总Reads数
* VAF 是原始样本中携带变异的DNA分子比例的测量 (例如可代表血浆cfDNA中突变肿瘤DNA的百分比)
# Plots vaf distribution of genes as a boxplot. Each dot in the jitter is a variant.# pdf("Fig.箱线图绘制基因的vaf分布_抖动-jitter-中的每个点-dot-都是1个变异.pdf")plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')
# dev.off()
maftools也提供了整合和分析拷贝数变异 (CNV) 数据的选项
* CNV数据可来自:GISTIC和CBS (Circular binary segmentation)算法
* GISTIC (Genomic identification of significant targets in cancer, 癌症中重要靶点的基因组鉴定)
* GISTIC模块可识别在1组样本中显着扩增或缺失 (Amplified or Deleted)的基因组区域
* 每个异常 (Aberration) 都被分配了一个G-score,该得分考虑了异常的幅度、及跨样本发生的频率
* https://broadinstitute.github.io/gistic2/
* CBS (Circular binary segmentation, 循环二元分割算法)是常用的 SNParray 数据分段算法,使用芯片数据检测CNV,目前也适用于NGS数据
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
# GISTIC objectlaml.gistic
## An object of class GISTIC
## ID summary
## 1: Samples 191
## 2: nGenes 2622
## 3: cytoBands 16
## 4: Amp 388
## 5: Del 26481
## 6: total 26869
gisticChromPlot(gistic = laml.gistic, markBands = "all")
gisticBubblePlot(gistic = laml.gistic)
gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")plotCBSsegments(cbsFile = tcga.ab.009.seg)
## NULL
突变事件存在互斥或共存的 (Exclusive / Co-occurring)的基因对
* 三角矩阵,绿色表示共现 (Co-occurring) 趋势;金色表示排它性 (Exclusiveness) 趋势
# exclusive/co-occurance event analysis on top 10 mutated genessomaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))
鉴定肿瘤Driver/驱动基因
* maftools的1个函数oncodrive。根据oncodriveCLUST算法进行运算
* 此概念基于:大多数致癌基因的变异在少数特定位点(即热点)的富集
laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')head(laml.sig)
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
# Protein summary (Printing first 7 columns for display convenience)laml.pfam$proteinSummary[,1:7, with = FALSE]
# Domain summary (Printing first 3 columns for display convenience)laml.pfam$domainSummary[,1:3, with = FALSE]
# Survival analysis based on grouping of DNMT3A mutation statusmafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)
## DNMT3A
## 48
## Group medianTime N
## 1: Mutant 245 45
## 2: WT 396 137
利用前20个突变基因,识别1组基因(个数为2),来预测预后不良的分组 (Poor prognostic groups)
Hide
# Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groupsprog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)print(prog_geneset)
## Gene_combination P_value hr WT Mutant
## 1: FLT3_DNMT3A 0.00104 2.510 164 18
## 2: DNMT3A_SMC3 0.04880 2.220 176 6
## 3: DNMT3A_NPM1 0.07190 1.720 166 16
## 4: DNMT3A_TET2 0.19600 1.780 176 6
## 5: FLT3_TET2 0.20700 1.860 177 5
选择显著的,绘制生存曲线
mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")
## Group medianTime N
## 1: Mutant 242.5 18
## 2: WT 379.5 164
队列比较:比较两个队列的MAF
#Primary APL MAFprimary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")primary.apl = read.maf(maf = primary.apl)# Relapse APL MAFrelapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")relapse.apl = read.maf(maf = relapse.apl)
# Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single samplept.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)print(pt.vs.rt)
## $results
## Hugo_Symbol Primary Relapse pval or ci.up ci.low
## 1: PML 1 11 1.529935e-05 0.03537381 0.2552937 0.000806034
## 2: RARA 0 7 2.574810e-04 0.00000000 0.3006159 0.000000000
## 3: RUNX1 1 5 1.310500e-02 0.08740567 0.8076265 0.001813280
## 4: FLT3 26 4 1.812779e-02 3.56086275 14.7701728 1.149009169
## 5: ARID1B 5 8 2.758396e-02 0.26480490 0.9698686 0.064804160
队列之间的差异突变基因 (Differentially mutated genes between two cohorts) 显示为森林图 (Forest plot)
* 条形表示比值比 (Odds ratio) 的95%置信区间
* 图中相邻的2列数值:在被突出显示的基因的突变,在两组样本队列中的样本数量
* P值:(∗∗∗) P<0.001;(∗∗) P<0.01;费舍尔精确检验
forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1, color = c('royalblue', 'maroon'), geneFontSize = 0.8)
指定基因集,并排画2个oncoplot,进行队列比较
* 例如:涉及某些信号通路 (的某些基因),其相关基因的*突变途径 (Mutated pathways)*
* 与这些途径相关的基因,是否优先在其中1个队列中富集,是否以互斥的方式突变?
* eg: Mutated pathways involving genes associated with VGSC and ERBB signaling in EAC (食管腺癌).
* Genes associated with these pathways are preferentially enriched in EAC, mutated in a mutually exclusive manner.
genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)
并排画两个条形图,用于不同队列或分组的比较
coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")
并排画两个叠加的、蛋白突变及结构域图,用于单个蛋白突变形式的、不同队列或分组的比较
lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")
每个基因在某个分组中,突变差异显著性的比较
fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')
##
## M0 M1 M2 M3 M4 M5 M6 M7
## 19 44 44 21 39 19 3 3
# Results are returned as a list. Significant associations p-value < 0.05fab.ce$groupwise_comparision[p_value < 0.05]
Hugo_Symbol <chr> | Group1 <chr> | Group2 <chr> | n_mutated_group1 <chr> | n_mutated_group2 <chr> | p_value <dbl> | OR_low <dbl> | |
---|---|---|---|---|---|---|---|
IDH1 | M1 | Rest | 11 of 44 | 7 of 149 | 0.0002597371 | 0 | |
TP53 | M7 | Rest | 3 of 3 | 12 of 190 | 0.0003857187 | 0 | |
DNMT3A | M5 | Rest | 10 of 19 | 38 of 174 | 0.0057610493 | 0 | |
CEBPA | M2 | Rest | 7 of 44 | 6 of 149 | 0.0117352110 | 0 | |
RUNX1 | M0 | Rest | 5 of 19 | 11 of 174 | 0.0117436825 | 0 | |
NPM1 | M5 | Rest | 7 of 19 | 26 of 174 | 0.0248582372 | 0 | |
CEBPA | M1 | Rest | 6 of 44 | 7 of 149 | 0.0478737468 | 0 |
plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05)
检查药物-基因相互作用和药物类别
dgi = drugInteractions(maf = laml, fontSize = 0.75)
dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)
## Number of claimed drugs for given genes:
## Gene N
## 1: DNMT3A 7
# Printing selected columnsdnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]
OncogenicPathways(maf = laml)
PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")
Variant Allele Frequencies, VAF
# Heterogeneity in sample TCGA.AB.2972# install.packages("mclust")# library("mclust")# tsb参数:指定必须为其进行聚类的样本名称 (tumor_sample_barcode) # specify sample names (Tumor_Sample_Barcodes) for which clustering has to be donetable('TCGA-AB-2972' %in% laml@data$Tumor_Sample_Barcode)
##
## TRUE
## 1
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')
print(tcga.ab.2972.het$clusterMeans)
## Tumor_Sample_Barcode cluster meanVaf
## 1: TCGA-AB-2972 2 0.4496571
## 2: TCGA-AB-2972 1 0.2454750
## 3: TCGA-AB-2972 outlier 0.3695000
# Visualizing resultsplotClusters(clusters = tcga.ab.2972.het)
读入CBS segmented的CNV文件
列名应该是Sample,染色体,Start, End, Num_Probes, Segment_Mean (log2 scale度)
seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')
## Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1: PHF6 23 133551224 133551224 TCGA-AB-3009
## t_vaf Segment_Start Segment_End Segment_Mean CN
## 1: 0.9349112 NA NA NA NA
# Visualizing results. Highlighting those variants on copynumber altered variantsplotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)
突变特征 (Mutational Signatures)
Nature (2013, doi:10.1038/nature12477)中的定义:
* 所有的癌症都是由体细胞突变引起的 (All cancers are caused by somatic mutations),目前对于该生物学过程的理解有限。而来自癌症基因组的体细胞突变目录 (Catalogue),具有已经起作用的突变过程的特征 (The catalogue of somatic mutations from a cancer genome bears the signatures of the mutational processes that have been operative)。
* 即:细胞在成长过程中基因组不断受到内源性和外源性DNA损伤的威胁,不断发生变化,最终发生一些突变的积累;每一个突变过程都会留下一个不同的基因组标记,称为"Mutational Signatures"。
* 通过分析来自7,042个癌症(Primary cancers of 30 different classes; 507 from WGS and 6,535 from WES)的4,938,362个突变,并提取了20多种不同的突变特征。
* 一些存在于许多癌症类型中,特别是胞苷脱氨酶APOBEC家族的特征,而其它局限于单一的癌症类型。
* 某些特征与癌症诊断时患者的年龄、已知的诱变暴露或DNA维持缺陷有关,但许多特征是未知的。
* 除了这些全基因组突变特征外,在许多癌症类型中还发现了位于小基因组区域“kataegis”的超突变。
* 突变特征的研究,揭示了癌症发展背后的突变过程的多样性,对了解癌症病因、预防和治疗具有潜在意义。
APOBEC富集
* 在突变位点的两侧提取单个5'和3'碱基,进行de-novo特征分析,同时估计APOBEC富集分数。
* APOBEC3G蛋白属APOBEC家族成员,又称为CEM15蛋白,是一种胞嘧啶脱氨酶。
* APOBEC家族成员均被证明对单链DNA/RNA上的胞苷有脱氨基作用,可通过诱发靶基因上的胞嘧啶脱氨基成为尿嘧啶(C>U) 参与大部分基因突变。
# Requires BSgenome objectlibrary(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
UCSC提供的人全基因组序列(eg: hg19 / GRCh37.p13),存储在Biostrings对象中
* 这个BSgenome数据包由以下源数据文件制成: hg19.2bit (二进制,下载自:https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/latest/ , March 24, 2020)
laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in 3.315 % of samples (APOBEC enrichment score > 2 ; 6 of 181 samples)
## -Creating mutation matrix
## --matrix of dimension 188x96
绘图展示APOBEC富集和非APOBEC富集样本之间的差异
plotApobecDiff(tnm = laml.tnm, maf = laml, pVal = 0.2)
## $results
## Hugo_Symbol Enriched nonEnriched pval or ci.up
## 1: TP53 2 13 0.08175632 5.9976455 46.608861
## 2: TET2 1 16 0.45739351 1.9407002 18.983979
## 3: FLT3 2 45 0.65523131 1.4081851 10.211621
## 4: DNMT3A 1 47 1.00000000 0.5335362 4.949499
## 5: ADAM11 0 2 1.00000000 0.0000000 164.191472
# Plots differences between APOBEC enriched and non-APOBEC enriched samples
根据共表型相关度量 (Cophenetic correlation metric) 估计癌症特征(Signatures)的数量
可使用最大至6个Signatures
# library('NMF')# laml.sign = estimateSignatures(mat = laml.tnm, nTry = 6)# Run main function with maximum 6 signatureslaml.sign = estimateSignatures(mat = laml.tnm, nTry = 6, pConstant = 0.1, plotBestFitRes = FALSE, parallel = 2)
## -Running NMF for 6 ranks
## Compute NMF rank= 2 ... + measures ... OK
## Compute NMF rank= 3 ... + measures ... OK
## Compute NMF rank= 4 ... + measures ... OK
## Compute NMF rank= 5 ... + measures ... OK
## Compute NMF rank= 6 ... + measures ... OK
## -Finished in 00:01:00 elapsed (12.0s cpu)
plotCophenetic(res = laml.sign)
COSMIC数据库
* COSMIC (Catalogue Of Somatic Mutations In Cancer, 癌症体细胞突变目录) 数据库记录了与人类癌症相关的驱动基因,搜集和记录了大量文献中的数百万肿瘤样本的编码区变异,以及促进癌症的所有体细胞突变的非编码区变异、基因融合、拷贝数变异和耐药基因变异。
* COSMIC采用人工审核的方式扩充数据库,实时追踪文献的最新信息,将与癌症相关的重要基因优先放入数据库中。
* COSMIC-3D可研究三维蛋白结构中的突变,及其对蛋白功能和药物的影响。
# laml.sig = extractSignatures(mat = laml.tnm, n = 3)laml.sig = extractSignatures(mat = laml.tnm, n = 3, pConstant = 0.1, parallel = 2)
与原始的 30 个 signatures 比较 (COSMIC signatures)
# laml.og30.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "legacy")
与更新的 60 个 signatures 比较 (version3, COSMIC signatures)
# laml.v3.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "SBS")# save(file = "compareSignatures.RData", list = ls())
rm(list = ls())load("compareSignatures.RData")pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")
pheatmap::pheatmap(mat = laml.v3.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")
maftools::plotSignatures(nmfRes = laml.sig, title_size = 0.8, sig_db = "SBS")
3D条形图展示第1个Signature
# Visualize first signaturesig1 = laml.sig$signatures[,1]barplot3d::legoplot3d(contextdata = sig1, labels = FALSE, scalexy = 0.01, sixcolors = "sanger", alpha = 0.5)# colnames(laml.sign$contributions) = as.character(getSampleSummary(x = laml)[,Tumor_Sample_Barcode])[1:187]
laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig)
##
## Signature_1 Signature_2 Signature_3
## 60 65 63
plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)
var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')
## -Reading annovar files
## -Processing Exonic variants
## --Adding Variant_Classification
## --Parsing aa-change
## -Processing Non-exonic variants
## --Adding Variant_Classification
## -Adding Variant_Type
## -Converting Ensemble Gene IDs into HGNC gene symbols
## --Done. Original ensemble gene IDs are preserved under field name ens_id
## Finished in 1.100s elapsed (0.210s cpu)
# Read sample ICGC data for ESCAesca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)
## --Removed 427 duplicated variants
#Printing first 16 columns for display convenienceprint(esca.maf[1:5,1:16, with = FALSE])
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## 1: AC005237.4 NA NA GRCh37 2 241987787
## 2: AC061992.1 786 NA GRCh37 17 76425382
## 3: AC097467.2 NA NA GRCh37 4 156294567
## 4: ADAMTS12 NA NA GRCh37 5 33684019
## 5: AL589642.1 54801 NA GRCh37 9 32630154
基因名称 (如曾用名) 的转换
laml.mutsig.corrected = prepareMutSig(maf = laml)# Converting gene names for 1 variants from 1 genes# Hugo_Symbol MutSig_Synonym N# 1: ARHGAP35 GRLF1 1# Original symbols are preserved under column OG_Hugo_Symbol.
输出本分析流程中,各个软件的版本
# devtools::install_github(repo = "PoisonAlien/TCGAmutations")sessionInfo()
maftools : Summarize, Analyze and Visualize MAF Files
Anand Mayakonda 2022-04-26
https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html
https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial
ANNOVAR
https://annovar.openbioinformatics.org/en/latest/
示例数据:TCGA的 LAML 队列
https://mp.weixin.qq.com/s/zttX7tKl9uVPLtHrXqmRYg