一文学会R包 maftools 简介使用 by 生信宝典

简介

随着癌症基因组学的进步，突变注释格式 (Mutation Annotation Format, MAF) 被广泛接受，并用于存储检测到的体细胞变异。

... ...

GDC数据库的MAF文件的一些列

https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

癌症基因组图谱项目 (The Cancer Genome Atlas, TCGA) 已对 30 多种不同的癌症进行了测序，每种癌症类型的样本量超过 200 个。TCGA的数据收录在GDC (Genomic data commons) 库中，由体细胞变异组成的结果数据以 MAF 文件格式存储。

GDC对体细胞突变MAF文件的处理过程

The process for modifying a protected MAF into a somatic MAF

https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF Files and are generated on a project-level. MAF files are produced through the Somatic Aggregation Workflow. The GDC produces MAF files at two permission levels: protected and somatic (or open-access). One MAF file is produced per variant calling pipeline per GDC project. MAFs are produced by aggregating the GDC annotated VCF files generated from one pipeline for one project.

只要数据为 MAF 格式，maftools包将尝试以有效的方式从 TCGA 或任何内部研究中总结、分析、注释和可视化 MAF 文件。

引用

Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch. PMID: 30341162

论文

https://genome.cshlp.org/content/28/11/1747.full.pdf+html

Anand Mayakonda, http://poisonalien.github.io/

论文 fig 1. maftools概述

表头描述可用的模块，即：可视化、设置操作、变异注释和分析。maftools包中为每个模块提供了一个小说明和相应的函数 (粗体、斜体)。典型的工作流：从MAF对象创建开始，要么读取MAF文件，要么将现有的注释转换为MAF对象，进而将其作为输入传递给所需的函数 (箭头)。

1) 可视化模块 (Visualization)：包括从输入的MAF对象生成可发表的图的函数；

2) 分析模块 (Analysis)：提供了在癌症基因组学中执行常用分析的函数；

3) 变异注释模块 (Variant annotation)：使用 oncontator API 执行变异注释，对ANNOVAR等程序生成的注释进行格式转换；

4) 设置操作 (Set operation )：包括根据用户自定义，对MAF对象 取子集 的函数。

论文 fig 2. maftools可视化模块生成的关键图

（A）显示ESCA (Esophageal carcinoma, 食管癌) 队列的肿瘤体细胞突变景观图 (Oncoplot displaying the somatic landscape)。基因根据其突变频率 (Mutation frequency) 排序，样本根据疾病组织学 (Histology) 排序，如注释栏（此图的底部）所示。侧栏图：显示了 MutSigCV (接受来自多个样本的WGS/WES数据，以及有关点突变、小插入/缺失和覆盖率的信息，并识别出比偶然预期的突变频率更高的基因) 估计的 log10 变换的Q值。

（B） 转换和颠换图 (Transition and transversion plot, Ti/Tv) 显示了ESCA中SNV的分布，分为 6 个转换和颠换事件。堆积条形图（底部）显示了MAF文件中每个样本的突变谱分布。

（C） 棒棒糖图 (Lollipop plot) 显示了ESCA中 TP53 的突变在蛋白上的分布、及蛋白结构域。图中标记了反复出现的热点突变 (Recurrent hotspots)。图的标题和副标题：体细胞突变率 (Somatic mutation rate) 和转录本名称。

（D） TCGA乳腺癌单个样本 TCGA-A8-A08B 的降雨图/雨点图 (Rainfall plot)。每个点都是根据SNV的类别 (转换和颠换，Ti/Tv) 进行颜色编码。通过变点法 (Change-point method) 识别的超高突变基因组片段，用黑色箭头突出显示。

论文 fig 3. 突变特征和APOBEC富集分析

（A，B）分别在ESCC和EAC (食管鳞癌和食管腺癌) 中鉴定的突变特征 (Mutational signature)。Y轴表示：96个三核苷酸基序 (6*4*4=96：每1个三联体中，6种Ti/Tv中的每1种已确定，剩余2个核苷酸可随机组合) 所展示的整体特征 (Exposure of 96 trinucleotide motifs to overall signature)。图的标题显示了与验证的 COSMIC 特征和 cosine 相似值，以及所建议的原因 (Proposed etiology) 的最佳匹配。

突变特征 (Mutational Signatures) 及APOBEC等富集

突变特征 (Mutational Signatures)，可参考 Nature (2013, doi:10.1038/nature12477) 中的定义。所有的癌症都是由体细胞突变引起的 (All cancers are caused by somatic mutations)，目前对于该生物学过程的理解有限。而来自癌症基因组的体细胞突变目录具有已经起作用的突变过程的特征 (The catalogue of somatic mutations from a cancer genome bears the signatures of the mutational processes that have been operative)。即：细胞在成长过程中基因组不断受到内源性和外源性DNA损伤的威胁，不断发生变化，最终发生一些突变的积累；每一个突变过程都会留下一个不同的基因组标记，称为"Mutational Signatures"。通过分析来自7,042个癌症 (Primary cancers of 30 different classes; 507 from WGS and 6,535 from WES)的4,938,362个突变，并提取了20多种不同的突变特征。一些特征存在于许多癌症类型中，特别是胞苷脱氨酶APOBEC家族的特征，而其它局限于单一的癌症类型。某些特征与癌症诊断时患者的年龄、已知的诱变暴露或DNA维持缺陷有关，但许多特征是未知的。除了这些全基因组突变特征外，在许多癌症类型中还发现了位于小基因组区域“kataegis”的超突变。突变特征的研究，揭示了癌症发展背后的突变过程的多样性，对了解癌症病因、预防和治疗具有潜在意义。

在突变位点的两侧提取单个5'和3'碱基 (形成：三核苷酸基序 / Trinucleotide motifs)，进行de-novo特征分析，同时估计APOBEC富集分数。APOBEC3G蛋白属APOBEC家族成员，又称为CEM15蛋白，是一种胞嘧啶脱氨酶；APOBEC家族成员均被证明对单链DNA/RNA上的胞苷有脱氨基作用，可通过诱发靶基因上的胞嘧啶脱氨基成为尿嘧啶(C>U) 参与大部分基因突变，从而形成1种特殊的突变特征 (Mutational signature)，体现为图中6种不同颜色柱形图的特征。“Defective DNA mismatch repair”形成另1种突变特征。

COSMIC (Catalogue Of Somatic Mutations In Cancer, 癌症体细胞突变目录) 数据库记录了与人类癌症相关的驱动基因，搜集和记录了大量文献中的数百万肿瘤样本的编码区变异，以及促进癌症的所有体细胞突变的非编码区变异、基因融合、拷贝数变异和耐药基因变异。COSMIC采用人工审核的方式扩充数据库，实时追踪文献的最新信息，将与癌症相关的重要基因优先放入数据库中。COSMIC-3D可研究三维蛋白结构中的突变，及其对蛋白功能和药物的影响。

（C）来自ESCC中APOBEC富集分析的结果。箱线图显示了APOBEC富集样本和非富集样本之间的突变负荷差异：(∗∗∗) Wilcoxon秩和检验；P<0.001。甜甜圈图显示了tCw背景 (tCw context) 下的突变比例。条形图显示了APOBEC富集和非APOBEC富集样品之间的、前10个差异突变基因 (Top 10 differentially mutated genes between APOBEC-enriched and non-APOBEC enriched samples) ：(∗∗∗) P<0.001；(∗∗) P<0.01；(∗) P<0.05，Fisher精确检验。

（D）与DNA错配修复特征缺陷 (Defective MMR, COSMIC Signature 6) 相关的ESCC样本中富集的 TGFBR2 突变。Enriched TGFBR2 mutations in ESCC among samples associated with defects in DNA mismatch repair signature (16% vs. 0% Rest). 样本数量分别为：56 vs. 40，即56个样本符合COSMIC Signature 6 (即Defective MMR) 的特征。

TGFBR2 (TGF-Beta Receptor Type-2) 基因编码的蛋白质是一种跨膜蛋白，具有蛋白激酶结构域，与TGF-β受体1型 (TGF-beta receptor type-1) 形成异二聚体复合物，并结合TGF-β。这种受体/配体复合物可使蛋白磷酸化，然后进入细胞核并调节与细胞增殖、细胞周期停滞、伤口愈合、免疫抑制和肿瘤发生相关的基因的转录。

（E）棒棒糖图显示 TGFBR2 中的突变分布，及其 BETR2 结构域中的截断突变 (Truncating mutations：frameshift deletions 和 splice site)。

论文 fig 4. 队列比较和蛋白结构域的富集分析

（A） EAC和ESCC之间的差异突变基因 (Differentially mutated genes between EAC and ESCC) 显示为森林图 (Forest plot)。条形表示优势比 (Odds ratio) 的95%置信区间。右侧的表格包括：在被P值突出显示的基因的突变中，各自在EAC和ESCC中的样本数量。P值表示阈值的重要性：(∗∗∗) P<0.001；(∗∗) P<0.01；费舍尔精确检验。

（B）EAC和ESCC中，涉及与VGSC (电压门钠离子通道) 和ERBB信号相关基因的突变途径 (Mutated pathways)。可见：与这些途径相关的基因优先在EAC中富集，并以互斥的方式突变。

基因突变的互斥性与共现性

按照协同和互斥的作用模式，可将突变分为以下两类：1. Co-occurencing mutations；2. Mutually exclusive mutations。互斥的驱动基因往往共享相同的通路 (Pathway) 或不同进展途径 (Different progression pathways，如不同肿瘤类型) 中的基因。

肿瘤中已存在的基因突变可能影响 (或在功能上协同于) 其它基因的突变。与互斥模式相反，一些驱动基因之间也可能存在协同模式，两个协同的驱动突变往往同时或相继发生，共同促进肿瘤的发展。

相关性角度：Co-occurrence and mutual exclusivity (COME) of DNA mutation refer to two or more genes that tend to be positively or negatively correlated in DNA mutation among different samples.

功能协作性角度：共发生的驱动基因 (Co-occurring driver genes) 通常会激活互相合作、协作的致癌途径 (Collaborating oncogenic pathways)。PMID: 34031014, DOI: 10.1016/j.trecan.2021.04.009。互斥发生的驱动基因，二者可相互独立地导致对某个生物过程的影响，而无协同与互作的必要。

信号通路的角度：共发生的突变往往涉及平行信号通路 (Parallel signaling pathways) 中的基因；而互斥的突变可能涉及共同通路 (Common pathway，如同一个家族中的基因，但也不绝对)，或不同进展途径中的基因 (Different progression pathways，如不同肿瘤类型中的基因)。互斥也可能涉及合成致死基因 (Genes that are synthetically lethal)。不过这些解释只是"hand waving arguments"。参与肿瘤发生的生物途径复杂、且相互关联，很难定义信号途径的边界，以及平行或共同途径的概念。Cancer Res (2015) 75 (19): 4042–4052. https://doi.org/10.1158/0008-5472.CAN-15-0602

（C，D）分别在EAC和ESCC中频繁突变的 pfam 蛋白结构域。突出显示了前10个域。气泡大小与包含突出显示的结构域的基因数量成比例。Ion_ trans 结构域在EAC中大量突变。

论文 fig 5. 途径/Pathway和临床富集分析

（A） AML中互斥和共存的 (Exclusive and co-occurring) 基因对显示为三角矩阵。绿色表示共现 (co-occurring) 趋势，而粉色表示排它性 (Exclusiveness) 趋势。

（B）通过CoMEt精确检验，发现AML中存在显著改变的途径，涉及NPM1、RUNX1和TP53基因，以相互排斥的方式突变（P<0.001）。

（C） TCGA AML队列中由 oncodrive 鉴定的疾病相关驱动基因（FDR<0.1）。括号内突出显示了紧密间隔突变簇 (Closely spaced mutational clusters) 的数量。

（D） MutSigCV 鉴定的ESCA中显著突变基因与泛癌驱动基因的泛癌比较。TGFBR2和ZNF750仅在食管癌中发生突变，而其它驱动因素，如TP53和CDKN2A，在整个泛癌队列中发生突变。

（E，F）条形图分别显示了基因与临床特征、FAB（French–American–British）分类和年龄组之间的相关性（P<0.05，Fisher精确检验）。条形图用突变样本与总样本的比率进行注释。误差条显示二项比率的95% CI。Y轴表示与表型相关的样本分数 (Fraction of samples associated with the phenotype)。

安装

rm(list = ls())
# 本机R的版本# [64-bit] C:\Program Files\R\R-4.0.4
# 本机R包的位置.libPaths()# "C:/Users/HP/Documents/R/win-library/4.0" "C:/Program Files/R/R-4.0.4/library"
# 设置R包的 install.packages 的下载镜像site = "https://mirrors.tuna.tsinghua.edu.cn/CRAN"
if (!requireNamespace("BiocManager", quietly = TRUE))  install.packages("BiocManager", repos = site)
current_packages = rownames(installed.packages())
needed_packages = c("maftools", "BSgenome.Hsapiens.UCSC.hg38", "BSgenome.Hsapiens.UCSC.hg19")
for (i in needed_packages) {  if (!i %in% current_packages)    BiocManager::install(i, update = F, site_repository=site)  current_packages = rownames(installed.packages())}
needed_packages2 <- c("mclust", "NMF", "barplot3d", "knitr")
for (j in needed_packages2) {  if (!j %in% current_packages)    install.packages(j)    current_packages = rownames(installed.packages())}
library(maftools)library(mclust)library(NMF)library(pheatmap)library(barplot3d)library(openxlsx)

帮助

# R??maftools

R包的帮助

maftools包的整体介绍

maftools包的示例代码

输入文件

生成 MAF 文件

1. 对于 VCF 文件或一些表格，可简单地使用 vcf2maf 程序 (链接见文末)：注释 VCF，排序或设置转录本的优先级，并生成 MAF 文件。
2. GATK的更新版本，如v4.0版，可使用 gakt funcotator (链接见文末) 生成 MAF 文件。
3. 如果使用 ANNOVAR (链接见文末) 进行变异注释，maftools 有一个方便的函数 annovarToMaf ，用于将表格形式的 annovar 输出文件，转换为 MAF。

MAF文件中需要的字段 (Field，即：文件中的“列”)

MAF 文件包含许多字段，从染色体名称，到 cosmic 注释等。但 maftools 中的大多数分析都使用以下字段。

必须的字段：Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type 和 Tumor_Sample_Barcode
推荐的可选字段：包含 VAF（Variant Allele Frequency）和 氨基酸变化 信息的非 MAF 特定字段 (non MAF specific fields)。

MAF 文件的完整文件规范，可以在 NCI GDC 文档页面上找到 (链接见文末)。

本文的对maftools的介绍，使用了 TCGA的 LAML 队列 (链接见文末)，用其MAF 文件作为示例，展示了 maftools 的使用及应用。

资料来源

maftools : Summarize, Analyze and Visualize MAF Files

Anand Mayakonda 2022-04-26

https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

Package ‘maftools’ August 11, 2022

https://master.bioconductor.org/packages/release/bioc/manuals/maftools/man/maftools.pdf

Mutation Annotation Format, MAF

https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

vcf2maf 程序

https://github.com/mskcc/vcf2maf

gakt funcotator 程序

https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial

ANNOVAR

https://annovar.openbioinformatics.org/en/latest/

示例数据：TCGA的 LAML 队列

Cancer Genome Atlas Research, N. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-74 (2013).

初始化，准备R包

rm(list = ls())
# 本机R的版本
# [64-bit] C:\Program Files\R\R-4.0.4
# 本机R包的位置
.libPaths()

## [1] "C:/Users/HP/Documents/R/win-library/4.0"
## [2] "C:/Program Files/R/R-4.0.4/library"

# "C:/Users/HP/Documents/R/win-library/4.0" "C:/Program Files/R/R-4.0.4/library"# 设置R包的 install.packages 的下载镜像site = "https://mirrors.tuna.tsinghua.edu.cn/CRAN"

if (!requireNamespace("BiocManager", quietly = TRUE))  install.packages("BiocManager", repos = site)current_packages = rownames(installed.packages())

needed_packages <- c("maftools", "BSgenome.Hsapiens.UCSC.hg38", "BSgenome.Hsapiens.UCSC.hg19")

for (i in needed_packages) {  if (!i %in% current_packages)    BiocManager::install(i, update = F, site_repository=site)  current_packages = rownames(installed.packages())}

needed_packages2 <- c("mclust", "NMF", "barplot3d", "knitr")

for (j in needed_packages2) {  if (!j %in% current_packages)    install.packages(j)    current_packages = rownames(installed.packages())}

library(maftools)
library(mclust)
library(NMF)
library(pheatmap)
library(barplot3d)

查看帮助文档

??maftools

读入maf文件及临床信息

设置：maftools包自带的、示例数据的位置，即TCGA LAML MAF文件。本机在：

* C:\Users\HP\Documents\R\win-library\4.0\maftools\extdata\tcga_laml.maf.gz

* 共2208行，表头、字段包含：

* Hugo_Symbol   Entrez_Gene_Id  Center  NCBI_Build  Chromosome  Start_Position
* End_position  Strand  Variant_Classification  Variant_Type    Reference_Allele
* Tumor_Seq_Allele1  Tumor_Seq_Allele2  Tumor_Sample_Barcode
* Protein_Change       i_TumorVAF_WU        i_transcript_name

laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')

设置：患者生存信息、组织学 (Histology)分类等的临床信息。该文件的可选的 (Optional)

* 共200个样本：

* Tumor_Sample_Barcode FAB_classification   days_to_last_followup  Overall_Survival_Status
* TCGA-AB-2802      M4   365     1
* TCGA-AB-2803      M3   792     1
* TCGA-AB-2804      M3   2557    0
... ...

laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')

读入MAF文件(及临床信息)

laml.maf

## [1] "C:/Users/HP/Documents/R/win-library/4.0/maftools/extdata/tcga_laml.maf.gz"

laml.clin

## [1] "C:/Users/HP/Documents/R/win-library/4.0/maftools/extdata/tcga_laml_annot.tsv"

laml = read.maf(maf = laml.maf, clinicalData = laml.clin)

## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## -Finished in 1.900s elapsed (0.360s cpu)

Silent variants：包含同义突变/Synonymous

沉默突变即同义突变，Silent mutation；同义突变的，Synonymous
    syn_vars  = laml_dt[Variant_Classification %in% "Silent"]

非同义突变，non-Synonymous mutation
    nsyn_vars = laml_dt[Variant_Classification %in% "Missense_Mutation"]

# str(laml)slotNames(laml)

## [1] "data"                           "variants.per.sample"           
## [3] "variant.type.summary"           "variant.classification.summary"
## [5] "gene.summary"                   "summary"                       
## [7] "maf.silent"                     "clinical.data"

注意：

* read.maf函数默认将 Silent variants 拆分出来了，放在 laml@maf.silent 中，而非 laml@data 中

table(laml@data$Variant_Type)

## 
##  DEL  INS  SNP 
##   64  141 1527

table(laml@data$Variant_Classification)

## 
##   Frame_Shift_Del   Frame_Shift_Ins      In_Frame_Del      In_Frame_Ins 
##                52                91                10                42 
## Missense_Mutation Nonsense_Mutation       Splice_Site 
##              1342               103                92

laml@maf.silent

MAF文件概要

laml

## An object of class  MAF 
##                    ID          summary  Mean Median
##  1:        NCBI_Build               37    NA     NA
##  2:            Center genome.wustl.edu    NA     NA
##  3:           Samples              193    NA     NA

# Shows sample summrygetSampleSummary(laml)

# Shows gene summarygetGeneSummary(laml)

# shows clinical data associated with samplesgetClinicalData(laml)

# Shows all fields in MAFgetFields(laml)

##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"         "Center"                
##  [4] "NCBI_Build"             "Chromosome"             "Start_Position"        
##  [7] "End_Position"           "Strand"                 "Variant_Classification"
## [10] "Variant_Type"           "Reference_Allele"       "Tumor_Seq_Allele1"     
## [13] "Tumor_Seq_Allele2"      "Tumor_Sample_Barcode"   "Protein_Change"        
## [16] "i_TumorVAF_WU"          "i_transcript_name"

将MAF文件的总结结果写到文件中

# Writes maf summary to an output file with basename lamlwrite.mafSummary(maf = laml, basename = 'laml')

对MAF对象取子集

属于设置操作 (Set operation)：根据用户自定义，对MAF对象取子集

取部分样本

# Extract data for samples 'TCGA.AB.3009' and 'TCGA.AB.2933' (Printing just 5 rows for display convenience)subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'), mafObj = FALSE)[1:5]

# Same as above but return output as an MAF object (Default behaviour)subsetMaf(maf = laml, tsb = c('TCGA-AB-3009', 'TCGA-AB-2933'))

## An object of class  MAF 
##                    ID          summary Mean Median
##  1:        NCBI_Build               37   NA     NA
##  2:            Center genome.wustl.edu   NA     NA
##  3:           Samples                2   NA     NA
##  4:            nGenes               34   NA     NA
##  5:   Frame_Shift_Ins                5  2.5    2.5
##  6:      In_Frame_Ins                1  0.5    0.5
##  7: Missense_Mutation               26 13.0   13.0
##  8: Nonsense_Mutation                2  1.0    1.0
##  9:       Splice_Site                1  0.5    0.5
## 10:             total               35 17.5   17.5

部分基因，部分变异类型

# Select all Splice_Site mutations from DNMT3A and NPM1subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE,query = "Variant_Classification == 'Splice_Site'")

是否生成新的 MAF class

laml.sub.1 = subsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = TRUE,query = "Variant_Classification == 'Splice_Site'")

部分基因，部分变异类型，部分字段

# Same as above but include only 'i_transcript_name' column in the outputsubsetMaf(maf = laml, genes = c('DNMT3A', 'NPM1'), mafObj = FALSE, query = "Variant_Classification == 'Splice_Site'", fields = 'i_transcript_name')

根据某个临床信息筛选maf子集：clinQuery参数

# Select all samples with FAB clasification M4 in clinical data laml_m4 = subsetMaf(maf = laml, clinQuery = "FAB_classification %in% 'M4'")

MAF汇总图

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

体细胞突变景观图

显示样本队列的体细胞突变景观图 (Oncoplot displaying the somatic landscape of sample cohort)

* 基因根据其突变频率 (Mutation frequency) 排序。

* 样本(可选)根据疾病组织学 (Histology) 排序。提现到注释栏（图的底部）所示。

* 侧栏图(可选)：可显示了 MutSigCV 估计的 log10 变换的Q值。
    * MutSigCV接受来自多个样本的WGS/WES数据，以及有关点突变、小插入/缺失和覆盖率的信息，并识别出比人们偶然预期的突变频率更高的基因

# oncoplot for top ten mutated genes.oncoplot(maf = laml, top = 10)

转换和颠换变异分布

转换和颠换图 (Transition and transversion plot, Ti/Tv)显示了样本队列中SNV的分布

* 分为 6 个转换和颠换事件

* 堆积条形图（底部）显示了MAF文件中每个样本的突变谱分布 (Distribution of mutation spectra)

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)# plot titv summaryplotTiTv(res = laml.titv)

突变及结构域在蛋白上的分布

棒棒糖图 (Lollipop plot) 显示了样本队列中某个基因的突变在蛋白上的分布，以及蛋白结构域。

* 图中可标记了*反复出现的热点突变* (Recurrent hotspots)

* 体细胞突变率 (Somatic mutation rate) 和转录本名称分别由图的标题和副标题表示

* 该函数假设了蛋白质变化信息存储在'Protein_Change'列下

* 如果需要，可使用参数 AACol 重新定义

1个基因经常包含多个不同的转录本：

* 可使用 refSeqID 或 proteinID 参数自定义某个特定的转录本名称
* # HGNC refseq.ID protein.ID aa.length
* # 1: DNMT3A  NM_175629  NP_783328  912
* # 2: DNMT3A  NM_022552  NP_072046  912
* # 3: DNMT3A  NM_153759  NP_715640  723
* # Using longer transcript NM_175629 for now.

当某个突变的氨基酸位置，未出现在上述转录本的全长中，程序提示：

* Removed 3 mutations for which AA position was not available

# lollipop plot for one gene, which is one of the most frequent mutated gene in tumorlollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change',             showMutationRate = TRUE)

##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629  NP_783328       912
## 2: DNMT3A NM_022552  NP_072046       912
## 3: DNMT3A NM_153759  NP_715640       723

是否将蛋白结构域名称作为 Legend 展示

* 'labelPos'参数用于指定想要标记的突变

lollipopPlot(maf = laml, gene = 'DNMT3A', showDomainLabel = FALSE,             labelPos = 882)

##      HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_175629  NP_783328       912
## 2: DNMT3A NM_022552  NP_072046       912
## 3: DNMT3A NM_153759  NP_715640       723

读入另一组实验数据

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")brca = read.maf(maf = brca, verbose = FALSE)table(brca@data$Tumor_Sample_Barcode)

## 
## TCGA-A8-A08B 
##           31

单样本Ti/Tv雨点图

显示某单个样本的、高突变基因组区域的降雨图/雨点图 (Rainfall plot)

* 每个点都是根据SNV的类别进行颜色编码

* 通过变点法 (Change-point method) 识别的超高突变基因组片段，可用黑色箭头突出显示

rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.6)

##    Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1:          8       98129348     98133560     7               702.0000 4212
## 2:          8       98398549     98403536     9               623.3750 4987
## 3:          8       98453076     98456466     9               423.7500 3390
## 4:          8      124090377    124096810    22               306.3333 6433
## 5:         12       97436055     97439705     7               608.3333 3650
## 6:         17       29332072     29336153     8               583.0000 4081
##    Tumor_Sample_Barcode C>G C>T
## 1:         TCGA-A8-A08B   4   3
## 2:         TCGA-A8-A08B   1   8
## 3:         TCGA-A8-A08B   1   8
## 4:         TCGA-A8-A08B   1  21
## 5:         TCGA-A8-A08B   4   3
## 6:         TCGA-A8-A08B   4   4

比较突变负荷 (Mutation load) 与TCGA队列

* 将输入MAF中的突变负荷与来自MC3项目的所有33个TCGA队列进行比较

* 肿瘤突变负荷 (Tumor mutational burden, TMB)是指在患者的癌细胞中发现的基因突变数量

  * 通常：高TMB的癌症更可能对免疫治疗或靶向治疗等治疗类型，有反应；而低TMB的肿瘤对这些治疗类型的反应不大

laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')

# 对突变负荷 (Mutation burden) 的差异，进行：配对t检验 (Pairwise t-test)  # Warning message:  #   In FUN(X[[i]], ...) : Removed 1 samples with zero mutations.

基因的等位基因变异频率分布

基因的 VAF (Variant Allele Frequency, 等位基因变异频率) 分布

* 箱线图绘制中，抖动(jitter)中的每个点(dot)，都是该基因的 1种变异

  * VAF 是检测到的、与特定DNA变异匹配的测序Reads的百分比，分母是该位点的总Reads数
  * VAF 是原始样本中携带变异的DNA分子比例的测量 (例如可代表血浆cfDNA中突变肿瘤DNA的百分比)

# Plots vaf distribution of genes as a boxplot. Each dot in the jitter is a variant.# pdf("Fig.箱线图绘制基因的vaf分布_抖动-jitter-中的每个点-dot-都是1个变异.pdf")plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

# dev.off()

CNV

maftools也提供了整合和分析拷贝数变异 (CNV) 数据的选项

* CNV数据可来自：GISTIC和CBS (Circular binary segmentation)算法

* GISTIC (Genomic identification of significant targets in cancer, 癌症中重要靶点的基因组鉴定)

  * GISTIC模块可识别在1组样本中显着扩增或缺失 (Amplified or Deleted)的基因组区域
  * 每个异常 (Aberration) 都被分配了一个G-score，该得分考虑了异常的幅度、及跨样本发生的频率
  * https://broadinstitute.github.io/gistic2/

* CBS (Circular binary segmentation, 循环二元分割算法)是常用的 SNParray 数据分段算法，使用芯片数据检测CNV，目前也适用于NGS数据

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)

## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples

# GISTIC objectlaml.gistic

## An object of class  GISTIC 
##           ID summary
## 1:   Samples     191
## 2:    nGenes    2622
## 3: cytoBands      16
## 4:       Amp     388
## 5:       Del   26481
## 6:     total   26869

gisticChromPlot(gistic = laml.gistic, markBands = "all")

gisticBubblePlot(gistic = laml.gistic)

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")plotCBSsegments(cbsFile = tcga.ab.009.seg)

## NULL

突变的互斥、共存

突变事件存在互斥或共存的 (Exclusive / Co-occurring)的基因对

* 三角矩阵，绿色表示共现 (Co-occurring) 趋势；金色表示排它性 (Exclusiveness) 趋势

# exclusive/co-occurance event analysis on top 10 mutated genessomaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

鉴定肿瘤Driver/驱动基因

* maftools的1个函数oncodrive。根据oncodriveCLUST算法进行运算

* 此概念基于：大多数致癌基因的变异在少数特定位点(即热点)的富集

laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')head(laml.sig)

plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)

Pfam蛋白结构域的注释、汇总

laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)

# Protein summary (Printing first 7 columns for display convenience)laml.pfam$proteinSummary[,1:7, with = FALSE]

# Domain summary (Printing first 3 columns for display convenience)laml.pfam$domainSummary[,1:3, with = FALSE]

单基因、多基因生存分析

# Survival analysis based on grouping of DNMT3A mutation statusmafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)

## DNMT3A 
##     48 
##     Group medianTime   N
## 1: Mutant        245  45
## 2:     WT        396 137

利用前20个突变基因，识别1组基因(个数为2)，来预测预后不良的分组 (Poor prognostic groups)

Hide

# Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groupsprog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)print(prog_geneset)

##     Gene_combination P_value    hr  WT Mutant
##  1:      FLT3_DNMT3A 0.00104 2.510 164     18
##  2:      DNMT3A_SMC3 0.04880 2.220 176      6
##  3:      DNMT3A_NPM1 0.07190 1.720 166     16
##  4:      DNMT3A_TET2 0.19600 1.780 176      6
##  5:        FLT3_TET2 0.20700 1.860 177      5

选择显著的，绘制生存曲线

mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")

##     Group medianTime   N
## 1: Mutant      242.5  18
## 2:     WT      379.5 164

两组样本队列之间的差异突变分析

队列比较：比较两个队列的MAF

#Primary APL MAFprimary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")primary.apl = read.maf(maf = primary.apl)# Relapse APL MAFrelapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")relapse.apl = read.maf(maf = relapse.apl)

# Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single samplept.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)print(pt.vs.rt)

## $results
##    Hugo_Symbol Primary Relapse         pval         or       ci.up      ci.low
## 1:         PML       1      11 1.529935e-05 0.03537381   0.2552937 0.000806034
## 2:        RARA       0       7 2.574810e-04 0.00000000   0.3006159 0.000000000
## 3:       RUNX1       1       5 1.310500e-02 0.08740567   0.8076265 0.001813280
## 4:        FLT3      26       4 1.812779e-02 3.56086275  14.7701728 1.149009169
## 5:      ARID1B       5       8 2.758396e-02 0.26480490   0.9698686 0.064804160

队列之间的差异突变基因 (Differentially mutated genes between two cohorts) 显示为森林图 (Forest plot)

* 条形表示比值比 (Odds ratio) 的95%置信区间

* 图中相邻的2列数值：在被突出显示的基因的突变，在两组样本队列中的样本数量

* P值：(∗∗∗) P<0.001；(∗∗) P<0.01；费舍尔精确检验

forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1, color = c('royalblue', 'maroon'), geneFontSize = 0.8)

指定基因集，并排画2个oncoplot，进行队列比较

* 例如：涉及某些信号通路 (的某些基因)，其相关基因的*突变途径 (Mutated pathways)*

* 与这些途径相关的基因，是否优先在其中1个队列中富集，是否以互斥的方式突变？

 * eg: Mutated pathways involving genes associated with VGSC and ERBB signaling in EAC (食管腺癌). 
 * Genes associated with these pathways are preferentially enriched in EAC, mutated in a mutually exclusive manner.

genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)

并排画两个条形图，用于不同队列或分组的比较

coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")

并排画两个叠加的、蛋白突变及结构域图，用于单个蛋白突变形式的、不同队列或分组的比较

lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")

每个基因在某个分组中，突变差异显著性的比较

fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')

## 
## M0 M1 M2 M3 M4 M5 M6 M7 
## 19 44 44 21 39 19  3  3

# Results are returned as a list. Significant associations p-value < 0.05fab.ce$groupwise_comparision[p_value < 0.05]

Hugo_Symbol <chr>	Group1 <chr>	Group2 <chr>	n_mutated_group1 <chr>	n_mutated_group2 <chr>	p_value <dbl>
IDH1	M1	Rest	11 of 44	7 of 149	0.0002597371
TP53	M7	Rest	3 of 3	12 of 190	0.0003857187
DNMT3A	M5	Rest	10 of 19	38 of 174	0.0057610493
CEBPA	M2	Rest	7 of 44	6 of 149	0.0117352110
RUNX1	M0	Rest	5 of 19	11 of 174	0.0117436825
NPM1	M5	Rest	7 of 19	26 of 174	0.0248582372
CEBPA	M1	Rest	6 of 44	7 of 149	0.0478737468

plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05)

药物-基因的互作

检查药物-基因相互作用和药物类别

dgi = drugInteractions(maf = laml, fontSize = 0.75)

dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)

## Number of claimed drugs for given genes:
##      Gene N
## 1: DNMT3A 7

# Printing selected columnsdnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]

已知的致癌生物途径的富集

OncogenicPathways(maf = laml)

PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")

基于变异等位基因频率(VAF)的样本的变异聚类

Variant Allele Frequencies, VAF

# Heterogeneity in sample TCGA.AB.2972# install.packages("mclust")# library("mclust")# tsb参数：指定必须为其进行聚类的样本名称 (tumor_sample_barcode)  # specify sample names (Tumor_Sample_Barcodes) for which clustering has to be donetable('TCGA-AB-2972' %in% laml@data$Tumor_Sample_Barcode)

## 
## TRUE 
##    1

tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')

print(tcga.ab.2972.het$clusterMeans)

##    Tumor_Sample_Barcode cluster   meanVaf
## 1:         TCGA-AB-2972       2 0.4496571
## 2:         TCGA-AB-2972       1 0.2454750
## 3:         TCGA-AB-2972 outlier 0.3695000

# Visualizing resultsplotClusters(clusters = tcga.ab.2972.het)

读入CBS segmented的CNV文件

列名应该是Sample，染色体，Start, End, Num_Probes, Segment_Mean (log2 scale度)

seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')

##    Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
## 1:        PHF6         23      133551224    133551224         TCGA-AB-3009
##        t_vaf Segment_Start Segment_End Segment_Mean CN
## 1: 0.9349112            NA          NA           NA NA

# Visualizing results. Highlighting those variants on copynumber altered variantsplotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)

突变特征 (Mutational Signatures)

Nature (2013, doi:10.1038/nature12477)中的定义：

* 所有的癌症都是由体细胞突变引起的 (All cancers are caused by somatic mutations)，目前对于该生物学过程的理解有限。而来自癌症基因组的体细胞突变目录 (Catalogue)，具有已经起作用的突变过程的特征 (The catalogue of somatic mutations from a cancer genome bears the signatures of the mutational processes that have been operative)。
* 即：细胞在成长过程中基因组不断受到内源性和外源性DNA损伤的威胁，不断发生变化，最终发生一些突变的积累；每一个突变过程都会留下一个不同的基因组标记，称为"Mutational Signatures"。
* 通过分析来自7,042个癌症(Primary cancers of 30 different classes; 507 from WGS and 6,535 from WES)的4,938,362个突变，并提取了20多种不同的突变特征。
* 一些存在于许多癌症类型中，特别是胞苷脱氨酶APOBEC家族的特征，而其它局限于单一的癌症类型。
* 某些特征与癌症诊断时患者的年龄、已知的诱变暴露或DNA维持缺陷有关，但许多特征是未知的。
* 除了这些全基因组突变特征外，在许多癌症类型中还发现了位于小基因组区域“kataegis”的超突变。
* 突变特征的研究，揭示了癌症发展背后的突变过程的多样性，对了解癌症病因、预防和治疗具有潜在意义。

APOBEC富集

* 在突变位点的两侧提取单个5'和3'碱基，进行de-novo特征分析，同时估计APOBEC富集分数。
* APOBEC3G蛋白属APOBEC家族成员，又称为CEM15蛋白，是一种胞嘧啶脱氨酶。
* APOBEC家族成员均被证明对单链DNA/RNA上的胞苷有脱氨基作用，可通过诱发靶基因上的胞嘧啶脱氨基成为尿嘧啶(C>U) 参与大部分基因突变。

# Requires BSgenome objectlibrary(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)

UCSC提供的人全基因组序列(eg: hg19 / GRCh37.p13)，存储在Biostrings对象中

* 这个BSgenome数据包由以下源数据文件制成: hg19.2bit (二进制，下载自：https://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/latest/ , March 24, 2020)

laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19")

## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
## -Creating mutation matrix
## --matrix of dimension 188x96

绘图展示APOBEC富集和非APOBEC富集样本之间的差异

plotApobecDiff(tnm = laml.tnm, maf = laml, pVal = 0.2)

## $results
##      Hugo_Symbol Enriched nonEnriched       pval        or      ci.up
##   1:        TP53        2          13 0.08175632 5.9976455  46.608861
##   2:        TET2        1          16 0.45739351 1.9407002  18.983979
##   3:        FLT3        2          45 0.65523131 1.4081851  10.211621
##   4:      DNMT3A        1          47 1.00000000 0.5335362   4.949499
##   5:      ADAM11        0           2 1.00000000 0.0000000 164.191472

# Plots differences between APOBEC enriched and non-APOBEC enriched samples

根据共表型相关度量 (Cophenetic correlation metric) 估计癌症特征(Signatures)的数量

可使用最大至6个Signatures

#  library('NMF')#  laml.sign = estimateSignatures(mat = laml.tnm, nTry = 6)# Run main function with maximum 6 signatureslaml.sign = estimateSignatures(mat = laml.tnm, nTry = 6, pConstant = 0.1, plotBestFitRes = FALSE, parallel = 2)

## -Running NMF for 6 ranks
## Compute NMF rank= 2  ... + measures ... OK
## Compute NMF rank= 3  ... + measures ... OK
## Compute NMF rank= 4  ... + measures ... OK
## Compute NMF rank= 5  ... + measures ... OK
## Compute NMF rank= 6  ... + measures ... OK

## -Finished in 00:01:00 elapsed (12.0s cpu)

plotCophenetic(res = laml.sign)

COSMIC数据库

* COSMIC (Catalogue Of Somatic Mutations In Cancer, 癌症体细胞突变目录) 数据库记录了与人类癌症相关的驱动基因，搜集和记录了大量文献中的数百万肿瘤样本的编码区变异，以及促进癌症的所有体细胞突变的非编码区变异、基因融合、拷贝数变异和耐药基因变异。
* COSMIC采用人工审核的方式扩充数据库，实时追踪文献的最新信息，将与癌症相关的重要基因优先放入数据库中。
* COSMIC-3D可研究三维蛋白结构中的突变，及其对蛋白功能和药物的影响。

#  laml.sig = extractSignatures(mat = laml.tnm, n = 3)laml.sig = extractSignatures(mat = laml.tnm, n = 3, pConstant = 0.1,  parallel = 2)

与原始的 30 个 signatures 比较 (COSMIC signatures)

# laml.og30.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "legacy")

与更新的 60 个 signatures 比较 (version3, COSMIC signatures)

# laml.v3.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "SBS")# save(file = "compareSignatures.RData", list = ls())

rm(list = ls())load("compareSignatures.RData")pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")

pheatmap::pheatmap(mat = laml.v3.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")

maftools::plotSignatures(nmfRes = laml.sig, title_size = 0.8, sig_db = "SBS")

3D条形图展示第1个Signature

#  Visualize first signaturesig1 = laml.sig$signatures[,1]barplot3d::legoplot3d(contextdata = sig1, labels = FALSE, scalexy = 0.01, sixcolors = "sanger", alpha = 0.5)#  colnames(laml.sign$contributions) = as.character(getSampleSummary(x = laml)[,Tumor_Sample_Barcode])[1:187]

laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig)

## 
## Signature_1 Signature_2 Signature_3 
##          60          65          63

plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)

分析ANNOVAR变异注释数据

var.annovar = system.file("extdata", "variants.hg19_multianno.txt", package = "maftools")var.annovar.maf = annovarToMaf(annovar = var.annovar, Center = 'CSI-NUS', refBuild = 'hg19', tsbCol = 'Tumor_Sample_Barcode', table = 'ensGene')

## -Reading annovar files
## -Processing Exonic variants
## --Adding Variant_Classification
## --Parsing aa-change
## -Processing Non-exonic variants
## --Adding Variant_Classification
## -Adding Variant_Type
## -Converting Ensemble Gene IDs into HGNC gene symbols
## --Done. Original ensemble gene IDs are preserved under field name ens_id
## Finished in 1.100s elapsed (0.210s cpu)

# Read sample ICGC data for ESCAesca.icgc <- system.file("extdata", "simple_somatic_mutation.open.ESCA-CN.sample.tsv.gz", package = "maftools")esca.maf <- icgcSimpleMutationToMAF(icgc = esca.icgc, addHugoSymbol = TRUE)

## --Removed 427 duplicated variants

#Printing first 16 columns for display convenienceprint(esca.maf[1:5,1:16, with = FALSE])

##    Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## 1:  AC005237.4             NA     NA     GRCh37          2      241987787
## 2:  AC061992.1            786     NA     GRCh37         17       76425382
## 3:  AC097467.2             NA     NA     GRCh37          4      156294567
## 4:    ADAMTS12             NA     NA     GRCh37          5       33684019
## 5:  AL589642.1          54801     NA     GRCh37          9       32630154

基因名称 (如曾用名) 的转换

laml.mutsig.corrected = prepareMutSig(maf = laml)#  Converting gene names for 1 variants from 1 genes#     Hugo_Symbol  MutSig_Synonym  N#  1:    ARHGAP35           GRLF1  1#  Original symbols are preserved under column OG_Hugo_Symbol.

输出本分析流程中，各个软件的版本

#  devtools::install_github(repo = "PoisonAlien/TCGAmutations")sessionInfo()

资料来源

maftools : Summarize, Analyze and Visualize MAF Files

Anand Mayakonda 2022-04-26

https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

Package ‘maftools’ August 11, 2022

https://master.bioconductor.org/packages/release/bioc/manuals/maftools/man/maftools.pdf

Mutation Annotation Format, MAF

https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/

vcf2maf 程序

https://github.com/mskcc/vcf2maf

gakt funcotator 程序

https://gatk.broadinstitute.org/hc/en-us/articles/360035889931-Funcotator-Information-and-Tutorial

ANNOVAR

https://annovar.openbioinformatics.org/en/latest/

示例数据：TCGA的 LAML 队列

Cancer Genome Atlas Research, N. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 368, 2059-74 (2013).

ixxmu / mp_duty

一文学会R包 maftools 简介使用 #3689

一文学会R包 maftools 简介使用 by 生信宝典

生成 MAF 文件

MAF文件中需要的字段 (Field，即：文件中的“列”)

初始化，准备R包

读入maf文件及临床信息

MAF文件概要

对MAF对象取子集

MAF汇总图

体细胞突变景观图

转换和颠换变异分布

突变及结构域在蛋白上的分布

读入另一组实验数据

单样本Ti/Tv雨点图

比较突变负荷 (Mutation load) 与TCGA队列

基因的等位基因变异频率分布

CNV

突变的互斥、共存

Pfam蛋白结构域的注释、汇总

单基因、多基因生存分析

两组样本队列之间的差异突变分析

药物-基因的互作

已知的致癌生物途径的富集

基于变异等位基因频率(VAF)的样本的变异聚类

分析ANNOVAR变异注释数据