如何实现DNA甲基化和基因表达谱数据的联合分析

如何实现DNA甲基化和基因表达谱数据的联合分析 by 生信菜鸟团

上期我们讲到DNA甲基化数据的免疫浸润分析，这一期接着和大家分享DNA甲基化数据处理的R包，看看甲基化和基因表达谱数据碰撞在一起，能擦出多大的火花~

doi: 10.1186/s13059-015-0668-3.
Inferring regulatory element landscapes and transcription factor networks from cancer methylomes - PubMed (nih.gov)
来自2015 Genome Biology 一区

一句话形容：ELMER(Enhancer Linking by Methylation/Expression Relationships)使用DNA 甲基化来识别增强子，并将增强子状态与附近基因的表达联系起来，以确定转录目标。

在认识这个包的函数之前，我们想先简单介绍两个数据结构：

MultiAssayExperiment和它的姊妹SummarizedExperiment，别小看它们，二者是生信分析的关键数据存储对象。

MultiAssayExperiment的基本结构如下：

library(MultiAssayExperiment)
library(GenomicRanges)
empty <- MultiAssayExperiment()
empty

## A MultiAssayExperiment object of 0 listed
##  experiments with no user-defined names and respective classes.
##  Containing an ExperimentList class object of length 0:
##  Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

slotNames(empty)
## [1] "ExperimentList" "colData"        "sampleMap"      "drops"         
## [5] "metadata"

MultiAssayExperiment对象有三个组成部分：

ExperimentList：一个任意类别的检测数据集的列表，每个观察值有一列。
colData：提供关于病人、细胞系或其他生物单位的数据，每个单位有一行，每个变量有一列。
sampleMap：将assay（experiments）中的每一列（观察）与colData中的一行（生物单位）精确联系起来；然而，colData的一行可以映射到每个assay的零、一或多列，允许缺少和重复assay。绿色条纹表示一个experiment中的一个主题与多个观察点（observations）的映射。

SummarizedExperiment结构相对更容易理解，是一个类似矩阵的容器，其中行代表感兴趣的特征（如基因、转录本、外显子等），列代表样本。

这些对象包含一个或多个assay，每个assay由一个数字或其他模式的矩阵状对象。

SummarizedExperiment对象的行代表感兴趣的features。

关于这些features的信息被存储在一个DataFrame对象中，可以用函数rowData()访问。

DataFrame的每一行都提供了SummarizedExperiment对象的相应行中的features信息。

DataFrame的列代表感兴趣的features的不同属性。（例如，基因或转录本ID等）

library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se

## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

可以用assays()取出SummarizedExperiment 对象

至于什么时候用前者，什么时候用后者呢？

For assays with different numbers of rows and even columns, MultiAssayExperiment is recommended.
For sets of assays with the same information across all rows (e.g., genes or genomic ranges), SummarizedExperiment is the recommended data structure.

接着就是代码实操啦⬇

构建分析对象

createMAE

createMAE	构建用于ELMER分析的对象

基本参数要求如下：【很重要哦】

# TCGA example using TCGAbiolinks
# Testing creating MultyAssayExperiment object
# Load library
library(TCGAbiolinks)
library(SummarizedExperiment)
samples <- c(
"TCGA-BA-4074", "TCGA-BA-4075", "TCGA-BA-4077", "TCGA-BA-5149",
"TCGA-UF-A7JK", "TCGA-UF-A7JS", "TCGA-UF-A7JT", "TCGA-UF-A7JV"
)

DNA甲基化和基因表达对象应该有相同的样本名，并作为列名

#1) Get gene expression matrix
# Aligned against Hg19
query.exp.hg19 <- GDCquery(
project = "TCGA-HNSC",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = samples,
legacy = TRUE
)
GDCdownload(query.exp.hg19)
exp.hg19 <- GDCprepare(query.exp.hg19)
# Our object needs to have emsembl gene id as rownames
rownames(exp.hg19) <- values(exp.hg19)$ensembl_gene_id

#2) DNA Methylation
query.met <- GDCquery(
project = "TCGA-HNSC",
legacy = FALSE,
data.category = "DNA Methylation",
data.type = "Methylation Beta Value",
barcode = samples,
platform = "Illumina Human Methylation 450"
)
GDCdownload(query.met)
met <- GDCprepare(query = query.met)

#3)
distal.enhancer <- get.feature.probe(genome = "hg19",met.platform = "450k")

#4) Consisering it is TCGA and SE
mae.hg19 <- createMAE(
exp = exp.hg19,
met = met,
TCGA = TRUE,
genome = "hg19",
filter.probes = distal.enhancer
)
values(getExp(mae.hg19))

getTCGA

直接从TCGA下载特定癌症类型的DNA甲基化、RNA表达和临床数据。
这一步对网速的要求比较高

getTCGA(
disease = "BRCA",
Meth = FALSE,
RNA = FALSE,
Clinic = TRUE,
basedir = tempdir(),
genome = "hg19"
)

下载的数据将被转换为矩阵或数据框，以便进一步分析。

差异甲基化分析

get.diff.meth

识别两组之间的低/高甲基化的CpG位点（即正常与肿瘤样本对比）。

基本参数如下：

Arguments
data	A multiAssayExperiment with DNA methylation and Gene Expression data.
diff.dir	A character can be "hypo", "hyper" or "both", showing differential methylation direction
cores	线程数
mode	Can be "unsupervised" or "supervised".
pvalue	A number specifies the significant P value (adjusted P value by BH) threshold Limit for selecting significant hypo/hyper-methylated probes.
group.col	A column defining the groups of the sample. 可以在createMAE那一步用colData创造.
group1	A group from group.col.
group2	A group from group.col.
test	统计方法
sig.dif	A number specifies the smallest DNA methylation difference as a cutoff for selecting significant hypo/hyper-methylated probes. Default is 0.3.
save	When TRUE, two getMethdiff.XX.csv files will be generated

Hypo.probe <- get.diff.meth(data,
diff.dir="hypo", ##这里可以还选择hyper
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
sig.dif = 0.1) # get hypomethylated probes

scatter.plot

绘制基因表达和DNA甲基化之间的散点图

scatter.plot(data,
             byProbe=list(probe=c("cg19403323"),numFlankingGenes=20),
             category="definition",
             save=TRUE) ## save to pdf

# b. generate one probe-gene pair
scatter.plot(data,
             byPair=list(probe=c("cg19403323"),gene=c("ENSG00000143322")),
             category="definition",
             save=FALSE,
             lm_line=TRUE)

还有一个参数：byTF ==list(TF=c(), probe=c())

metBoxPlot

绘制基因表达和DNA甲基化之间的散点图

metBoxPlot(data,
group.col = group.col, ## 可以通过colnames(MultiAssayExperiment::colData(data))调用
group1 = group1,
group2 = group2,
probe ="cg17898069",
minSubgroupFrac = 0.2,
diff.dir = "hypo")

motif分析

get.enriched.motif

识别一组探针(HM450K)区域中的高代表度的motif

# If the MAE is set, the background and the probes.motif will be automatically set
enriched.motif <- get.enriched.motif(data = data,
min.motif.quality = "DS",
probes=probes,
pvalue = 1,
min.incidence=2,
label="hypo")

转录因子分析

get.TFs

识别调控TFs。

TF <- get.TFs(data,
enriched.motif, ##上一步的运行结果
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
TFs = data.frame(
external_gene_name=c("TP53","TP63","TP73"),
ensembl_gene_id= c("ENSG00000141510",
"ENSG00000073282","ENSG00000078900"),##A data.frame containing TF GeneID and Symbol or a path of XX.csv file 
stringsAsFactors = FALSE),
label="hypo",
save = TRUE ##If save is true, two files will be saved: getTF.XX.significant.TFs.with.motif.summary.csv and getTF.hypo.TFs.with.motif.pvalue.rda
)

TF <- get.TFs(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
enriched.motif,
label="hypo")

GetNearGenes

为一个基因座收集邻近的基因

geneAnnot <- getTSS(genome = "hg38")
probe <- GenomicRanges::GRanges(seqnames = c("chr1","chr2"),
range=IRanges::IRanges(start = c(16058489,236417627), end= c(16058489,236417627)),
name= c("cg18108049","cg17125141"))
names(probe) <- c("cg18108049","cg17125141")

GetNearGenes(
data = data,##第一步函数创造的multiAssayExperiment对象
probes = probe,
geneAnnot = geneAnnot,
TRange = range,
numFlankingGenes = 20
)

getTFtargets

获取TF靶基因

getTFtargets(
pairs,##Output of get.pairs function: dataframe or file path
enriched.motif,##The file created by ELMER is getMotif...enriched.motifs.rda
TF.result,##get.TFs的结果
dmc.analysis,
mae,##A multiAssayExperiment outputed from createMAE function
save = TRUE,
dir.out = "./",
classification = "family",
cores = 1,
label = NULL
)

getTSS

从Bioconductor软件包biomaRt中获取GENCODE基因注释（转录水平）。
如果在TSS列表中指定上游和下游，GENCODE基因的启动子区域将被生成产生。

# get GENCODE gene annotation (transcripts level)
getTSS <- getTSS()
getTSS <- getTSS(genome.build = "hg38", TSS=list(upstream=1000, downstream=1000))

TF.rank.plot

TF.rank.plot绘制分数（-log10(P值)），评估TF表达水平和motif位点的平均DNA甲基化之间的相关性。

TF <- get.TFs(data,
enriched.motif,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
TFs = data.frame(
external_gene_name=c("TP53","TP63","TP73"),
ensembl_gene_id= c("ENSG00000141510",
"ENSG00000073282",
"ENSG00000078900"),
stringsAsFactors = FALSE),
label="hypo")

TF.meth.cor <- get(load("getTF.hypo.TFs.with.motif.pvalue.rda"))

TF.rank.plot(motif.pvalue=TF.meth.cor,
motif="P53_HUMAN.H11MO.0.A",
TF.label=createMotifRelevantTfs("subfamily")["P53_HUMAN.H11MO.0.A"],
save=TRUE)

增强子-基因分析

get.pair

预测增强子-基因的关联性

Hypo.pair <- get.pair(data = data,##第一步函数创造的multiAssayExperiment对象
nearGenes = nearGenes,##GetNearGenes运行的结果
permu.size = 5,
raw.pvalue = 0.2,
Pe = 0.2,
dir.out = "./",
diffExp = TRUE,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
label = "hypo")

schematic.plot

绘图显示基因和探针的位置

pair is the ouput of get.pair function.

schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byProbe = "cg19403323")

schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byGeneID = "ENSG00000009790")

schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byCoordinate = list(chr="chr1", start = 209000000, end = 209960000))

promoterMeth

计算基因表达与DNA 启动子区域的甲基化

Arguments
data	A Multi Assay Experiment object with DNA methylation and gene expression Summarized Experiment objects
sig.pvalue	A number specifies significant cutoff for gene silenced by promoter methylation. Default is 0.01. P value is raw P value without adjustment.
minSubgroupFrac	A number ranging from 0 to 1 specifying the percentage of samples used to create the groups U (unmethylated) and M (methylated) used to link probes to genes.
upstream	Number of bp upstream of TSS to consider as promoter region
downstream	Number of bp downstream of TSS to consider as promoter region
save	If it is true, the result will be saved

promoterMeth(data,
             sig.pvalue = 0.01,
             minSubgroupFrac = 0.4,
             upstream = 200,
             downstream = 2000,
             save = TRUE, 
             cores = 1)

生存分析

TFsurvival.plot

创建基于TF表达的生存图

TFsurvival.plot(data, 
                TF,##A gene symbol
                xlim = NULL,##Limit x axis showed in plot
                percentage = 0.3, 
                save = TRUE)

这里的data要求有 vital_status, days_to_last_follow_up and days_to_death这三列临床信息~

热图

heatmapGene

探针DNA甲基化和单一基因表达之间的相关性热图。
To use this function you MAE object (input data) will need all probes and not only the distal ones.
This plot can be used to evaluate promoter, and intro, exons regions and closer distal probes of a gene to verify if their DNA methylation level is affecting the gene expression.

全部参数：

heatmapGene(
data,
group.col, ##colData(mae)调用第一步的结果
group1,
group2,
pairs,
GeneSymbol,##比如"LAMB3"
scatter.plot = TRUE,
correlation.method = "pearson",
correlation.table = FALSE,
annotation.col = NULL,
met.metadata = NULL,
exp.metadata = NULL,
dir.out = ".",
filter.by.probe.annotation = TRUE,
numFlankingGenes = 10,
width = 10,
height = 10,
scatter.plot.width = 10,
scatter.plot.height = 10,
filename = NULL
)

heatmapGene(data = data,
group.col = group.col,
group1 = group1,
group2 = group2,
pairs = pairs,
GeneSymbol = "LAMB3",
height = 5,
annotation.col = c("ethnicity","vital_status"),
filename = "heatmap.pdf")

heatmapPairs

成对基因和探针反相关的热图

heatmapPairs(
data = data, 
group.col = group.col,
group1 = group1,
group2 = group2,
annotation.col = c("ethnicity","vital_status","age_at_diagnosis"),
pairs,
filename = "heatmap.pdf",
height = 4,
width = 11
)

小结

总而言之，这个包目前囊括的功能可以概括为：
差异甲基化分析
选择增强子探针
识别具有癌症特异性DNA甲基化变化的增强子探针
将有甲基化变化的增强子探针与有表达变化的目标基因联系起来
Motif 分析
将TF的表达与TF结合Motif 的甲基化联系起来
生存分析

今天就先抛砖引玉，大家可以先用TCGA的数据探索看看~

ixxmu / mp_duty