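Last time we covered immune-infiltration analysis of DNA methylation data. This time we continue with an R package for processing DNA methylation data, and look at what sparks fly when methylation data meets gene expression profiles.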
Reference: Inferring regulatory element landscapes and transcription factor networks from cancer methylomes.
Published in Genome Biology, 2015 (a Q1-tier journal).
doi: 10.1186/s13059-015-0668-3.
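In one sentence: ELMER (Enhancer Linking by Methylation/Expression Relationships) uses DNA methylation to identify enhancers, and correlates enhancer state with the expression of nearby genes to identify transcriptional targets.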
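Before getting to the package's functions, here is a brief introduction to two data structures: MultiAssayExperiment and its sibling SummarizedExperiment. Do not underestimate them: both are key data containers in bioinformatics analysis.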
The basic structure of a MultiAssayExperiment looks like this:
library(MultiAssayExperiment)
library(GenomicRanges)
empty <- MultiAssayExperiment()
empty
## A MultiAssayExperiment object of 0 listed
## experiments with no user-defined names and respective classes.
## Containing an ExperimentList class object of length 0:
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
slotNames(empty)
## [1] "ExperimentList" "colData" "sampleMap" "drops"
## [5] "metadata"
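A MultiAssayExperiment object has three components:
ExperimentList: a list of assay datasets of arbitrary class, with one column per observation.
colData: data about the patients, cell lines, or other biological units, with one row per unit and one column per variable.
sampleMap: links each column (observation) of every assay in the experiments to exactly one row (biological unit) of colData. A row of colData, however, may map to zero, one, or several columns of each assay, which allows for missing and replicate assays. (In the original figure, the green stripes mark a subject that maps to multiple observations within one experiment.)
The SummarizedExperiment structure is easier to grasp: it is a matrix-like container in which rows represent features of interest (genes, transcripts, exons, and so on) and columns represent samples.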
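These objects contain one or more assays, each represented by a matrix-like object of numeric or other mode.
The rows of a SummarizedExperiment object represent the features of interest.
Information about these features is stored in a DataFrame object that can be accessed with rowData().
Each row of that DataFrame describes the feature in the corresponding row of the SummarizedExperiment object.
Each column of the DataFrame holds a different attribute of the features (for example, gene or transcript IDs).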
library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se
## class: RangedSummarizedExperiment
## dim: 64102 8
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Use assays() to pull the assay data out of a SummarizedExperiment object.
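The airway object above has an empty rowData, so the feature annotation described earlier is not visible there. A minimal sketch of a hand-built SummarizedExperiment may help; the gene names, symbols, and group labels below are made up for illustration only.

```r
library(SummarizedExperiment)

# Toy count matrix: 4 features (rows) x 3 samples (columns); all values are invented
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

se_toy <- SummarizedExperiment(
  assays  = list(counts = counts),
  # one row of feature annotation per row of the assay
  rowData = S4Vectors::DataFrame(symbol = c("A", "B", "C", "D")),
  # one row of sample annotation per column of the assay
  colData = S4Vectors::DataFrame(group = c("tumor", "tumor", "normal"),
                                 row.names = colnames(counts))
)

rowData(se_toy)                     # feature annotation
colData(se_toy)                     # sample annotation
assay(se_toy, "counts")["gene2", ]  # one feature across all samples
```

Subsetting the object with `se_toy[1:2, ]` keeps the assay, rowData, and colData in sync, which is the main reason to use the container rather than separate matrices and tables.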
So when should you use one and when the other?
For assays with different numbers of rows and even columns, MultiAssayExperiment is recommended. For sets of assays with the same information across all rows (e.g., genes or genomic ranges), SummarizedExperiment is the recommended data structure.
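To make the three components concrete, here is a small sketch that builds a MultiAssayExperiment from two toy assays with different rows and different numbers of columns. The patient IDs, column names, and values are hypothetical; patient p2 has two methylation columns to show a replicate mapping.

```r
library(MultiAssayExperiment)

# Two toy assays with different features and different numbers of columns
expr <- matrix(rnorm(6), nrow = 3,
               dimnames = list(c("g1", "g2", "g3"), c("p1_rna", "p2_rna")))
meth <- matrix(runif(12), nrow = 4,
               dimnames = list(paste0("cg", 1:4),
                               c("p1_met", "p2_met_a", "p2_met_b")))

# colData: one row per biological unit (patient)
patients <- S4Vectors::DataFrame(group = c("tumor", "normal"),
                                 row.names = c("p1", "p2"))

# sampleMap: one row per assay column, pointing at its patient
smap <- S4Vectors::DataFrame(
  assay   = factor(c("expr", "expr", "meth", "meth", "meth")),
  primary = c("p1", "p2", "p1", "p2", "p2"),
  colname = c("p1_rna", "p2_rna", "p1_met", "p2_met_a", "p2_met_b")
)

mae_toy <- MultiAssayExperiment(
  experiments = ExperimentList(expr = expr, meth = meth),
  colData     = patients,
  sampleMap   = smap
)

experiments(mae_toy)  # the ExperimentList
sampleMap(mae_toy)    # column-to-patient links
```

Because the two assays do not share rows or columns, they could not sit together in a single SummarizedExperiment, which is the situation the recommendation above describes.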
Now for the hands-on code ⬇
createMAE: build the object used for ELMER analysis.
The basic input requirements are as follows (these matter!):
# TCGA example using TCGAbiolinks
# Testing creation of a MultiAssayExperiment object
# Load libraries
library(ELMER)  # provides createMAE() and get.feature.probe()
library(TCGAbiolinks)
library(SummarizedExperiment)
samples <- c(
"TCGA-BA-4074", "TCGA-BA-4075", "TCGA-BA-4077", "TCGA-BA-5149",
"TCGA-UF-A7JK", "TCGA-UF-A7JS", "TCGA-UF-A7JT", "TCGA-UF-A7JV"
)
The DNA methylation and gene expression objects should use the same sample names as their column names.
#1) Get gene expression matrix
# Aligned against Hg19
query.exp.hg19 <- GDCquery(
project = "TCGA-HNSC",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = samples,
legacy = TRUE # needs a TCGAbiolinks version that still supports the GDC legacy (hg19) archive
)
GDCdownload(query.exp.hg19)
exp.hg19 <- GDCprepare(query.exp.hg19)
# Our object needs to have Ensembl gene IDs as rownames
rownames(exp.hg19) <- values(exp.hg19)$ensembl_gene_id
#2) DNA Methylation
query.met <- GDCquery(
project = "TCGA-HNSC",
legacy = FALSE,
data.category = "DNA Methylation",
data.type = "Methylation Beta Value",
barcode = samples,
platform = "Illumina Human Methylation 450"
)
GDCdownload(query.met)
met <- GDCprepare(query = query.met)
#3) Select distal enhancer probes
distal.enhancer <- get.feature.probe(genome = "hg19",met.platform = "450k")
#4) Considering it is TCGA and SE
mae.hg19 <- createMAE(
exp = exp.hg19,
met = met,
TCGA = TRUE,
genome = "hg19",
filter.probes = distal.enhancer
)
values(getExp(mae.hg19))
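The requirement that both matrices carry the same sample names can be checked before building the object. The sketch below uses plain matrices with hypothetical sample names (`S1` to `S5`); it only illustrates the alignment step and does not reproduce what createMAE does internally.

```r
# Toy matrices with partly overlapping, differently ordered samples (invented names)
met_mat  <- matrix(runif(12), nrow = 3,
                   dimnames = list(paste0("cg", 1:3), c("S1", "S2", "S3", "S4")))
expr_mat <- matrix(rnorm(9), nrow = 3,
                   dimnames = list(paste0("ENSG", 1:3), c("S3", "S1", "S5")))

# Keep only samples present in both, in one common order
shared <- intersect(colnames(met_mat), colnames(expr_mat))
met_aligned  <- met_mat[, shared, drop = FALSE]
expr_aligned <- expr_mat[, shared, drop = FALSE]

# Report what was dropped so nothing disappears silently
setdiff(colnames(met_mat), shared)   # samples with methylation only
setdiff(colnames(expr_mat), shared)  # samples with expression only

stopifnot(identical(colnames(met_aligned), colnames(expr_aligned)))
```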
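getTCGA: download DNA methylation, RNA expression, and clinical data for a specific cancer type directly from TCGA.
This step needs a fast network connection.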
getTCGA(
disease = "BRCA",
Meth = FALSE,
RNA = FALSE,
Clinic = TRUE,
basedir = tempdir(),
genome = "hg19"
)
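The downloaded data are converted into matrices or data frames for downstream analysis.
get.diff.meth: identify hypo- or hypermethylated CpG sites between two groups (for example, normal versus tumor samples).
The main parameters are: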
Arguments | |
---|---|
data | A MultiAssayExperiment with DNA methylation and gene expression data. |
diff.dir | A character string, one of "hypo", "hyper" or "both", giving the direction of differential methylation. |
cores | Number of cores (threads) to use. |
mode | Can be "unsupervised" or "supervised". |
pvalue | Significance threshold on the BH-adjusted P value for selecting significant hypo/hyper-methylated probes. |
group.col | A column defining the sample groups. It can be supplied through colData in the createMAE step. |
group1 | A group from group.col. |
group2 | A group from group.col. |
test | Statistical test to use. |
sig.dif | The smallest DNA methylation difference used as a cutoff for selecting significant hypo/hyper-methylated probes. Default is 0.3. |
save | When TRUE, two getMethdiff.XX.csv files will be generated. |
Hypo.probe <- get.diff.meth(data,
diff.dir="hypo", ## "hyper" is also an option here
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
sig.dif = 0.1) # get hypomethylated probes
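To show how the pvalue and sig.dif cutoffs interact, here is a simplified base R sketch on simulated beta values. It is not ELMER's implementation: the one-sided Welch t-test, the simulated data, and all variable names are assumptions chosen for illustration.

```r
set.seed(1)
n_probes <- 5
n_per_group <- 10

# Simulated beta values; every probe starts out similarly methylated in both groups
tumor  <- matrix(runif(n_probes * n_per_group, 0.6, 0.9), nrow = n_probes)
normal <- matrix(runif(n_probes * n_per_group, 0.6, 0.9), nrow = n_probes)
rownames(tumor) <- rownames(normal) <- paste0("probe", seq_len(n_probes))

# Make probe1 clearly hypomethylated in tumor
tumor["probe1", ] <- runif(n_per_group, 0.1, 0.3)

# One-sided test per probe: is tumor methylation lower than normal?
raw_p <- sapply(seq_len(n_probes), function(i) {
  t.test(tumor[i, ], normal[i, ], alternative = "less")$p.value
})
adj_p <- p.adjust(raw_p, method = "BH")         # BH adjustment, as in the pvalue argument
mean_diff <- rowMeans(tumor) - rowMeans(normal) # negative means hypomethylated in tumor

# Keep probes passing both the adjusted P value and the difference cutoff
hypo_probes <- rownames(tumor)[adj_p < 0.01 & mean_diff < -0.3]
hypo_probes
```

Only the probe that was deliberately shifted passes both filters; a probe with a small P value but a methylation difference below the cutoff would be excluded by the second condition.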
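scatter.plot: draw scatter plots of gene expression against DNA methylation.
# a. generate scatter plots for one probe and its flanking genes
scatter.plot(data,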
byProbe=list(probe=c("cg19403323"),numFlankingGenes=20),
category="definition",
save=TRUE) ## save to pdf
# b. generate one probe-gene pair
scatter.plot(data,
byPair=list(probe=c("cg19403323"),gene=c("ENSG00000143322")),
category="definition",
save=FALSE,
lm_line=TRUE)
There is one more option: byTF = list(TF = c(), probe = c())
metBoxPlot: draw a box plot of the DNA methylation level of one probe in the two groups.
metBoxPlot(data,
group.col = group.col, ## candidates can be listed with colnames(MultiAssayExperiment::colData(data))
group1 = group1,
group2 = group2,
probe ="cg17898069",
minSubgroupFrac = 0.2,
diff.dir = "hypo")
get.enriched.motif: identify motifs that are over-represented in the regions around a set of probes (HM450K).
# If the MAE is set, the background and the probes.motif will be automatically set
enriched.motif <- get.enriched.motif(data = data,
min.motif.quality = "DS",
probes=probes,
pvalue = 1,
min.incidence=2,
label="hypo")
get.TFs: identify regulatory TFs.
TF <- get.TFs(data,
enriched.motif, ## result of the previous step
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
TFs = data.frame(
external_gene_name=c("TP53","TP63","TP73"),
ensembl_gene_id= c("ENSG00000141510",
"ENSG00000073282","ENSG00000078900"),##A data.frame containing TF GeneID and Symbol or a path of XX.csv file
stringsAsFactors = FALSE),
label="hypo",
save = TRUE ##If save is true, two files will be saved: getTF.XX.significant.TFs.with.motif.summary.csv and getTF.hypo.TFs.with.motif.pvalue.rda
)
TF <- get.TFs(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
enriched.motif,
label="hypo")
GetNearGenes: collect the genes neighbouring a locus.
geneAnnot <- getTSS(genome = "hg38")
probe <- GenomicRanges::GRanges(seqnames = c("chr1","chr2"),
ranges = IRanges::IRanges(start = c(16058489,236417627), end = c(16058489,236417627)),
name= c("cg18108049","cg17125141"))
names(probe) <- c("cg18108049","cg17125141")
GetNearGenes(
data = data, ## the MultiAssayExperiment object created in the first step
probes = probe,
geneAnnot = geneAnnot,
TRange = range,
numFlankingGenes = 20
)
getTFtargets: retrieve the target genes of TFs.
getTFtargets(
pairs, ## Output of the get.pair function: a data frame or a file path
enriched.motif, ## The file created by ELMER is getMotif...enriched.motifs.rda
TF.result, ## result of get.TFs
dmc.analysis,
mae, ## A MultiAssayExperiment returned by the createMAE function
save = TRUE,
dir.out = "./",
classification = "family",
cores = 1,
label = NULL
)
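getTSS: fetch the GENCODE gene annotation (transcript level) through the Bioconductor package biomaRt.
If upstream and downstream are specified in the TSS list, promoter regions of the GENCODE genes are generated.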
# get GENCODE gene annotation (transcripts level)
tss <- getTSS()
tss.promoter <- getTSS(genome = "hg38", TSS=list(upstream=1000, downstream=1000))
TF.rank.plot plots the score (-log10(P value)) that assesses the correlation between TF expression level and the average DNA methylation at motif sites.
TF <- get.TFs(data,
enriched.motif,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
TFs = data.frame(
external_gene_name=c("TP53","TP63","TP73"),
ensembl_gene_id= c("ENSG00000141510",
"ENSG00000073282",
"ENSG00000078900"),
stringsAsFactors = FALSE),
label="hypo")
TF.meth.cor <- get(load("getTF.hypo.TFs.with.motif.pvalue.rda"))
TF.rank.plot(motif.pvalue=TF.meth.cor,
motif="P53_HUMAN.H11MO.0.A",
TF.label=createMotifRelevantTfs("subfamily")["P53_HUMAN.H11MO.0.A"],
save=TRUE)
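The score on the plot is simply a transformed P value, so smaller P values rank higher. A tiny base R sketch with invented P values (these are not results from any real analysis, and the TF names are reused only as labels) shows the transformation and ranking:

```r
# Invented P values for four TFs against one motif; illustration only
p_values <- c(TP53 = 1e-8, TP63 = 3e-4, TP73 = 0.02, MYC = 0.6)

score  <- -log10(p_values)             # larger score = stronger association
ranked <- sort(score, decreasing = TRUE)

# Rank 1 is the TF with the smallest P value
data.frame(TF = names(ranked), score = unname(ranked), rank = seq_along(ranked))
```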
get.pair: predict enhancer-gene links.
Hypo.pair <- get.pair(data = data, ## the MultiAssayExperiment object created in the first step
nearGenes = nearGenes, ## result of GetNearGenes
permu.size = 5,
raw.pvalue = 0.2,
Pe = 0.2,
dir.out = "./",
diffExp = TRUE,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
label = "hypo")
schematic.plot: draw a schematic showing the positions of genes and probes.
Here pair is the output of the get.pair function.
schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byProbe = "cg19403323")
schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byGeneID = "ENSG00000009790")
schematic.plot(data,
group.col = "definition",
group1 = "Primary solid Tumor",
group2 = "Solid Tissue Normal",
pair = pair,
byCoordinate = list(chr="chr1", start = 209000000, end = 209960000))
promoterMeth: calculate the association between gene expression and DNA methylation in promoter regions.
Arguments | |
---|---|
data | A MultiAssayExperiment object with DNA methylation and gene expression SummarizedExperiment objects |
sig.pvalue | Significance cutoff for calling a gene silenced by promoter methylation. Default is 0.01. The P value is the raw P value, without adjustment. |
minSubgroupFrac | A number from 0 to 1 giving the fraction of samples used to create the groups U (unmethylated) and M (methylated) that are used to link probes to genes. |
upstream | Number of bp upstream of TSS to consider as promoter region |
downstream | Number of bp downstream of TSS to consider as promoter region |
save | If it is true, the result will be saved |
promoterMeth(data,
sig.pvalue = 0.01,
minSubgroupFrac = 0.4,
upstream = 200,
downstream = 2000,
save = TRUE,
cores = 1)
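The U and M groups in the argument table can be pictured with a short base R sketch. It rests on two assumptions that are not stated in this post: that minSubgroupFrac is split evenly between the least and most methylated samples, and that the two groups are compared with a one-sided Mann-Whitney (Wilcoxon rank-sum) test. The data are simulated.

```r
set.seed(42)
n_samples <- 20

# Simulated promoter probe beta values and an anti-correlated gene expression
meth <- seq(0.05, 0.95, length.out = n_samples)
expr <- 10 - 8 * meth + rnorm(n_samples, sd = 0.5)

min_subgroup_frac <- 0.4
# Assumption: half of the fraction goes to each extreme (here 4 samples per group)
k <- round(n_samples * min_subgroup_frac / 2)

ord <- order(meth)
group_U <- ord[seq_len(k)]                            # least methylated samples
group_M <- ord[seq(n_samples - k + 1, n_samples)]     # most methylated samples

# Is expression lower in the methylated group than in the unmethylated group?
res <- wilcox.test(expr[group_M], expr[group_U], alternative = "less")
p_value <- res$p.value
p_value
```

Using only the extremes means a silencing effect present in a subset of samples can still be detected, at the price of ignoring the samples in the middle of the methylation range.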
TFsurvival.plot: create survival plots based on TF expression.
TFsurvival.plot(data,
TF,##A gene symbol
xlim = NULL,##Limit x axis showed in plot
percentage = 0.3,
save = TRUE)
The data object must contain three clinical columns: vital_status, days_to_last_follow_up, and days_to_death.
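Before calling the plotting function it is worth confirming that these columns exist. The sketch below checks a toy clinical table (values invented) and derives a time and event variable in the usual way for TCGA-style data; how TFsurvival.plot itself combines the columns is not shown in this post, so treat the derivation as an assumption.

```r
# Toy clinical table; in practice this would come from colData(data)
clin <- data.frame(
  vital_status           = c("Alive", "Dead", "Alive"),
  days_to_last_follow_up = c(400, NA, 1200),
  days_to_death          = c(NA, 350, NA),
  stringsAsFactors = FALSE
)

required <- c("vital_status", "days_to_last_follow_up", "days_to_death")
missing_cols <- setdiff(required, colnames(clin))
if (length(missing_cols) > 0) {
  stop("Missing clinical columns: ", paste(missing_cols, collapse = ", "))
}

# Assumed convention: time to death for deceased patients, last follow-up otherwise
clin$time  <- ifelse(clin$vital_status == "Dead",
                     clin$days_to_death,
                     clin$days_to_last_follow_up)
clin$event <- as.integer(clin$vital_status == "Dead")
clin
```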
heatmapGene: heatmap of the correlation between probe DNA methylation and the expression of a single gene.
To use this function, your MAE object (input data) needs to contain all probes, not only the distal ones.
This plot can be used to evaluate the promoter, intron, and exon regions and the closer distal probes of a gene, to check whether their DNA methylation level affects the gene's expression.
Full signature:
heatmapGene(
data,
group.col, ## a column of colData(mae), from the object built in the first step
group1,
group2,
pairs,
GeneSymbol, ## e.g. "LAMB3"
scatter.plot = TRUE,
correlation.method = "pearson",
correlation.table = FALSE,
annotation.col = NULL,
met.metadata = NULL,
exp.metadata = NULL,
dir.out = ".",
filter.by.probe.annotation = TRUE,
numFlankingGenes = 10,
width = 10,
height = 10,
scatter.plot.width = 10,
scatter.plot.height = 10,
filename = NULL
)
heatmapGene(data = data,
group.col = group.col,
group1 = group1,
group2 = group2,
pairs = pairs,
GeneSymbol = "LAMB3",
height = 5,
annotation.col = c("ethnicity","vital_status"),
filename = "heatmap.pdf")
heatmapPairs: heatmap of anti-correlated gene-probe pairs.
heatmapPairs(
data = data,
group.col = group.col,
group1 = group1,
group2 = group2,
annotation.col = c("ethnicity","vital_status","age_at_diagnosis"),
pairs = pairs,
filename = "heatmap.pdf",
height = 4,
width = 11
)
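In short, the functionality this package currently covers can be summarized as:
Differential methylation analysis
Selecting enhancer probes
Identifying enhancer probes with cancer-specific DNA methylation changes
Linking enhancer probes with methylation changes to target genes with expression changes
Motif analysis
Relating TF expression to the methylation of TF-binding motifs
Survival analysis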
That is all for today as a starting point. Try exploring with TCGA data first.
https://mp.weixin.qq.com/s/FoEi0FiYOubUyVMZp-F0bw