Closed ixxmu closed 4 months ago
❝一般我们对得到的 peaks 文件根据研究目的也许会进行 motif 分析, 来揭示目的蛋白所结合的序列特征。常见的软件有 homer, meme。这次我们使用 monaLisa 来进行 motif 分析。
参考文档:
https://bioconductor.org/packages/release/bioc/vignettes/monaLisa/inst/doc/monaLisa.html#1_Introduction
BiocManager::install("monaLisa")
读取获得的 peaks 的 bed 文件, 注意需要让它们长度一致, 然后获取对应的序列:
BiocManager::install("monaLisa")
pacman::p_install_gh("da-bar/JASPAR2020")
setwd("~/JunJun/8.atacSeq_proj/7.motif-data/")
library(monaLisa)
library(JASPAR2020)
library(rtracklayer)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg19)
# load peaks
rawpeak <- import.bed("../5.peak-data/CTL_overlap.bed")
# we would recommend to use regions of similar or even equal lengths to avoid a length bias
singleSet <- resize(rawpeak, width = median(width(rawpeak)), fix = "center")
# check
head(singleSet,3)
# GRanges object with 3 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 948979-949278 *
# [2] chr1 1051386-1051685 *
# [3] chr1 1057464-1057763 *
# -------
# seqinfo: 37 sequences from an unspecified genome; no seqlengths
提取序列:
# extract sequence
singleseqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, singleSet)
# check
head(singleseqs,3)
# DNAStringSet object of length 3:
# width seq
# [1] 300 GGGGAGGTGGGCTCTGTGCCAGCCAATTTTCGTCTCCCTCCC...GGATGTAGAGGACAGACAGGAGGGAGCACTGTCCCTGGGTA
# [2] 300 GACCCCAGCGCCCCAACCCGCTACCCTCACGCCTGCCCCCAG...CGGGGCGCGAAGCGCCCTGGGAAATGTAGTCCTAGAAGAAC
# [3] 300 AGGAGGCTGGGGGGCCTCTGTCCCGCCATGGCTTCCCCGGGA...TCCGACTCCTGCCCGGGAAGTCAGGCAGTGTGGCCGGGACC
准备 motif 数据库文件:
# Prepare motif enrichment analysis,
# extract all vertebrate motifs from the
# JASPAR2020 package as positional weight matrices (PWMs)
pwms <- getMatrixSet(JASPAR2020,
opts = list(matrixtype = "PWM",
tax_group = "vertebrates"))
富集分析:
# Single set motif enrichment analysis:
# comparing a set of sequences to a suitable background
sinse <- calcBinnedMotifEnrR(seqs = singleseqs,
pwmL = pwms,
background = "genome",
genome = BSgenome.Hsapiens.UCSC.hg19,
genome.regions = NULL, # sample from full genome
genome.oversample = 2,
BPPARAM = BiocParallel::SerialParam(RNGseed = 42),
verbose = TRUE)
# Filtering sequences ...
# in total filtering out 0 of 12335 sequences (0%)
# Scanning sequences for motif hits...
# Create motif hit matrix...
# starting analysis of bin 1
# Defining background sequence set (genome)...
# When we visualize motifs that are enriched with an adjusted p value of less than 10−4
pv <- data.frame(tf = names(assay(sinse, "negLog10Padj")[, 1]),
p = assay(sinse, "negLog10Padj")[, 1]) %>%
na.omit() %>%
dplyr::filter(p > 4) %>%
dplyr::arrange(desc(p))
绘图:
# plot
plotMotifHeatmaps(x = sinse[pv$tf[1:50],],
which.plots = c("log2enr", "negLog10Padj"),
width = 1.8, maxEnr = 2, maxSig = 10,
cluster = T,
show_dendrogram = TRUE,
show_seqlogo = TRUE)
我们用 homer 也看看 known motif 结果:
findMotifsGenome.pl ../5.peak-data/CTL_overlap.bed hg19 CTL_motif -mask
❝此外,这个包还可以根据 bed 文件一列的数值型变量,可以分成不同的 bin, r 然后进行富集分析, 包里举的例子是代表不同甲基化的水平, 图像下面这样,每一列是一个数值区间:
❝路漫漫其修远兮,吾将上下而求索。
欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群 (微信交流群需收取 20 元入群费用,一旦交费,拒不退还!(防止骗子和便于管理)) 。QQ 群可免费加入, 记得进群按格式修改备注哦。
老俊俊微信:
知识星球:
❝rgt-hint 转录因子预测结果可视化(续) rgt-hint 转录因子预测结果可视化 绘图时如何处理 Input 样本 ATAC-seq 数据分析实操教程 ATAC-seq 分析方法一览 locuszoomr 可视化 GWAS 结果 2024 了,听说你还在用 dplyr 处理数据? 计算 motif 到 peak center 的距离 写个包 metaplot 绘制 m6A peak 分布 当 ggSCvis 遇到 ggmagnify
https://mp.weixin.qq.com/s/nJJrH46Vg1a4v4vYVK0RhA