This document describes the general workflow for clustering motif models to remove redundancy, generating an "archetype" motif, and finally performing genome-wide motif scans and removing redundant matches. Please note that this documentation is incomplete and should be used as a rough guide only.
If you are looking for the final results (motif clusters, genome-wide scans and a browser shot) please see the following website: https://resources.altius.org/~jvierstra/projects/motif-clustering/
Note: The above link is for version 1.0 of the motif clusters. While we have yet to update the website, version 2.1-beta (latest) can be found at https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.1beta/.
Contact me at jvierstra (at) altius.org with any questions/requests/comments.
Version 2.0beta-human (https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.0beta/)
Version 1.0 (complete documentation: https://resources.altius.org/~jvierstra/projects/motif-clustering/)
See the runall script in each motif database directory (databases/*).
Here we use TOMTOM to determine the similarity between all motif models (all pairwise) with the following code:
meme2meme databases/*/*.meme > tomtom/all.dbs.meme
tomtom \
-bfile /net/seq/data/projects/motifs/hg19.K36.mappable_only.5-order.markov \
-dist kullback \
-motif-pseudo 0.1 \
-text \
-min-overlap 1 \
tomtom/all.dbs.meme tomtom/all.dbs.meme \
> tomtom/tomtom.all.txt
I have provided a script that runs this operation on a SLURM parallel compute cluster (see bin/runall.tomtom.v2.0beta-human for an example).
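As a rough illustration of what can be done with tomtom/tomtom.all.txt, the sketch below reads TOMTOM-style text output into a table of pairwise E-values. It assumes a tab-delimited file with a header row containing `Query_ID`, `Target_ID` and `E-value` columns, as in recent MEME Suite releases; older releases may name the header differently. The function names `read_tomtom_evalues` and `to_matrix` and the sample rows are hypothetical and are not taken from the provided notebook.

```python
import csv

# Toy TOMTOM-style text output; motif IDs and values are made up for illustration.
SAMPLE = (
    "Query_ID\tTarget_ID\tOptimal_offset\tp-value\tE-value\tq-value\t"
    "Overlap\tQuery_consensus\tTarget_consensus\tOrientation\n"
    "M1\tM1\t0\t1e-10\t2e-10\t4e-10\t8\tTAATTAAA\tTAATTAAA\t+\n"
    "M1\tM2\t1\t1e-03\t2e-03\t4e-03\t7\tTAATTAAA\tCTAATTA\t+\n"
    "M2\tM1\t-1\t1e-03\t2e-03\t4e-03\t7\tCTAATTA\tTAATTAAA\t+\n"
    "M2\tM2\t0\t1e-09\t2e-09\t4e-09\t7\tCTAATTA\tCTAATTA\t+\n"
    "\n"
    "# toy footer line\n"
)


def read_tomtom_evalues(lines):
    """Return {(query, target): E-value} from an iterable of text lines."""
    # Assumption: blank lines and lines starting with '#' carry no records.
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {(r["Query_ID"], r["Target_ID"]): float(r["E-value"]) for r in reader}


def to_matrix(evalues, missing=None):
    """Arrange the E-values as a square matrix ordered by sorted motif ID."""
    ids = sorted({motif for pair in evalues for motif in pair})
    # Pairs that TOMTOM did not report are filled with `missing`.
    matrix = [[evalues.get((q, t), missing) for t in ids] for q in ids]
    return ids, matrix


evalues = read_tomtom_evalues(SAMPLE.splitlines(keepends=True))
ids, matrix = to_matrix(evalues)
print(ids)
print(matrix)
```

For a real run, the same function could be given an open file handle, for example `read_tomtom_evalues(open("tomtom/tomtom.all.txt"))`. How unreported pairs should be filled is a modelling choice that this sketch leaves to the caller.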
After running TOMTOM, open the provided Jupyter Notebook to perform the clustering and visualization.
We perform hierarchical clustering (distance: correlation, complete linkage) on the TOMTOM similarity E-values. Below is a heatmap representation of motifs clustered by similarity, with clusters identified by cutting the dendrogram at height 0.7.
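The clustering step can be sketched in a few lines of dependency-free Python. This is a minimal illustration of complete-linkage agglomerative clustering with correlation distance (1 minus the Pearson correlation between two motifs' similarity profiles), stopped at height 0.7; it is not the notebook's code. The toy profiles, motif names and function names are hypothetical, and the way E-values are transformed into similarity profiles is not specified here.

```python
from itertools import combinations
from math import sqrt


def correlation_distance(u, v):
    """1 - Pearson correlation; assumes neither profile is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1.0 - cov / (norm_u * norm_v)


def complete_linkage_clusters(profiles, height):
    """Merge clusters until the closest pair is farther apart than `height`."""
    names = list(profiles)
    dist = {
        frozenset((a, b)): correlation_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most dissimilar members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[i], clusters[j]) > height:
            break  # equivalent to cutting the dendrogram at `height`
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy similarity profiles (made-up numbers): two homeodomain-like motifs
# and two CCAAT-box-like motifs.
TOY = {
    "HD1": [10, 9, 0, 1],
    "HD2": [9, 10, 1, 0],
    "CC1": [0, 1, 10, 8],
    "CC2": [1, 0, 8, 10],
}

print(complete_linkage_clusters(TOY, height=0.7))
```

On real data, SciPy's `scipy.cluster.hierarchy.linkage` (with `method="complete"` and `metric="correlation"`) followed by `fcluster` with `criterion="distance"` would be the usual choice; the pure-Python version above only avoids extra dependencies and scales poorly beyond small inputs.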
Again, inside the notebook there is code that will process and visualize each motif cluster.
Example motif clusters: AC0002 (homeodomain) and AC0240 (CCAAT-box).
Run the BASH script bin/runall.make-html to generate an HTML webpage (index.html) in the results directory.
I use the software package MOODS to find motif matches genome-wide. It's a great tool that I highly recommend. See bin/runall.scan_models for an example of how to do this on a SLURM cluster.
To create a bigBed file from a bed9+4 file, we need to include an AutoSql file (bed_format.as):
table hg38_motifs_collapsed
"Collapsed motifs matches in hg38 (see: http://www.github.com/jvierstra/motif-clustering)"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "Name of motif"
uint score; "Score"
char[1] strand; "+ or - for strand"
uint thickStart; "Coding region start"
uint thickEnd; "Coding region end"
uint reserved; "itemRgb"
)
Make the tracks for the archetypes
bedToBigBed -as=bed_format.as -type=bed9+4 -tab moods.combined.all.bed chrom.sizes moods.combined.all.bb
awk -v OFS="\t" '{ print $1, $2, $3, $4, $11, $6, $10, $13}' moods.combined.all.bed | bgzip -c > moods.combined.all.bed.gz
tabix -p bed moods.combined.all.bed.gz
Make the tracks for the full motif scans.
fetchChromSizes hg38 > /tmp/chrom.sizes
awk -v OFS="\t" '{ print $1, 0, $2; }' /tmp/chrom.sizes | sort-bed - > /tmp/chrom.sizes.bed
bedops -e 100% moods.combined.all.bed /tmp/chrom.sizes.bed \
| awk -v OFS="\t" '{ print $1, $2, $3, $4, 0, $6, $2, $3, "0,0,0", $5, $7 }' > /tmp/moods
bedToBigBed -as=bed_format.as -type=bed9+2 -tab /tmp/moods chrom.sizes moods.combined.all.bb
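To make the column shuffling in the full-scan commands easier to follow, here is a small Python sketch of the same two steps: keep only matches that lie entirely within their chromosome (the role of `bedops -e 100%` against the chromosome-sizes BED file), then rearrange each record into nine standard BED columns plus two extra ones, mirroring the `awk` expression. The function names and the toy records are hypothetical, and the meaning of the original columns 5 and 7 is not stated here, so they are passed through unchanged.

```python
def within_bounds(fields, chrom_sizes):
    """True if the interval lies entirely inside [0, chromosome length]."""
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom in chrom_sizes and 0 <= start and end <= chrom_sizes[chrom]


def to_bed9_plus2(fields):
    """Mirror: print $1, $2, $3, $4, 0, $6, $2, $3, "0,0,0", $5, $7."""
    chrom, start, end, name, col5, strand, col7 = fields[:7]
    return [
        chrom, start, end, name,
        "0",         # score is fixed at 0
        strand,
        start, end,  # thickStart / thickEnd reuse the match coordinates
        "0,0,0",     # itemRgb (black)
        col5, col7,  # the two extra columns of bed9+2
    ]


# Toy inputs; chromosome length and records are made up for illustration.
CHROM_SIZES = {"chr1": 1000}
RECORDS = [
    ["chr1", "100", "110", "MOTIF_A", "v5a", "+", "v7a"],
    ["chr1", "995", "1005", "MOTIF_B", "v5b", "-", "v7b"],  # runs past the end
    ["chrUn", "5", "15", "MOTIF_C", "v5c", "+", "v7c"],     # unknown chromosome
]

converted = [to_bed9_plus2(r) for r in RECORDS if within_bounds(r, CHROM_SIZES)]
for row in converted:
    print("\t".join(row))
```

Only the first toy record survives the bounds check, and its output row has 11 tab-separated fields, which is what `-type=bed9+2` expects.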