kundajelab / tfmodisco

TF MOtif Discovery from Importance SCOres
MIT License

TF-MoDISco: Transcription-Factor Motif Discovery from Importance Scores

This repository contains the code developed for the associated manuscript, "Distilling consolidated DNA sequence motifs and cooperative motif syntax from neural-network models of in vivo transcription-factor binding profiles". The analysis scripts and notebooks used to reproduce the results in the manuscript are kept in a separate companion repository.

General users should visit the TF-MoDISco-lite repository for a more efficient, actively maintained, and easier-to-use version of the same algorithm.

Structure of TF-MoDISco

The TF-MoDISco algorithm starts with a set of importance scores on genomic sequences, and can perform the following tasks:

  1. Identify high-importance windows of the sequences, termed "seqlets"
  2. Cluster recurring similar seqlets into motifs
  3. Scan through importance scores across the genome to call motif instances (AKA "hit scoring")
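To make step 1 concrete, here is a minimal sketch of picking high-importance windows from a single track of per-position scores. It is a toy illustration only and does not reproduce TF-MoDISco's actual seqlet-identification procedure or its API; the function name `find_seqlets`, its parameters, and the greedy non-overlapping selection are all assumptions made for this example.

```python
import numpy as np


def find_seqlets(scores, window=5, num_seqlets=2):
    """Toy seqlet caller (hypothetical helper, not part of TF-MoDISco).

    scores: 1D sequence of per-position importance values.
    Returns a sorted list of (start, end) half-open intervals for the
    highest-scoring, mutually non-overlapping windows.
    """
    scores = np.asarray(scores, dtype=float)
    if len(scores) < window:
        return []

    # Total absolute importance inside every sliding window.
    window_sums = np.convolve(np.abs(scores), np.ones(window), mode="valid")

    seqlets = []
    for _ in range(num_seqlets):
        start = int(np.argmax(window_sums))
        if not np.isfinite(window_sums[start]):
            break  # every remaining window overlaps a chosen seqlet
        seqlets.append((start, start + window))

        # Mask all window starts that would overlap the chosen window.
        lo = max(0, start - window + 1)
        hi = min(len(window_sums), start + window)
        window_sums[lo:hi] = -np.inf

    return sorted(seqlets)


# Synthetic track with two planted high-importance regions.
scores = np.zeros(30)
scores[3:8] = 1.0
scores[20:25] = 2.0
print(find_seqlets(scores))  # [(3, 8), (20, 25)]
```

In this simplified picture, the windows returned here would then be compared and grouped into motifs (step 2), and the resulting motifs scanned against importance scores to call instances (step 3).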

Installing TF-MoDISco

pip install modisco

Alternatively, for a specific tagged version or commit, install from source code by cloning this repository, checking out the desired version, and running pip install -e /path/to/cloned/repo.
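After either installation route, it can be useful to confirm which version ended up in the environment. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical, and the distribution name `modisco` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(dist_name="modisco"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```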

Required inputs to run the algorithm

In order to run the TF-MoDISco algorithm, the following data is required as an input:

Other resources

A technical note describing version 0.5.6.5 is available at https://arxiv.org/abs/1811.00416.

Video of talk at NeurIPS MLCB 2017

Example notebooks for running the algorithm: