p-koo / learning_sequence_motifs

"Representation Learning of Genomic Sequence Motifs with Convolutional Neural Networks" by Peter K. Koo and Sean R. Eddy
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007560

Learning Sequence Motifs

This repository contains the datasets and scripts needed to reproduce the results of "Representation Learning of Genomic Sequence Motifs with Convolutional Neural Networks" by Peter K. Koo and Sean R. Eddy, available at: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007560

The code here depends on Deepomics, a custom high-level API built on top of TensorFlow for building, training, testing, and evaluating neural network models. WARNING: Deepomics is a required sub-repository. To clone this repository properly, use:

$ git clone --recursive https://github.com/p-koo/learning_sequence_motifs.git

Dependencies

Overview of the code

To generate datasets:

To train the models on the synthetic dataset and the in vivo dataset:

This script trains each model, evaluates its generalization performance on the test set, plots visualizations of the first convolutional layer filters, and saves a .meme file for the Tomtom search comparison tool. Each model can be found in /code/models/
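As an illustration of the .meme output step, here is a minimal sketch that writes position probability matrices in the MEME minimal motif format, which Tomtom reads. It is not the repository's code: the function name `format_meme`, the `filter` name prefix, the uniform background, and the placeholder `nsites` value are assumptions, and how the filters are converted to probability matrices is not shown here.

```python
def format_meme(pwms, prefix="filter"):
    """Return MEME minimal-format text for a list of probability matrices.

    Each matrix is a list of rows [pA, pC, pG, pT]. Rows are renormalized
    to sum to 1, so every row must have a positive sum.
    """
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        "A 0.25 C 0.25 G 0.25 T 0.25",  # assumed uniform background
        "",
    ]
    for index, pwm in enumerate(pwms):
        lines.append(f"MOTIF {prefix}{index}")
        # nsites is a placeholder value chosen for illustration only
        lines.append(
            f"letter-probability matrix: alength= 4 w= {len(pwm)} nsites= 20 E= 0"
        )
        for row in pwm:
            total = sum(row)
            lines.append(" ".join(f"{value / total:.6f}" for value in row))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-position motif, used only to show the output layout.
    example = [[[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]]
    print(format_meme(example))
```

Returning the text rather than writing a file keeps the sketch easy to inspect; the result can be written to disk with an ordinary `open(path, "w")` call.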

To run the Tomtom search comparison tool:

This requires that Tomtom is installed and can be invoked from the command line in the current directory.
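For orientation, a Tomtom comparison can be launched from Python roughly as follows. This is a hedged sketch, not the repository's script: the function names, file paths, and the E-value threshold of 0.1 are assumptions. It uses the standard Tomtom options `-evalue`, `-thresh`, and `-oc`, followed by the query and target motif files.

```python
import shutil
import subprocess


def build_tomtom_command(query_meme, target_meme, output_dir, evalue_threshold=0.1):
    """Build the argument list for a Tomtom run (threshold is an assumed default)."""
    return [
        "tomtom",
        "-evalue",                         # interpret -thresh as an E-value
        "-thresh", str(evalue_threshold),
        "-oc", output_dir,                 # output directory, overwritten if present
        query_meme,
        target_meme,
    ]


def run_tomtom(query_meme, target_meme, output_dir):
    """Run Tomtom, failing early if the executable is not on the PATH."""
    if shutil.which("tomtom") is None:
        raise RuntimeError("tomtom executable not found on PATH")
    command = build_tomtom_command(query_meme, target_meme, output_dir)
    subprocess.run(command, check=True)
```

A call such as `run_tomtom("filters.meme", "motif_database.meme", "tomtom_out")` would compare the saved filters against a motif database; both file names here are hypothetical.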

To calculate statistics across different initialization trials for each model, this script aggregates the matches to ground truth motifs:
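A minimal sketch of this kind of aggregation, under assumed inputs: each trial is represented as a list holding the name of the best-matching motif for each filter (or `None` when there is no match), and the statistic is the fraction of filters that match a ground truth motif. The function names and data layout are hypothetical and may differ from what the repository's script does.

```python
from statistics import mean, stdev


def match_fraction(filter_matches, ground_truth):
    """Fraction of filters whose best match is a ground truth motif."""
    hits = sum(1 for name in filter_matches if name in ground_truth)
    return hits / len(filter_matches)


def summarize_trials(trials, ground_truth):
    """Mean and sample standard deviation of the match fraction across trials."""
    fractions = [match_fraction(trial, ground_truth) for trial in trials]
    # stdev needs at least two values; report 0.0 for a single trial
    spread = stdev(fractions) if len(fractions) > 1 else 0.0
    return mean(fractions), spread


if __name__ == "__main__":
    # Hypothetical motif names and match results, for illustration only.
    ground_truth = {"motif_a", "motif_b"}
    trials = [
        ["motif_a", "motif_b", None, "other"],
        ["motif_a", None, None, None],
    ]
    print(summarize_trials(trials, ground_truth))
```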

To plot 2nd layer filters for CNN-1 and 2nd and 3rd layer filters for CNN-1-3:

To run the Tomtom search comparison tool on the deeper-layer filters:

To calculate statistics across different initialization trials for each model, this script aggregates the matches to ground truth motifs for 2nd layer filters for CNN-1 and CNN-1-3:

Overview of data

Overview of results