Opening up the blackbox: an interpretable deep neural network-based classifier for cell-type specific enhancer predictions

michaelmhoffman commented 8 years ago

Background. Gene expression is mediated by specialized cis-regulatory modules (CRMs), the most prominent of which are called enhancers. Early experiments indicated that enhancers located far from the gene promoters are often responsible for mediating gene transcription. Knowing their properties, regulatory activity, and genomic targets is crucial to the functional understanding of cellular events, ranging from cellular homeostasis to differentiation. Recent genome-wide investigation of epigenomic marks has indicated that enhancer elements could be enriched for certain epigenomic marks, such as, combinatorial patterns of histone modifications. Methods. Our efforts in this paper are motivated by these recent advances in epigenomic profiling methods, which have uncovered enhancer-associated chromatin features in different cell types and organisms. Specifically, in this paper, we use recent state-of-the-art Deep Learning methods and develop a deep neural network (DNN)-based architecture, called EP-DNN, to predict the presence and types of enhancers in the human genome. It uses as features, the expression levels of the histone modifications at the peaks of the functional sites as well as in its adjacent regions. We apply EP-DNN to four different cell types: H1, IMR90, HepG2, and HeLa S3. We train EP-DNN using p300 binding sites as enhancers, and TSS and random non-DHS sites as non-enhancers. We perform EP-DNN predictions to quantify the validation rate for different levels of confidence in the predictions and also perform comparisons against two state-of-the-art computational models for enhancer predictions, DEEP-ENCODE and RFECS. Results. We find that EP-DNN has superior accuracy and takes less time to make predictions. Next, we develop methods to make EP-DNN interpretable by computing the importance of each input feature in the classification task. 
This analysis indicates that the important histone modifications were distinct for different cell types, with some overlaps, e.g., H3K27ac was important in cell type H1 but less so in HeLa S3, while H3K4me1 was relatively important in all four cell types. We finally use the feature importance analysis to reduce the number of input features needed to train the DNN, thus reducing training time, which is often the computational bottleneck in the use of a DNN. Conclusions. In this paper, we developed EP-DNN, which has high accuracy of prediction, with validation rates above 90 % for the operational region of enhancer prediction for all four cell lines that we studied, outperforming DEEP-ENCODE and RFECS. Then, we developed a method to analyze a trained DNN and determine which histone modifications are important, and within that, which features proximal or distal to the enhancer site, are important.

http://doi.org/10.1186/s12918-016-0302-3

michaelmhoffman commented 8 years ago

This is simply predicting enhancer locations from histone modification data, not the much more interesting question of deciding which enhancers affect which genes.

gwaybio commented 8 years ago

@michaelmhoffman - it seems like other methods have this goal as well (see #61 and #20)

(side note - definitely can agree that is a much more interesting question)

gwaybio commented 8 years ago

Deep feed-forward neural network with dropout, trained on FPKM values for 24 histone modifications assayed by ENCODE. The authors named their method Enhancer Prediction Deep Neural Network (EP-DNN).

Biology

Predict enhancers using chromatin marks. Gold standard positives are p300 binding sites. Gold standard negatives are TSSs and random sites outside DNase hypersensitivity sites (non-DHS). The authors trained four different models, one for each of the four cell types.

Computational aspects

Deep neural network with three hidden layers, one raw input layer, and a single neuron output with softplus.
Histone features binned at 100bp
Trained with a dropout rate of 0.5 and a mini-batch size of 100
They performed an iterative feature selection procedure that selected the highest contributing histone features to predictive performance
- These histone marks differed across different cell types.
- With feature selection, training and testing time decreased
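
To make the architecture above concrete, here is a minimal forward-pass sketch in plain Python. It only mirrors what is listed here: three hidden layers, dropout of 0.5 during training, and a single softplus output. The hidden layer sizes, the ReLU hidden activation, the weight initialisation, and all function names are assumptions for illustration, not details taken from the paper, and training (mini-batches, loss, optimizer) is left out.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    # Hidden activation; ReLU is an assumption, not stated in the summary.
    return x if x > 0.0 else 0.0


def init_layer(n_in, n_out, rng):
    # Small uniform weights and zero biases (initialisation scheme is an assumption).
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(n_features, hidden_sizes, seed=0):
    # One (weights, biases) pair per layer; the final layer has a single output neuron.
    rng = random.Random(seed)
    sizes = [n_features] + list(hidden_sizes) + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def dense(inputs, layer):
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def dropout(values, rate, rng):
    # Inverted dropout: zero each unit with probability `rate`, rescale the survivors.
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def forward(features, layers, rng=None, dropout_rate=0.5):
    # Dropout is applied only when an rng is supplied, i.e. during training.
    activations = list(features)
    for layer in layers[:-1]:
        activations = [relu(z) for z in dense(activations, layer)]
        if rng is not None:
            activations = dropout(activations, dropout_rate, rng)
    (score,) = dense(activations, layers[-1])
    return softplus(score)


# Hypothetical sizes: 8 input features, hidden layers of 16, 8 and 4 units.
network = build_network(8, [16, 8, 4], seed=1)
prediction = forward([0.5] * 8, network)                      # inference, no dropout
noisy = forward([0.5] * 8, network, rng=random.Random(2))     # training-style pass
```

Passing no `rng` gives a deterministic inference pass; passing one switches dropout on, which is the usual train/test distinction for dropout.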

General Comments

Contrasts with #61 because it uses raw histone features and not SVM confidence scores in DNN
Uses less heterogeneous data than #20 but is also a simpler model
Uses custom performance metric called "validation rate"
- A prediction counts as a true positive if it is within 2.5 kb of a "true positive marker" and overlaps a DNase hypersensitivity site
Marginally faster than #61 to train
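
For reference, the "validation rate" criterion described above could be computed roughly as follows. This is a sketch under assumptions: predictions and markers are single base-pair positions on one chromosome, DHS regions are half-open `(start, end)` intervals, and the rate is taken to be true positives divided by all predictions. The function names are hypothetical and the paper's exact bookkeeping may differ.

```python
from bisect import bisect_left


def near_marker(position, sorted_markers, max_distance=2500):
    # True if any marker lies within max_distance bp of the position.
    # Only the two markers flanking the position need to be checked.
    i = bisect_left(sorted_markers, position)
    neighbours = sorted_markers[max(i - 1, 0):i + 1]
    return any(abs(position - m) <= max_distance for m in neighbours)


def overlaps_dhs(position, dhs_intervals):
    # dhs_intervals: (start, end) half-open intervals on the same chromosome.
    return any(start <= position < end for start, end in dhs_intervals)


def validation_rate(predictions, markers, dhs_intervals, max_distance=2500):
    # Fraction of predictions that are near a marker AND fall inside a DHS.
    if not predictions:
        return 0.0
    sorted_markers = sorted(markers)
    hits = sum(
        1
        for p in predictions
        if near_marker(p, sorted_markers, max_distance) and overlaps_dhs(p, dhs_intervals)
    )
    return hits / len(predictions)


# Toy coordinates, made up for illustration only.
predictions = [1000, 5000, 20000, 30000]
markers = [2000, 21000]
dhs = [(900, 1100), (19500, 20500), (29900, 30100)]
rate = validation_rate(predictions, markers, dhs)
```

Writing it out this way makes it clear that the metric depends on two tunable choices (the 2.5 kb window and the DHS overlap rule), which is part of why it is hard to compare against other methods.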

I am getting a sense that a couple of things need to happen before deep learning can bring enhancer finding to the next level:

Define predictive features to use and establish a good "feature window"
- Other enhancer predictors use variably sized windows
Determine standard performance metrics
- Very difficult to compare different methods when each uses a different evaluation implementation
  - What is a successful hit?
  - How are successful hits measured against unsuccessful hits?
Reframe optimization goal
- Gold standards are poorly defined
- Perhaps matching enhancers to target genes is a better optimization goal, and one that can be more readily validated and scored at the molecular level
