greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

DEEP: a general computational framework for predicting enhancers #61

Closed michaelmhoffman closed 6 years ago

michaelmhoffman commented 8 years ago

Transcription regulation in multicellular eukaryotes is orchestrated by a number of DNA functional elements located at gene regulatory regions. Some regulatory regions (e.g. enhancers) are located far away from the gene they affect. Identification of distal regulatory elements is a challenge for the bioinformatics research. Although existing methodologies increased the number of computationally predicted enhancers, performance inconsistency of computational models across different cell-lines, class imbalance within the learning sets and ad hoc rules for selecting enhancer candidates for supervised learning, are some key questions that require further examination. In this study we developed DEEP, a novel ensemble prediction framework. DEEP integrates three components with diverse characteristics that streamline the analysis of enhancer's properties in a great variety of cellular conditions. In our method we train many individual classification models that we combine to classify DNA regions as enhancers or non-enhancers. DEEP uses features derived from histone modification marks or attributes coming from sequence characteristics. Experimental results indicate that DEEP performs better than four state-of-the-art methods on the ENCODE data. We report the first computational enhancer prediction results on FANTOM5 data where DEEP achieves 90.2% accuracy and 90% geometric mean (GM) of specificity and sensitivity across 36 different tissues. We further present results derived using in vivo-derived enhancer data from VISTA database. DEEP-VISTA, when tested on an independent test set, achieved GM of 80.1% and accuracy of 89.64%. DEEP framework is publicly available at http://cbrc.kaust.edu.sa/deep/.

http://doi.org/10.1093/nar/gku1058

gwaybio commented 8 years ago

Goals

Use histone modifications and/or DNA sequence elements to predict enhancers in ENCODE, FANTOM5, and VISTA data

Biology

Computational aspects

Strengths

General comments and main concerns

I am struggling to grasp what the neural network is actually doing in this paper. The confidence scores from each ensemble model are the raw input to the neural network. One could imagine that the decision function learned from this input would have some neurons that aggregate poorly performing classifiers and others that amplify well-performing ones. Is there anything else it could capture?
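To make the intuition above concrete, here is a minimal sketch, not the paper's architecture: a single logistic neuron trained on confidence scores from two hypothetical base models, one informative and one pure noise. The toy data, the function name `train_meta_neuron`, and the hyperparameters are all assumptions for illustration. If the intuition holds, the learned weight for the informative model should dominate.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_neuron(scores, labels, epochs=500, lr=0.5):
    """Fit one logistic neuron on base-model confidence scores
    using plain batch gradient descent on the log loss."""
    n_models = len(scores[0])
    n = len(scores)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(scores, labels):
            p = sigmoid(sum(w * s for w, s in zip(weights, row)) + bias)
            err = p - y  # derivative of the log loss w.r.t. the pre-activation
            for j, s in enumerate(row):
                grad_w[j] += err * s
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (hypothetical): model 0 tracks the label, model 1 is noise.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
scores = []
for y in labels:
    good = (1.0 if y else -1.0) + rng.gauss(0, 0.5)  # signed confidence, informative
    noisy = rng.gauss(0, 1.0)                        # unrelated to the label
    scores.append([good, noisy])

weights, bias = train_meta_neuron(scores, labels)
print("learned weights:", weights)
```

A single neuron like this can only reweight the base models. Hidden layers could in principle also capture interactions between them (for example, trusting one model only when another agrees), but whether DEEP's network does so is not something the paper lets me verify.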

The authors compare this approach to a MUCH simpler majority voting technique in Supplementary Figure 1 but do not take the discussion any further. While it is an interesting idea, I can imagine that it would be a pain to implement at test time. Because the network topology expects a certain confidence score to be associated with a given raw input, careful consideration is needed to ensure that the data splits for each ensemble model are exactly the same.

I am not sure if we can classify this paper as deep learning.

Their treatment of cross validation is interesting, but the rationale is not described well enough. For each SVM they train on 20% of the data and evaluate performance on the remaining 80%. To me, this sounds like an ensemble of weak learners, which could be good while greatly reducing training time. Instead of feeding confidence scores into a NN, could they use weighted majority voting (weighted by evaluation performance)?
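For reference, the weighted majority voting alternative suggested above could look roughly like the sketch below. It assumes each model's weight is simply its accuracy on held-out data. That is one possible reading of "weighted by evaluation performance", not something taken from the paper, and the predictions are made-up toy values.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def weighted_majority_vote(votes, weights):
    """Return 1 if the weighted share of positive votes exceeds one half."""
    total = sum(weights)
    positive = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if positive > total / 2 else 0


def majority_vote(votes):
    """Unweighted majority vote over 0/1 predictions."""
    return 1 if sum(votes) > len(votes) / 2 else 0


# Hypothetical held-out labels and per-model predictions on them.
held_out_labels = [1, 0, 1, 1, 0, 0, 1, 0]
held_out_preds = [
    [1, 0, 1, 1, 0, 0, 1, 0],  # strong model
    [0, 1, 1, 0, 0, 0, 1, 1],  # chance-level model
    [0, 1, 1, 0, 0, 1, 1, 1],  # worse-than-chance model
]

# Weight each model by its held-out accuracy (an assumption, see above).
weights = [accuracy(p, held_out_labels) for p in held_out_preds]

# A new region where only the strong model votes "enhancer".
new_votes = [1, 0, 0]
weighted = weighted_majority_vote(new_votes, weights)
plain = majority_vote(new_votes)
print(weights, weighted, plain)
```

In this toy case the strong model outvotes the two weak ones under weighting, whereas plain majority voting sides with the weak models. Unlike the NN combiner, this has no trainable parameters beyond the per-model weights, so it would be cheaper to fit and easier to reason about at test time.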