Biology

The authors want to predict whether a miRNA binds to and regulates a gene. They generate 20 features based on sequence complementarity, binding affinity/accessibility, and conservation scores for miRNA-mRNA pairs found in the TargetScanS and TarBase datasets.

Computational Aspects

They implement a CNN with two convolutional layers, mean pooling, and a kernel size of 3. They use constraint relaxation to overcome class imbalance (in this case, there are more experimentally validated positives than negatives). Their method defines four distinct datasets based on the kind of evidence for each pair and the confidence in the miRNA-mRNA regulation; one of the four is the negative set. The CNN takes the miRNA-mRNA features as input and classifies each pair into one of the four datasets. They validate performance on an experimentally validated test set.
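To make the convolution and mean pooling steps concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: it is 1D for simplicity, the kernel values and the pooling window of 2 are assumptions, and the input is toy data. It only shows what a size-3 kernel followed by mean pooling does to a feature vector.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, the operation CNN layers apply."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def mean_pool1d(signal, size=2):
    """Non-overlapping mean pooling; a trailing partial window is dropped."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(0, len(signal) - size + 1, size)
    ]


features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy input, not real miRNA-mRNA features
kernel = [1.0, 0.0, -1.0]                  # hypothetical size-3 kernel

conv_out = conv1d(features, kernel)        # each output mixes 3 adjacent inputs
pooled = mean_pool1d(conv_out, size=2)     # each output averages 2 adjacent activations
```

Mean pooling replaces each window of activations with its average, so it halves the length here and smooths neighboring values. Both steps assume that adjacent positions in the input are related, which is the concern with applying them to an unordered feature vector.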

Summary

Very nice discussion about history of computational approaches to define miRNA-mRNA pairs.
- Generally, almost every primary article I've read has had a good historical perspective on the given problem.
High amount of domain knowledge required to engineer features
- This means that if a user wanted to apply MiRTDL they would likely need to first define specific features for their miRNA or mRNA of interest
- This could severely impact usability of the algorithm
The limitations of this paper are discussed in only two very short sentences in the discussion
- No mention of applying CNN to unstructured data (talking points we've already discussed in #79)
- They apply mean pooling - what is this really doing?
- They do mention that 20 features is not enough for a deep learning algorithm
- However, to overcome this it appears that they resample the features to expand the input to 64 features
  - Very unclear exactly what they are doing here
I thought their constraint relaxation approach was interesting and it looks like it improves performance.
- Not sure what it means for a new miRNA or mRNA, but it does seem to overcome class imbalance issues
With only 20 features they were able to perform a very nice analysis of which features impact performance the most.
- They applied their model with a leave-one-out methodology and observed performance differences
- Sequence complementarity was the highest contributing feature
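The feature analysis can be sketched as follows, reading "leave-one-out" as dropping one feature at a time and comparing performance against the full model. This is only an illustration: a nearest-centroid classifier stands in for the CNN, the data are made up, and accuracy is measured on the training data to keep the example short.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def accuracy(X, y, keep):
    """Nearest-centroid accuracy using only the feature indices in `keep`."""
    Xk = [[row[i] for i in keep] for row in X]
    classes = sorted(set(y))
    cents = {
        c: centroid([r for r, lab in zip(Xk, y) if lab == c]) for c in classes
    }

    def predict(row):
        return min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, cents[c])),
        )

    return sum(predict(r) == lab for r, lab in zip(Xk, y)) / len(y)


def feature_ablation(X, y):
    """Return full-model accuracy and the accuracy drop from removing each feature."""
    n = len(X[0])
    full = accuracy(X, y, list(range(n)))
    drops = {
        i: full - accuracy(X, y, [j for j in range(n) if j != i])
        for i in range(n)
    }
    return full, drops


# Toy data: feature 0 separates the classes, features 1 and 2 are uninformative.
X = [
    [0.9, 0.2, 0.5], [0.8, 0.7, 0.4], [0.7, 0.4, 0.6],
    [0.2, 0.3, 0.5], [0.1, 0.6, 0.4], [0.3, 0.5, 0.6],
]
y = [1, 1, 1, 0, 0, 0]

full, drops = feature_ablation(X, y)
most_important = max(drops, key=drops.get)  # feature with the largest drop
```

A large drop when a feature is removed suggests the model relies on it. With only 20 features, rerunning the model once per feature is cheap, which is presumably why this analysis was feasible.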

agitter commented 8 years ago

It's hard to understand their input data from Section 2.4. As @gwaygenomics said, they try to resample the features to get 64, 196, 484, or 900 features. Figure 2 and the text suggest that they treat the 196 features as a 2D input (14x14), but when describing Figure 3d they say the features are a 1D array. This potentially makes the application of a CNN to unstructured data much worse than in #79. In #79 I ultimately think the CNN makes sense given how they constrained the CNN architecture.
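One possible reading of the resampling, sketched below, is that the 20 features are drawn with replacement until the input has the target length and are then reshaped into a square grid. This is a guess, not the paper's documented procedure; `resample_features` and `to_grid` are hypothetical helpers. All four sizes are perfect squares, which is at least consistent with a 2D input.

```python
import math
import random

# Input sizes mentioned above; each one is a perfect square.
sizes = [64, 196, 484, 900]
sides = [math.isqrt(n) for n in sizes]


def resample_features(features, target_len, seed=0):
    """Draw target_len values from features with replacement (assumed scheme)."""
    rng = random.Random(seed)
    picks = [rng.randrange(len(features)) for _ in range(target_len)]
    return picks, [features[i] for i in picks]


def to_grid(values, side):
    """Reshape a flat list into side x side rows (row-major order)."""
    if len(values) != side * side:
        raise ValueError("length must equal side * side")
    return [values[r * side:(r + 1) * side] for r in range(side)]


features = [float(i) for i in range(20)]  # placeholder for the 20 features
picks, expanded = resample_features(features, 196)
grid = to_grid(expanded, 14)

# Which original features end up vertically adjacent depends only on the
# random draw, not on any relationship between the features themselves.
above, below = picks[0], picks[14]
```

Under this reading, the neighbors a 2D kernel sees are set by the sampling order rather than by any structure in the data, which is why the 1D versus 2D ambiguity matters.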

greenelab / deep-review

MiRTDL: a deep learning approach for miRNA target prediction #49
