DANN: a deep learning approach for annotating the pathogenicity of genetic variants

cgreene commented 8 years ago

Paper needs to be read carefully for relevance: https://dx.doi.org/10.1093/bioinformatics/btu703

cgreene commented 8 years ago

Biology: The stated aim is to identify pathogenic variants.

Computational Methods

Three hidden layers of 1000 nodes each
Input features are precomputed (949 per variant)
Trained on both observed and simulated variants; the training objective is to differentiate simulated from observed.
A bit concerned about the evaluation protocol: models are fit on the training set, the validation set is used to select the 'best' model, and models are also regularly evaluated on the testing set to monitor for overfitting (Section 2.3). Does this potentially reduce our confidence in subsequent evaluations?
Comparison is to two linear methods only. No non-linear kernel SVMs, for instance.
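
For a sense of scale, the layer sizes listed above imply roughly three million trainable parameters. The sketch below only does that arithmetic. The single output unit and the per-node bias terms are my assumptions, not details taken from the notes above.

```python
# Layer widths: 949 input features, then three hidden layers of 1000 nodes.
# ASSUMPTION: one output unit (observed vs. simulated) and one bias per node;
# neither is stated in the notes above.
layer_sizes = [949, 1000, 1000, 1000, 1]


def count_parameters(sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        total += n_out         # bias vector of the receiving layer
    return total


n_params = count_parameters(layer_sizes)
print(n_params)  # 2953001
```

Most of the parameters sit in the two 1000 x 1000 hidden-to-hidden matrices, so the count is not very sensitive to the exact number of input features.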

Results: "We also generated ROC curves showing the models discriminating pathogenic mutations defined by the ClinVar database (Baker, 2012) from likely benign Exome Sequencing Project (ESP; Fu et al., 2013) alleles with a derived allele frequency (DAF) ≥5% (Fig. 1b, n = 10 000 pathogenic/10 000 likely benign)." Is this the exact same model as the one trained on observed vs. simulated variants? Do pathogenic variants look more like simulated variants or more like observed ones?
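
If the observed-vs-simulated model is simply applied to the ClinVar/ESP variants, the ROC analysis comes down to how well its scores rank pathogenic above likely benign variants. Below is a minimal sketch of that ranking view of AUC. The scores are made up, the names `roc_auc`, `pathogenic_scores` and `benign_scores` are hypothetical, and treating a higher score as "more simulated-like, hence more pathogenic-like" is an assumption rather than something confirmed in this thread.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# HYPOTHETICAL model outputs, not data from the paper.
pathogenic_scores = [0.9, 0.8, 0.4]  # stand-ins for ClinVar variants
benign_scores = [0.7, 0.3, 0.2]      # stand-ins for ESP alleles

auc = roc_auc(pathogenic_scores, benign_scores)
print(round(auc, 3))  # 0.889 (8 of the 9 pairs are ranked correctly)
```

An AUC well above 0.5 under this convention would suggest pathogenic variants score more like simulated variants; an AUC below 0.5 would suggest the opposite.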

Summary: Predicting pathogenic variants is clearly important for our overall question ("What would need to be true for deep learning to transform how we categorize, study, and treat individuals to maintain or restore health?"). Right now, exactly how this was done in the paper remains a bit confusing to me, particularly whether the pathogenic model is the same as the observed/simulated model. I also have some relatively minor concerns about the performance estimates because of how the training/testing split was used. Definitely consider for inclusion given the major topic relevance, though the caveats may be important to discuss.
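
To illustrate why repeated looks at a test set can matter, here is a toy simulation. It is a deliberately pessimistic sketch, not a claim about what happened in the paper: every candidate "model" is a coin flip with a true accuracy of 0.5, and the test set is (hypothetically) used to pick the best one. `N_TEST`, `N_MODELS` and the helper name are all invented for the example.

```python
import random

random.seed(0)

N_TEST = 200    # hypothetical test-set size
N_MODELS = 50   # hypothetical number of candidate models or checkpoints


def accuracy_of_random_guesser(n_examples):
    """Accuracy of a coin-flip classifier on n_examples binary labels."""
    correct = sum(random.random() < 0.5 for _ in range(n_examples))
    return correct / n_examples


# Every candidate has a true accuracy of 0.5 by construction.
test_accuracies = [accuracy_of_random_guesser(N_TEST) for _ in range(N_MODELS)]

best_on_test = max(test_accuracies)              # what reuse of the test set reports
mean_on_test = sum(test_accuracies) / N_MODELS   # typical candidate
fresh_accuracy = accuracy_of_random_guesser(N_TEST)  # the "winner" on new data

print(f"best on reused test set: {best_on_test:.3f}")
print(f"mean over candidates:    {mean_on_test:.3f}")
print(f"winner on fresh data:    {fresh_accuracy:.3f}")
```

The best score on the reused test set is at least the candidate average by construction, while the fresh-data score is just another draw around 0.5. If the test set is only monitored and never influences any choice, this optimism does not arise, which is why the concern above is a minor one.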

evancofer commented 8 years ago

Model: The supplementary materials (https://cbcl.ics.uci.edu/public_data/DANN/readme) indicate that the ClinVar/ESP (pathogenic/benign) set is for testing, not training. I therefore suspect that they used the model trained on observed/simulated data to classify the ClinVar/ESP data. There is mention of reusing the observed/simulated test set to combat overfitting, but it is not entirely clear whether the ClinVar/ESP test set was used in the same way.

cgreene commented 8 years ago

@evancofer - nice catch!

cgreene commented 8 years ago

This is an interesting paper. I've labeled it for the 'study' component. It's not receiving more discussion at this point so I've closed it. We're now using 'open' papers only for items undergoing active discussion.

agitter commented 7 years ago

I'm re-reading this to write about data simulation for the discussion. As far as I can tell, they are not doing anything new for the simulated data. It appears to come from the CADD paper. It's still worth discussing.
