cgreene closed this 8 years ago
Biology: The stated aim is to identify pathogenic variants.
Computational Methods
Results: "We also generated ROC curves showing the models discriminating pathogenic mutations defined by the ClinVar database (Baker, 2012) from likely benign Exome Sequencing Project (ESP; Fu et al., 2013) alleles with a derived allele frequency (DAF) ≥5% (Fig. 1b, n = 10 000 pathogenic/10 000 likely benign)." Is this exactly the same model as the one trained on observed vs. simulated variants? Do pathogenic variants look more like simulated variants or more like observed ones?
Summary: Predicting pathogenic variants is clearly important for our overall question ("What would need to be true for deep learning to transform how we categorize, study, and treat individuals to maintain or restore health?"). Precisely how this was done in the paper remains a bit confusing to me, particularly whether the pathogenic model is the same as the observed/simulated model. I also have some relatively minor concerns that the training/testing split may bias the performance estimates. Definitely consider for inclusion given its relevance to a major topic, though the caveats may be important to discuss.
Model: The supplementary materials (https://cbcl.ics.uci.edu/public_data/DANN/readme) indicate that the ClinVar/ESP (pathogenic/benign) set is for testing, not training. I therefore suspect that they used the model trained on observed/simulated data to classify the ClinVar/ESP data. There is mention of reusing the observed/simulated test set to combat overfitting, but it is not entirely clear whether the ClinVar/ESP test set was used in the same way.
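To make the suspected setup concrete, here is a minimal sketch of scoring a pathogenic/benign test set with a model fit only on simulated/observed labels, then summarizing the result as a ROC AUC. This is not the paper's method: the one-feature "model" (`fit_centroid_scorer`), the Gaussian toy data, and every name below are hypothetical stand-ins, and the numbers carry no biological meaning.

```python
import random


def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fit_centroid_scorer(simulated, observed):
    """Hypothetical stand-in for a trained classifier on a single feature.

    The score is the signed distance from the midpoint between the two
    class means, oriented so that 'simulated-like' values score higher.
    """
    mean_sim = sum(simulated) / len(simulated)
    mean_obs = sum(observed) / len(observed)
    midpoint = (mean_sim + mean_obs) / 2.0
    direction = 1.0 if mean_sim >= mean_obs else -1.0
    return lambda x: direction * (x - midpoint)


rng = random.Random(0)

# Toy training data (assumption): simulated variants sit higher on the feature.
simulated = [rng.gauss(1.0, 1.0) for _ in range(500)]
observed = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Toy test data (assumption): a separate pathogenic/benign set, never used for fitting.
pathogenic = [rng.gauss(1.5, 1.0) for _ in range(200)]
benign = [rng.gauss(0.0, 1.0) for _ in range(200)]

score = fit_centroid_scorer(simulated, observed)

# The model never saw pathogenic/benign labels; only its scores are reused.
auc_transfer = roc_auc([score(x) for x in pathogenic], [score(x) for x in benign])
print(f"AUC on the held-out pathogenic/benign toy set: {auc_transfer:.3f}")
```

In this toy setup the transfer only works because the pathogenic examples were constructed to resemble the simulated class, which is exactly the open question above about whether pathogenic variants look more like simulated or observed ones.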
@evancofer - nice catch!
This is an interesting paper. I've labeled it for the 'study' component. It's not receiving more discussion at this point so I've closed it. We're now using 'open' papers only for items undergoing active discussion.
I'm re-reading this to write about data simulation for the discussion. As far as I can tell, they are not doing anything new for the simulated data. It appears to come from the CADD paper. It's still worth discussing.
Paper needs to be read carefully for relevance: https://dx.doi.org/10.1093/bioinformatics/btu703