greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

DeepSEA: Predicting effects of noncoding variants with deep learning–based sequence model #13

Closed - cgreene closed this issue 8 years ago

cgreene commented 8 years ago

https://dx.doi.org/10.1038/nmeth.3547

cgreene commented 8 years ago

A recent preprint benchmarks methods against DeepSEA: http://dx.doi.org/10.1101/069682

Worth noting that in their TF binding site eval (Supplementary Figure 2), DeepSEA is still the top-performing method. It's also nice to see this from an independent study.

agitter commented 8 years ago

Cross-referencing this with #83. That issue is currently closed but could be reopened if we want to use it.

cgreene commented 8 years ago

@agitter : Sorry for the failed cross-ref - didn't even realize we had that paper already. Seems like we may want to discuss these two together, since it might get at whether or not deep learning is transformational...

akundaje commented 8 years ago

@cgreene What's the negative set they used for the TFBS prediction? It's entirely unclear from reading the methods. Also, was the evaluation done on held-out chromosomes not used in training? E.g., DeepSEA holds out chr8 and chr9 and trains on all other chromosomes for all data types. If they are evaluating performance on sites in the training chromosomes, the results are going to be badly inflated. These benchmark comparisons are generally very poorly done and very poorly described. And of course, once again, only auROC is reported. I would not consider this a reasonable comparative evaluation by any measure.
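
For readers following along, here is a minimal sketch of the chromosome-level holdout being described (the file name and column names are hypothetical, not DeepSEA's actual pipeline):

```python
import pandas as pd

# Hypothetical table of labeled genomic windows; the file name and the
# 'chrom' column are illustrative, not DeepSEA's actual data layout.
windows = pd.read_csv("labeled_windows.tsv", sep="\t")

# DeepSEA-style split: hold out whole chromosomes (chr8 and chr9), so no
# test window comes from a chromosome the model saw during training.
test_chroms = {"chr8", "chr9"}
train = windows[~windows["chrom"].isin(test_chroms)]
test = windows[windows["chrom"].isin(test_chroms)]

# Evaluating instead on sites from the training chromosomes would let a
# model benefit from memorized local sequence, inflating reported scores.
```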

cgreene commented 8 years ago

@akundaje : The description isn't sufficient to determine how this evaluation was done. A quick e-mail to the authors might clarify.

cgreene commented 8 years ago

@akundaje : Worth noting that the auROC they report is in line with the DeepSEA publication: "We found that DeepSEA predicted chromatin features with high accuracy, including TF binding sites, for which the median area under the curve (AUC) was 0.958." This suggests to me that they retained the same evaluation (chr8 and chr9 held out) or that there wasn't much overfitting.

cgreene commented 8 years ago

[Caveats about the desirability of auROC still apply, but we have to evaluate what we actually have.]

gokceneraslan commented 8 years ago

In the multilabel/multitask setting, the negative set for one TF consists of the binding sites of all the others, so I think it's quite clear. You can look at the Torch tensor that they provide for more statistics on that.
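
A toy illustration of that convention (the label matrix below is invented): each column is one TF, and the negatives for a given TF are simply the windows labeled 0 in its column, including windows bound by other TFs.

```python
import numpy as np

# Toy multilabel target matrix: rows are genomic windows, columns are TFs.
# y[i, j] == 1 means window i overlaps a ChIP-seq peak for TF j.
y = np.array([
    [1, 0, 0],  # bound by TF A only
    [0, 1, 1],  # bound by TFs B and C
    [0, 0, 0],  # bound by none of the three
])

# For TF A (column 0), the negatives are every other window -- including
# the one bound by TFs B and C.
tf_a_positives = np.flatnonzero(y[:, 0] == 1)
tf_a_negatives = np.flatnonzero(y[:, 0] == 0)
```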

gokceneraslan commented 8 years ago

Ah, OK. I thought this was regarding DeepSEA; apparently it's about LINSIGHT.

cgreene commented 8 years ago

Yifei Huang replied to my e-mail with a helpful summary of the DeepSEA evaluation in the LINSIGHT paper:

We used all autosomes in our comparisons. I personally think DeepSEA is unlikely to overfit in our comparisons, since we used the DeepSEA functional significance score which was not trained using known TFs or disease variants. The DeepSEA functional significance score aggregated tissue-specific DeepSEA scores using polymorphism data and can be viewed as an indirect measurement of natural selection. Note that in the original DeepSEA paper, sometimes they trained meta-scores using known disease/eQTL variants and these meta-scores might overfit.

akundaje commented 8 years ago

DeepSEA models are trained on TF ChIP-seq data, so I'm not sure what this means. Also, I was specifically referring to the TF prediction task that they evaluate, not the variant scoring task. Anyway, I also posted comments on bioRxiv.


cgreene commented 8 years ago

@akundaje : Agree that potential for overfitting exists for the TF eval. However, the TF eval that they do gives similar performance to the DeepSEA paper's TF eval, IIRC (~0.96). To me that suggests little overfitting, since they didn't hold out chromosomes but DeepSEA did. Did your evals show DeepSEA overfitting when evaluated on all chromosomes? Sorry for brevity - posting between meetings.

akundaje commented 8 years ago

We haven't explicitly replicated the DeepSEA model, but, for instance, the Basset model has much stronger performance (in terms of auPRC) on the training set than on the validation or test sets. Validation and test set performances are similar, but training performance is often much higher. auROCs always look much closer across training, validation, and test because they are all inflated and in the 0.9 range; the auPRCs can diverge a lot. I don't know what the training set performance was for DeepSEA, but I expect it to be much better (in terms of auPRC) than on the validation and test sets.
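
The auROC/auPRC gap is easy to reproduce on synthetic rare-positive data (the class ratio and score distribution below are made up to mimic genome-wide TFBS labels, not taken from Basset or DeepSEA):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Rare-positive setting: ~1% positives, as in genome-wide TFBS labels.
n = 100_000
y_true = (rng.random(n) < 0.01).astype(int)

# Moderately informative scores: positives shifted up by 1.5 units.
scores = rng.normal(loc=1.5 * y_true, scale=1.0)

print("auROC:", roc_auc_score(y_true, scores))            # ~0.85, looks strong
print("auPRC:", average_precision_score(y_true, scores))  # far lower at 1% positives
```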


cgreene commented 8 years ago

Totally agree that auPRC would be more likely to diverge than auROC. It would be great to have those figures for all of these methods.

cgreene commented 8 years ago

This one is getting lots of discussion, so we should probably cover it - tagged for 'study'. The conversation here makes it clear to me that we also need at least a short section on evaluation. If we can steer some people away from AUC in cases where it's not well suited, that would be a huge win. Not sure whether that should go in 'study' or a more general area. Opened #109 to make sure this discussion makes it into our paper.