mroosmalen / nanosv

SV caller for nanopore data
MIT License
90 stars 22 forks

RF filtering model availability #46

Closed MaestSi closed 6 years ago

MaestSi commented 6 years ago

Dear NanoSV developers, I recently read with interest your Nature Comm. publication "https://www.nature.com/articles/s41467-017-01343-4". I would like to know if the post-calling filtering random forest model is built-in in NanoSV package, or if it is available as a supplementary software.

Moreover, I would like to ask for a clarification about the sentences: "We obtained validation status for 274 SVs, regardless of the random forest prediction outcome, for Patient1, and 77 SVs predicted as true by the random forest for Patient 2. Based on these sets, we obtained precision estimates of 95 and 96% for Patient 1 and Patient 2 and a sensitivity estimate of 72% for Patient 1". It is not clear to me whether, for the Patient 1 precision and sensitivity estimates, you are also taking the random forest filtering into account (I think so), which would probably result in higher precision but lower sensitivity, and why a sensitivity estimate for Patient 2 is not reported (perhaps because no extensive orthogonal validation was performed?). Based on the reported information, it seems to me that for Patient 1 you are taking the random forest filtering into account, otherwise the precision would be 185/274, if I understood correctly. Maybe you estimated the sensitivity for Patient 1 based on the 40 SVs verified by orthogonal breakpoint assays? That would mean that, after RF model filtering, about 29 of the reported SVs were retained. Is that right?

Thank you very much

mroosmalen commented 6 years ago

We are planning to make some supplementary software in the future, including a random forest model.

For Patient1 we selected 274 random SVs (140 predicted as TRUE (PT) and 134 predicted as FALSE (PF) by the random forest). Out of the 140 PT, 133 were validated as TRUE (PTVT) and 7 as FALSE (PTVF). Out of the 134 PF, 52 were validated as TRUE (PFVT) and 82 as FALSE (PFVF).

The sensitivity is calculated by `PTVT/(PTVT+PFVT) = 133/(133+52)` and the precision by `PTVT/(PTVT+PTVF) = 133/(133+7)`.

For Patient2 we only included PT and not PF for validation, and that is why we did not report the sensitivity.
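As a quick sanity check (this is not part of NanoSV itself, just the two formulas above evaluated directly), the variable names mirror the PTVT/PTVF/PFVT/PFVF labels from the comment:

```python
# Patient1 validation counts, as reported in this thread:
PTVT = 133  # predicted TRUE by the random forest, validated TRUE
PTVF = 7    # predicted TRUE by the random forest, validated FALSE
PFVT = 52   # predicted FALSE by the random forest, validated TRUE
PFVF = 82   # predicted FALSE by the random forest, validated FALSE

sensitivity = PTVT / (PTVT + PFVT)  # 133 / 185
precision = PTVT / (PTVT + PTVF)    # 133 / 140

print(f"sensitivity = {sensitivity:.2%}")  # ~72%
print(f"precision   = {precision:.2%}")    # 95.00%
```

This reproduces the ~72% sensitivity and 95% precision reported for Patient 1 in the paper.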

MaestSi commented 6 years ago

Ok, thank you very much for the kind explanation.

So, just to clarify further, the "validation data [...] of 274 SVs (185 true positives and 89 false positives)" is actually composed of 185 true variants and 89 false variants. The number of true positives is actually 133, the number of false negatives is 52 (which together account for the 185 true variants), the number of true negatives is 82 and the number of false positives is 7 (which together account for the 89 false variants). The phrase "true/false positive" made me think of the result of a prediction, while the 185 vs. 89 split is based on the real presence/absence of the variant, before any software prediction.
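To make that bookkeeping explicit, here is a minimal sketch (the TP/FN/FP/TN labels are my own mapping of the counts from this thread, not terminology from the paper):

```python
# Mapping the 274 validated Patient1 SVs onto confusion-matrix cells:
TP, FN = 133, 52  # true variants: the RF kept 133 and filtered out 52
FP, TN = 7, 82    # false variants: the RF kept 7 and discarded 82

true_variants = TP + FN    # the "185 true positives" in the paper's wording
false_variants = FP + TN   # the "89 false positives"
total = true_variants + false_variants

print(true_variants, false_variants, total)  # 185 89 274
```

So the 185/89 split describes validation status, while TP/FN/FP/TN describe the random forest's predictions against that status.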

I'm looking forward to hearing about new software from you! Best