hall-lab / svtyper

Bayesian genotyper for structural variants
MIT License

if CI is too big, use CI95 #80

Closed brentp closed 6 years ago

brentp commented 6 years ago

when the CI is large, we get a lot of very noisy calls. I just made a quick change and PR for discussion. I will evaluate this in smoove to see how performance changes. I can already see it removes about 20% of calls in my test case. The distribution goes from:

     26 BND
      8 DEL
    347 DUP
    185 INV

to:

     26 BND
      4 DEL
    254 DUP
    184 INV

I think the intuition is obvious, but just in case... when the CI gets too large, the genotyping just picks up a ton of noise and makes bad calls. In fact, I think that, given the slop that's already used, it might be good to try using CI95 exclusively. Any problems with this?

I will evaluate more thoroughly, but wanted to see if @ernfrid or any Hall lab folks (or @ryanlayer) had some thoughts on this.
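The idea in this comment can be sketched as a tiny selection rule. This is a hypothetical illustration, not svtyper's actual code: `choose_interval`, its tuple arguments, and the `max_ci_dist=200` default are all assumptions made up for this sketch (the thread discusses both 200 and 40 as cutoffs).

```python
def choose_interval(ci, ci95, max_ci_dist=200):
    """Pick which confidence interval to use when gathering read support.

    ci, ci95: (left, right) offsets, as in the VCF CIPOS/CIPOS95 INFO
    fields. When the full CI span exceeds max_ci_dist, fall back to the
    tighter 95% interval so that distant reads are not accidentally
    counted as support for the variant.
    """
    span = ci[1] - ci[0]
    if span > max_ci_dist:
        return ci95
    return ci

# a breakend with a sloppy CIPOS but a tight CIPOS95 falls back to CI95
print(choose_interval((-500, 500), (-20, 20)))  # -> (-20, 20)
# a breakend whose full CI is already tight keeps it
print(choose_interval((-30, 30), (-20, 20)))    # -> (-30, 30)
```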

ernfrid commented 6 years ago

Yes, that intuition makes sense to me as well, but I'll check with other Hall lab folks too.

Are the calls removed likely to be false positives based on your test set?

brentp commented 6 years ago

Yes, it appears so. I also lowered the number from 200 to 40 and then I get:

     26 BND
      4 DEL
    251 DUP
    181 INV

So, it seems the results don't change much with a lower cutoff--this is good; it might mean that we just have to avoid very large CIs to remove bad calls, and the exact cutoff is not that important.

I will do some evaluations on a truth-set and verify.

ernfrid commented 6 years ago

I ran a few tests on this as well and also see it primarily affecting DUPs. I'm not seeing as drastic a decrease as you are reporting here though. Are you able to comment on your test set? I'm curious how it differs from what I'm testing against (NA12878 illumina HiSeq X data aligned with Speedseq to build37).

brentp commented 6 years ago

You should see a change in speed though, no?

I am using data from the Ashkenazi trio kid from 10X data aligned with minimap2 to build 37. That combination seems to lead to more noise.

ernfrid commented 6 years ago

I'm re-running tests on a very large input file based on build 38 instead of 37. I'd neglected to time my tests the first go round, but am doing so now.

brentp commented 6 years ago

I actually ran a test on a huge cohort where I just use CI95 in all cases, never CIPOS, CIEND. The number of autosomal variants went from 34722 to 34692 and the number of apparent denovos dropped a bit, the number of rarish*, transmitted variants went from 8970 to 9098. So the differences are very small, but I think it's better to err on the side of having smaller intervals instead of grabbing too many reads that incidentally appear to support a variant.

ernfrid commented 6 years ago

Good to know Brent. I'll post my tests up here later, but I'm also seeing minor changes looking at NA12878 a few different ways and using your current pull request.

I've talked to other Hall Lab folks and there aren't any objections to doing this. I may make it a flag or option, but let's get this moving forwards.

brentp commented 6 years ago

sounds good.

it could be a flag like --max-ci-dist: if you set it to 0, it will always use CI95, and if it's 40, it will use CI95 as soon as the span is > 40.

and it can default to 1e10 or something.
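The flag semantics proposed above could look something like the sketch below. To be clear, this is a hypothetical mock-up: the option name `--max-ci-dist`, the `pick_ci` helper, and the wiring are assumptions for illustration, and the real implementation in the PR may differ.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--max-ci-dist", type=float, default=1e10,
    help="use CI95 whenever the CI span exceeds this value "
         "(0 means effectively always use CI95; the huge default "
         "means never fall back)")

def pick_ci(ci, ci95, max_ci_dist):
    span = ci[1] - ci[0]
    # span > 0 holds for any non-degenerate interval, so a threshold of 0
    # forces CI95 everywhere; otherwise fall back only on wide intervals
    if span > max_ci_dist:
        return ci95
    return ci

args = parser.parse_args([])  # defaults: never falls back to CI95
print(pick_ci((-300, 300), (-25, 25), args.max_ci_dist))  # -> (-300, 300)
print(pick_ci((-300, 300), (-25, 25), 0))                 # -> (-25, 25)
print(pick_ci((-300, 300), (-25, 25), 40))                # -> (-25, 25)
```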

ernfrid commented 6 years ago

I've run 2 tests of this idea on NA12878 using our standard evaluations for sensitivity and precision. Sensitivity is calculated by comparing the resulting calls to 1000 Genomes variants detectable by non-read-depth-based methods, and precision by performing "long read validation" with PacBio and Moleculo data (virtually identically to the procedure described in the LUMPY paper).

I ran these tests on both raw lumpy output and on a regenotyping, after lmerge, of a small set of samples.

1) Lumpy output (post genotyping) from NA12878 (build37)

svtyper v0.6.1

1KG Sensitivity - 82.02%

Precision

BND - 313/1106 - 28.30%
DEL - 2403/3097 - 77.59%
DUP - 126/447 - 28.19%
INV - 16/35 - 45.71%

svtyper v0.6.1 (nobigci)

1KG Sensitivity - 82.02%

Precision

BND - 313/1103 - 28.38%
DEL - 2403/3093 - 77.69%
DUP - 126/444 - 28.38%
INV - 16/35 - 45.71%

2) A merged cohort-level VCF from 12 samples (Coriell trios; build37) svtyper v0.6.1

1KG Sensitivity - 88.14%

Precision

BND - 407/1690 - 24.08%
DEL - 2745/3890 - 70.57%
DUP - 200/886 - 22.57%
INV - 49/101 - 48.51%

svtyper v0.6.1 (nobigci)

1KG Sensitivity - 88.14%

Precision

BND - 406/1679 - 24.18%
DEL - 2742/3878 - 70.71%
DUP - 200/879 - 22.75%
INV - 49/101 - 48.51%

In terms of timing, I only timed a much larger regenotyping set and saw that the new code is slightly slower (622m vs 617m). Merged sets are probably not the best test for this though as most of the sites reach single basepair resolution.

In summary, I'm seeing really minor changes and it makes sense to me to include an option. I think you already saw the proposed code changes in #82. Tests pass for that code, but I haven't constructed a test of the new functionality yet.

brentp commented 6 years ago

thanks for looking into it! what you found seems reasonable, but I don't see how the new code could be slower.

ernfrid commented 6 years ago

I wouldn't read too much into it, as it wasn't a rigorous comparison. The runs happened at different times, on different machines on our cluster, and thus were not isolated from differences in network disk load and/or other jobs running on those machines.

ernfrid commented 6 years ago

Superseded by #82