hall-lab / svtyper

Bayesian genotyper for structural variants
MIT License

Best FORMAT field for filtering SVTyper Output #95

Open andrewSharo opened 6 years ago

andrewSharo commented 6 years ago

Hi Dave, I'm running SVTyper on Lumpy output for hg38 whole-genome samples at 50x depth. I'm interested in finding deletions and other SVs with a low false-positive rate, for personal genome diagnostics. SVTyper reports about 10,000 SVs with a 0/1 or 1/1 genotype. It gives a number of helpful values (RO, AO, QR, QA, AB, etc.), but which of these fields do you recommend for further filtering? I would like to find SVs that are supported by coverage changes. Best, Andrew

ernfrid commented 6 years ago

My experience is generally with large cohorts as opposed to single samples. Perhaps @brentp would have some advice on what the Quinlan lab is doing for individuals or small cohorts. One recommendation I can definitely make is to use the smoove wrapper for Lumpy and svtyper. It reduces the false-positive rate significantly by performing stringent filtering on the inputs to Lumpy.

For large cohorts, we typically filter on mean sample quality (SQ). I would expect filtering on SQ to work similarly well for an individual; a cutoff of ~100 might be a good place to start, but you'll likely need to tune it a bit. Additionally, since you're primarily interested in coverage changes, I would annotate your candidate SVs with the copy number of the call region. We typically do this by running cnvnator and then annotating the copy number with the svtools copynumber command. See the copy number annotation section of the svtools tutorial. Brent also recently released a new tool called duphold, which I haven't tried out but which may be useful to you.
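As a rough sketch of the per-sample SQ filter suggested above: the snippet below keeps only records where one sample is genotyped 0/1 or 1/1 and its SQ FORMAT value meets a cutoff. It parses plain VCF text with the standard library only. The function names and the example records are made up for illustration, and the default cutoff of 100 is just the starting point mentioned above, not a validated threshold.

```python
def sample_format(line, sample_index=0):
    """Return the FORMAT key/value pairs for one sample of a VCF data line."""
    cols = line.rstrip("\n").split("\t")
    keys = cols[8].split(":")                     # column 9 is FORMAT
    values = cols[9 + sample_index].split(":")    # sample columns follow FORMAT
    return dict(zip(keys, values))


def passes_sq(line, min_sq=100.0, sample_index=0):
    """True if the sample is 0/1 or 1/1 and its SQ is at least min_sq."""
    fmt = sample_format(line, sample_index)
    if fmt.get("GT", "./.") not in ("0/1", "1/1"):
        return False
    try:
        return float(fmt.get("SQ", ".")) >= min_sq
    except ValueError:
        # SQ missing or "." means there is no quality to filter on
        return False


def filter_vcf_lines(lines, min_sq=100.0, sample_index=0):
    """Yield header lines unchanged, plus data lines that pass the SQ filter."""
    for line in lines:
        if line.startswith("#") or passes_sq(line, min_sq, sample_index):
            yield line


# Made-up records for illustration only (not real SVTyper output).
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1"
REC_HI = "chr1\t1000\t1\tN\t<DEL>\t250.5\t.\tSVTYPE=DEL;END=2000\tGT:GQ:SQ:AO:RO\t0/1:200:250.5:12:15"
REC_LO = "chr1\t5000\t2\tN\t<DEL>\t12.1\t.\tSVTYPE=DEL;END=5600\tGT:GQ:SQ:AO:RO\t0/1:12:12.1:2:30"
REC_REF = "chr1\t9000\t3\tN\t<DEL>\t0.0\t.\tSVTYPE=DEL;END=9800\tGT:GQ:SQ:AO:RO\t0/0:150:0.0:0:40"

kept = list(filter_vcf_lines([HEADER, REC_HI, REC_LO, REC_REF]))
```

To apply it to a real file, pass an open file handle as `lines` and write the yielded lines to a new VCF. Rerunning with a few different `min_sq` values is a simple way to do the tuning mentioned above.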

brentp commented 6 years ago

thanks for mentioning smoove and duphold @ernfrid

smoove has an annotate sub-command that will help you prioritize variants. It adds an SHQ (smoove het-quality) to the FORMAT fields and MSHQ (mean ...) to the INFO. You can see more about this in the README.
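A minimal sketch of how the MSHQ INFO value could be used to prioritize calls, using only the Python standard library. It assumes that a higher MSHQ means a better-supported variant, which should be checked against the smoove README; the function names and the example records are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to None."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else None
    return out


def mshq(line):
    """Return MSHQ as a float, or None if the record has no usable value."""
    info = parse_info(line.rstrip("\n").split("\t")[7])  # column 8 is INFO
    try:
        return float(info["MSHQ"])
    except (KeyError, TypeError, ValueError):
        return None


def rank_by_mshq(lines):
    """Return data lines that carry MSHQ, highest value first (assumed best)."""
    scored = [(mshq(l), l) for l in lines if not l.startswith("#")]
    scored = [(s, l) for s, l in scored if s is not None]
    # sorted() is stable, so ties keep their original file order
    return [l for s, l in sorted(scored, key=lambda p: p[0], reverse=True)]


# Made-up records for illustration only (not real smoove output).
REC_A = "chr1\t1000\t1\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=2000;MSHQ=2\tGT\t0/1"
REC_B = "chr1\t5000\t2\tN\t<DEL>\t.\t.\tSVTYPE=DEL;END=5600;MSHQ=4\tGT\t0/1"
REC_C = "chr1\t9000\t3\tN\t<DUP>\t.\t.\tSVTYPE=DUP;END=9800\tGT\t0/1"

ranked = rank_by_mshq([REC_A, REC_B, REC_C])
```

Records without an MSHQ value are dropped from the ranking here rather than guessed at; whether to keep them is a judgment call for the analysis.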

In the next month there should be more improvements to smoove that reduce false positives by incorporating duphold and a few other tricks.