hall-lab / svtyper

Bayesian genotyper for structural variants
MIT License

Interpreting Genotype Likelihoods of Complex Structural Variants #10

Closed wb223495 closed 9 years ago

wb223495 commented 9 years ago

We have performed WGS at 40x of a number of families and processed structural variants through the speedseq v0.0.3a pipeline.

I was wondering how to interpret genotype-likelihoods (GL) / genotype-quality (GQ) scores, in particular for more complex structural variants than simple deletions, duplications, and inversions?

We identified a few de novo complex structural variants with very poor GL / GQ scores (e.g. 0 GQ), however we went ahead and successfully validated them through PCR.

At the other end of the scale, we have breakpoints that are called in almost everyone and have 0 genotype quality scores, however a few individuals may have GQ scores > 50. These are probably erroneous.

Also, we have complex events with for example two pairs of breakpoints, where one pair has a 0 GQ score, and the other has a reasonable GQ score (>50, n.b. I grouped breakpoints into single events based on their proximity and overlap, e.g. two pairs of breakpoints overlap and one end of each pair are adjacent to each other in opposing orientations).

I suspect it is very challenging to accurately estimate GLs in complex events that have multiple signals at the same locus.

My temptation is to keep anything denovo regardless of GQ score if it involves >1 pair of breakpoints, since the numbers are manageable. I would also keep anything else with >1 pair of breakpoints where at least one pair of breakpoints has a good median GQ score (>50 or >100?) across individuals that have this pair of breakpoints in the cohort. Otherwise filter any events with consistently low or zero GQ scores. Please advise!

cc2qe commented 9 years ago

Thanks for the feedback on validation of these events with PCR.

First, we have made improvements to SVTyper since the speedseq v0.0.3a pipeline. So even if you're using SpeedSeq for BAM and LUMPY data processing, I'd recommend running it through the more recent SVTyper on this github page.

GQ and GL scores are probably not the best metric for evaluating variant quality. Both of these describe the relative likelihood that the genotype is correct compared to other possible genotypes (0/0, 0/1, 1/1). Thus, a variant may have a poor GQ even though it is almost certainly non-reference because SVTyper is unsure whether it is 0/1 or 1/1.

On a per-sample basis, the "sample quality" (SQ) gives the phred-scaled probability that the site is non-reference in that particular sample. (The QUAL score, column 6 of the VCF, is the analogous value that the site is non-reference in ANY sample in the VCF.) SQ is the best metric for distinguishing reference from non-reference sites.
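As a quick illustration of what a phred-scaled SQ means, this sketch converts an SQ value back into the probability that the site is truly non-reference in that sample (the conversion formula is standard phred scaling; the example value is made up):

```python
# SQ is phred-scaled: SQ = -10 * log10(P(site is reference in this sample)),
# so P(non-reference) = 1 - 10^(-SQ / 10).

def prob_nonref(sq):
    """Convert a phred-scaled SQ score to the probability of being non-reference."""
    return 1.0 - 10 ** (-sq / 10.0)

print(round(prob_nonref(50), 6))  # 0.99999
print(round(prob_nonref(10), 6))  # 0.9
```

So an SQ of 50 corresponds to a 1-in-100,000 chance that the sample is actually reference at that site.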

Even SQ may have some problems at complex events however, as these can violate the assumptions of diploidy in the model. If you're particularly interested in these, then you might want to try the current dev branch version of SVTyper, which outputs the "allele balance" (AB) (alternate/total observations) at each site. This allows more sophisticated filtering on raw counts and is agnostic to ploidy and genotyping model.
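The kind of count-based filtering described above could be sketched like this, assuming AB = alternate observations / total observations as defined in the comment (the threshold values here are illustrative, not SVTyper defaults):

```python
# Hedged sketch of allele-balance filtering on raw read counts,
# agnostic to ploidy and genotyping model.

def allele_balance(ref_obs, alt_obs):
    """AB = alternate observations / total observations."""
    total = ref_obs + alt_obs
    return alt_obs / total if total > 0 else 0.0

def passes_ab_filter(ref_obs, alt_obs, min_ab=0.1, min_total=5):
    """Keep a call only with enough total reads and a non-trivial alt fraction.

    min_ab and min_total are illustrative thresholds, not program defaults.
    """
    return (ref_obs + alt_obs) >= min_total and allele_balance(ref_obs, alt_obs) >= min_ab

print(passes_ab_filter(20, 10))  # True  (AB = 0.33)
print(passes_ab_filter(30, 1))   # False (AB = 0.03)
```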

wb223495 commented 9 years ago

Awesome, thanks.

I will use SQ from now on. In the future I will also use more recent versions of SpeedSeq and SVTyper.

wb223495 commented 9 years ago

Hi Colby,

I reran the genotyping using v0.0.2 of SVtyper. The results look much better, and the complex de novo events genotyped well with high SQ scores.

I am genotyping individuals within families. I’d like to combine the results for each family into one file and compute a quality score for each SV locus. I was wondering how this quality score (column 6 in the VCF) is calculated? Could I generate it myself in the combined data without genotyping everyone in our cohort for every variant with SVtyper?

cc2qe commented 9 years ago

QUAL is -10 * log10(P(locus is reference in all samples)). Since the per-sample reference probabilities multiply, this is equal to the sum of the per-sample SQ scores.

If you have SQ scores for each individual at the locus, you can just add them together to get the QUAL score of the variant.
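A minimal sketch of that calculation from a VCF record, assuming the FORMAT column includes an SQ key as in SVTyper output (the sample values below are made up for illustration):

```python
# Recompute the site QUAL as the sum of per-sample SQ values
# taken from the FORMAT and sample columns of one VCF record.

def qual_from_sq(format_field, sample_fields):
    """Sum SQ across samples to recover the site-level QUAL score."""
    keys = format_field.split(":")
    sq_idx = keys.index("SQ")
    return sum(float(s.split(":")[sq_idx]) for s in sample_fields)

# Three hypothetical family members genotyped at one SV locus:
fmt = "GT:SQ:GQ"
samples = ["0/1:45.25:45", "0/0:0.0:99", "1/1:88.5:60"]
print(qual_from_sq(fmt, samples))  # 133.75
```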

Sithara85 commented 6 years ago

Hi Colby,

We are using SpeedSeq 0.1.2 to process whole genomes (e.g. NA12877). We are getting many calls with QUAL = 0. Can we just filter them from the VCF file after processing through the SpeedSeq workflow? What does a QUAL score of zero or near zero mean?

I have seen that for speedseq var / somatic calls there is an option to filter the output by QUAL score using -q. Does that also apply to the SpeedSeq SV pipeline?

Thank you, Sithara

cc2qe commented 6 years ago

Hi Sithara,

The QUAL score is meaningful for variants that have been genotyped with SVTyper (which can be done as an option through SpeedSeq). We have used varying thresholds for QUAL in publications (≥100 in the SpeedSeq paper and ≥20 in the GTEx paper), and it is generally appropriate to filter low quality variants from the VCF file. In general, low quality SVs reflect poor alignments to the reference genome. These may be interpreted by LUMPY as non-reference events (which is why they are included in the VCF), but more sensitive inspection by SVTyper finds no convincing evidence of a true SV.

Hope that helps, Colby

Sithara85 commented 6 years ago

Hi Colby, thank you for the detailed response. We decided to go with SVs with a QUAL score >100. Currently I am validating genome NA12877 sequenced at 100x coverage against down-sampled versions at 30x and 10x, looking at deletions. We are filtering SVs based on AB (allele balance from the VCF file), defining heterozygous deletions as 0.3 <= AB <= 0.7 and homozygous deletions as AB > 0.7, using AB instead of VAF (variant allele frequency). Have you done any validation of SpeedSeq on low-coverage samples? We expected that at low coverage we would miss small SVs, but given the combination method for SV calling, is this assumption true for SpeedSeq?
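The AB-based deletion classification described above could be sketched as follows, reading the thresholds as heterozygous when AB falls between 0.3 and 0.7 and homozygous when AB exceeds 0.7 (these thresholds are the commenter's, not SVTyper defaults):

```python
# Classify a deletion genotype from allele balance alone,
# using the cutoffs described in the comment above.

def classify_deletion(ab):
    """Return a genotype label for a deletion given its allele balance."""
    if ab > 0.7:
        return "hom_alt"   # mostly alternate-supporting reads
    if ab >= 0.3:
        return "het"       # roughly balanced ref/alt support
    return "ref"           # too little alt support to call a deletion

print(classify_deletion(0.85))  # hom_alt
print(classify_deletion(0.50))  # het
print(classify_deletion(0.10))  # ref
```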

Thank you, Sithara

cc2qe commented 6 years ago

Hi Sithara,

That strategy sounds reasonable to me. We have not assessed SVTyper performance by coverage depth. But I agree that lower coverage will decrease call sensitivity, and this may be more pronounced for small deletions. I'm assuming by "combination method" you're referring to the read-depth + SVTyper analysis, and this expectation will remain true under that model as well.

--Colby

Sithara85 commented 6 years ago

Hi Colby,

I did a validation on deletions alone at 100x, 30x and 10x; our assumption that lower coverage would reduce call sensitivity did not hold up in my analysis. By "combination method" I meant that SpeedSeq uses read-depth, split-read, and read-pair evidence together with SVTyper breakpoint genotyping, which could be why we didn't see much reduction in call sensitivity at reduced coverage.

Thank you, Sithara
