Closed brendanofallon closed 5 years ago
Suspicious that all of the reads are on the reverse strand. Is this WGS or WES? Octopus will usually ignore variants that are so highly strand biased as they are indicative of sequencing or mapping error. I could potentially tweak the logic to include these 'cleaner' strand-biased variants, but they'd almost certainly get filtered in any case.
The data are from a large targeted panel - but I'm not sure that this variant is strand biased. The reads all go in one direction (we're on the right edge of a targeted region), but the number of alt-supporting reads isn't strongly biased in one direction relative to ref-supporting reads. (From what I understand, strand bias isn't a test of how close the forward / reverse ratio is to 50/50, it's a Chi-squared or FIsher's exact-like test to measure if alt-supporting reads are distributed similarly to ref-supporting reads across read directions. In the case above where there aren't any forward reads, the test probably cannot be performed meaningfully) I've come across a handful of other false negative SNVs in similar regions where all the reads are in one direction. Happy to provide more small .bams or something if it would be helpful.
You're right that "strand bias" usually describes the difference in proportion of read direction supporting two or more distinct haplotypes. Octopus does compute a statistic of this sort of strand bias for filtering purposes. But as you point out, this quantity is not meaningful in the absence of reads in one direction or for homozygous calls. For WGS, we expect a binomial distribution of read directions supporting all called haplotypes, so observing a haplotype on just one strand is suggestive that the either the call is incorrect (e.g. due to a missed structural variant). It's therefore informative (about the validity of a call) to consider just the proportion of supporting read directions, which I also called "strand bias" for want of a better term. For targeted sequencing, we don't expect a binomial distribution of read directions for off-target regions, but then we cannot reasonably assign high confidence to calls in these regions in any case. I'd argue that such regions should be included in the panel, if variation within them is important.
That said, I can appreciate that some might want such variants to be called, even if they are filtered. I've therefore modified the variant discovery criteria to allow variants only seen in one direction if there are only reads in that direction overlapping the region (available in v0.5.3-beta). This should cause the variants that you mention to be called.
Using 0.7.4 docker image of octopus. Running into this:
The ploidy is 6 instead of 2. The middle mismatches of 4-bp deletion and the SNV are both captured but failed to have a valid GT (in the output both are recorded as ./.) The overlapping reads MQ and BQ are great, no strand bias.
GGGGA G 5884.22 RF . GT:AD:DP:FT ./.:67,40:152:RF G A 5379.3 RF . GT:AD:DP:FT ./.:67,49:151:RF
Here are the command line arguments I used:
/opt/octopus/bin/octopus \ -R ${ref_genome} \ -I ${bam_file} \ -o ${output_vcf} \ --trace ${output_vcf/.vcf*/.trace.log} \ --debug ${output_vcf/.vcf*/.debug.log} \ --bamout ${bam_out} \ --bamout-type FULL \ ${region_arg} \ --threads ${threads} \ --consider-unmapped-reads \ --min-mapping-quality 15 \ --duplicate-read-detection-policy RELAXED \ --disable-downsampling \ --variant-discovery-mode ILLUMINA \ --organism-ploidy ${ploidy} \ --read-linkage PAIRED \ --caller ${caller_mode} \ --min-forest-quality ${forest_vq_filter} \ --forest ${forest} \ --annotations AD
A bit confused why this happened
Describe the bug Octopus appears to be missing a simple, fairly high coverage homozygous SNV call
Command Command line to run octopus:
Desktop (please complete the following information):
Additional context In testing against GIAB samples I came across a few false negative variants that appear readily in IGV, for instance there's a fairly clear homozygous C>G SNV at position 1:3318769 (obvious in attached screenshot). However, the variant does not appear in the VCF output. Other variants in the region are detected correctly. I'm probably missing something obvious here... Happy to provide a small .bam file of the region as well - not sure how to include it here (it's about 100kb)