Closed brodie-mumphrey closed 11 months ago
The easiest thing to try would be using -B
. This way it will call something homozygous only if it has 6X the depth of the next best allele. The factor can be adjusted with -b
.
By default it uses a Poisson model based on background depth and a baseline error rate. You can try decreasing this (-e
) to see if it helps.
I think there's an underlying issue where these parameters were tuned on coverage in the 20-50X range, obviously aren't scaling well to your 1000X plus data.
Thank you for the help! I ran it with -B
, and it does indeed seem to cause these heterozygous, however with the error flag lowxadl
. I can ignore this flag, although is it expected that this would have such a low log likelihood? The same example variants from above:
K 517 >1603>1611 CGT CAT 415 lowxadl AT=>1603>1605>1608>1610>1611,>1603>1605>1607>1610>1611;DP=2026 GT:DP:AD:XADL:XAAD:MAD 1/0:2026:415,1611:-380.583836:1611:415
K 681 >1761>1775 TCTAC TCTGC 440.552 lowxadl AT=>1761>1763>1765>1768>1772>1774>1775,>1761>1763>1765>1768>1770>1774>1775;DP=2345 GT:DP:AD:XADL:XAAD:MAD 1/0:2345:441,1904:-495.460027:1904:440
K 692 >1780>1790 CAGG AACA,CACA 379.275 lowxadl AT=>1780>1784>1785>1787>1789>1790,>1780>1783>1785>1786>1788>1790,>1780>1784>1785>1786>1788>1790;DP=2026 GT:DP:AD:XADL:XAAD:MAD 1/2:2026:90,1557,379:-388.131187:1936:379
K 1018 >1964>1981 AGCATG AGGATG 412.566 lowxadl AT=>1964>1967>1971>1974>1976>1978>1980>1981,>1964>1967>1971>1973>1976>1978>1980>1981;DP=2553 GT:DP:AD:XADL:XAAD:MAD 1/0:2553:413,2140:-643.290537:2140:412
K 1074 >2024>2036 GTAC GTCC 443 lowxadl AT=>2024>2026>2030>2033>2035>2036,>2024>2026>2030>2031>2035>2036;DP=2913 GT:DP:AD:XADL:XAAD:MAD 1/0:2913:443,2470:-inf:2470:443
After looking more closely at the my VCF from the original post again, I am also noticing that the original issue only seems to affect sites that are highly multi-allelic. In the same region, VCF entries with only one or two alternate alleles are called as heterozygous as expected. Is it expected that having many alternate alleles would affect vg call
? Examples with one or two alternate alleles that are called heterozygous with similar read support, from the original VCF without using -B
:
K 993 >1951>1956 GG TG,TA 30873.9 PASS NS=618;DP=1984;AF=0.179612,0.00809061;AT=>1951>1953>1955>1956,>1951>1952>1955>1956,>1951>1952>1954>1956;AN=2;AC=1,0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:1984:382,1602:-3282.21,-195.297,-477.916,.,.,.:256:-1.09861:1669.23:382
K 1120 >2055>2058 C A 47715.9 PASS NS=618;DP=2939;AF=0.711974;AT=>2055>2057>2058,>2055>2056>2058;AN=2;AC=1 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:2939:479,2460:-5098.79,-327.67,-545.888:256:-1.09861:2891.56:479
K 1216 >2151>2155 GA CT,AC 39499.1 PASS NS=618;DP=2530;AF=0.480583,0.23301;AT=>2151>2154>2155,>2151>2153>2155,>2151>2152>2155;AN=2;AC=1,0 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:2530:481,2049:-4189.41,-239.972,-585.746,.,.,.:256:-1.79176:2737.25:481
K 1260 >2171>2174 C G 42512 PASS NS=618;DP=2667;AF=0.791262;AT=>2171>2173>2174,>2171>2172>2174;AN=2;AC=1 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:2667:469,2198:-4525.07,-274.335,-551.42:256:-1.09861:2587.62:469
K 1262 >2174>2177 A C 41706.9 PASS NS=618;DP=2629;AF=0.839806;AT=>2174>2176>2177,>2174>2175>2177;AN=2;AC=1 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:2629:471,2158:-4434.55,-264.323,-557.44:256:-1.09861:2587.62:471
K 1267 >2177>2180 G C 40410.3 PASS NS=618;DP=2548;AF=0.794498;AT=>2177>2179>2180,>2177>2178>2180;AN=2;AC=1 GT:DP:AD:GL:GQ:GP:XD:MAD 1/0:2548:457,2091:-4296.5,-255.938,-541.159:256:-1.09861:2587.62:457
I just looked at the code. lowaxdl
is coming from a binomial probability calculation. I think it might be having a numerical issue with your high coverage. If using -B
, my suggestion would be set everything as PASS
then threshold on MAD
(min allele depth) to remove low-quality calls.
For the multi-allele with the default options, perhaps it's getting a little confused with numerous very similar alleles. But it still seems like the default thresholds are tuned slightly differently than you expect. If you're able to share the data (graph and .pack) I can try to take a look to confirm.
Thank you for the help with this. I unfortunately am not able to share the data, but the -B
option seems to be solving this particular issue.
1. What were you trying to do?
Use
vg call
to genotype a graph.2. What did you want to happen?
Have heterozygous variants called as heterozygous.
3. What actually happened?
Variants were called homozygous despite hundreds of reads supporting both alleles. One allele has significantly less allele support due to allelic imbalance, so I understand that it may deviate further than expected from the neutral 50/50 expectation. Are there options to adjust this so that lower allelic fraction variants can still be called? Example output for a few affected variants:
4. If you got a line like
Stack trace path: /somewhere/on/your/computer/stacktrace.txt
, please copy-paste the contents of that file here:5. What data and command can the vg dev team use to make the problem happen?
6. What does running
vg version
say?