Closed whitwham closed 1 year ago
Some small corrections, and a suggestion which I can take or leave either way.
It could also do with test data demonstrating the problem, but I can't reproduce it! I'll take a look at the original issue and see if I can get something smaller to test on.
To replicate the original error you also need the version of bcftools without the fix. This change keeps the ID=16 numbers the same, the bcftools one has the alternative bases.
Also, the test data the provided by the users was the smallest they could produce to show the error.
I've managed to produce something really tiny now, starting with the original test data provided. I think that's just simulated data, but I don't know the license on it so will try and figure out the cause of it going wrong to produce our own.
The easiest way I found of reproducing it also, and what we'd need here, is via test/test_realn. We already have test files, so we can just add one more with two entries using =/X vs M.
Here is some test data.
test/realn03.fa
>MX
CGTCTACTACG
test/realn03.sam
@HD VN:1.6 SO:coordinate
@SQ SN:MX LN:11
M 64 MX 1 60 11M * 0 0 CGTCTCCTACG IIIIIIIIIII
X 64 MX 1 60 5=1X5= * 0 0 CGTCTCCTACG IIIIIIIIIII
Then run:
./test/test_realn -e -f test/realn03.fa -i test/realn03.sam
You should see the same BQ tag in both cases, but don't with develop. There are already test files for test_realn, so this can just be another 1 to add and put into test.pl
I've verified it's only affecting extended mode, which tallies with the code changes.
Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.
In combination with samtools/bcftools#1914 it should fix #1597.