samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Amalgamate multiple CIGAR ops into single entry. #1607

Closed whitwham closed 1 year ago

whitwham commented 1 year ago

Multiple matching (or sequence (mis)matching)) ops (e.g. 10M40M) give a different VCF using BAQ than a single operation of the same length (e.g. 50M). This change compresses the multiple operations into one.

In combination with samtools/bcftools#1914 it should fix #1597.

whitwham commented 1 year ago

Some small corrections, and a suggestion which I can take or leave either way.

It could also do with test data demonstrating the problem, but I can't reproduce it! I'll take a look at the original issue and see if I can get something smaller to test on.

To replicate the original error you also need the version of bcftools without the fix. This change keeps the ID=16 numbers the same, the bcftools one has the alternative bases.

Also, the test data the provided by the users was the smallest they could produce to show the error.

jkbonfield commented 1 year ago

I've managed to produce something really tiny now, starting with the original test data provided. I think that's just simulated data, but I don't know the license on it so will try and figure out the cause of it going wrong to produce our own.

The easiest way I found of reproducing it also, and what we'd need here, is via test/test_realn. We already have test files, so we can just add one more with two entries using =/X vs M.

jkbonfield commented 1 year ago

Here is some test data.

test/realn03.fa

>MX
CGTCTACTACG

test/realn03.sam

@HD VN:1.6  SO:coordinate
@SQ SN:MX   LN:11
M   64  MX  1   60  11M *   0   0   CGTCTCCTACG IIIIIIIIIII
X   64  MX  1   60  5=1X5=  *   0   0   CGTCTCCTACG IIIIIIIIIII

Then run:

./test/test_realn -e -f test/realn03.fa -i test/realn03.sam

You should see the same BQ tag in both cases, but don't with develop. There are already test files for test_realn, so this can just be another 1 to add and put into test.pl

I've verified it's only affecting extended mode, which tallies with the code changes.