Do D and I have inverted roles in CIGAR strings? #46

smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2

Closed marcelm closed 1 year ago

marcelm commented 1 year ago

Running wfademo.cpp, I noticed that the meaning of D and I in the CIGAR output seems to have been swapped from their usual meaning. Here’s an example taken from the README:

    PATTERN    AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
               | ||| |||||  ||||||||   | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M3I1M1X9M

The README states that the text is equivalent to the reference and the pattern is equivalent to the query (which makes sense). If I take the above pattern to be a sequencing read and the text to be a genome reference, then the two gaps in the pattern would be considered deletions, yet they are encoded as 1I and 3I, respectively. Or should I think about this differently?
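One way to check which sequence each operation consumes is to walk the CIGAR and count. The sketch below is not part of WFA2-lib; `parse_cigar` and `consumed_lengths` are hypothetical helper names, and it assumes that in the output above I advances only the text and D advances only the pattern. Under that assumption the counts match the two sequence lengths in the example.

```python
import re

# One CIGAR operation: a run length followed by M (match), X (mismatch),
# I (insertion) or D (deletion).
CIGAR_OP = re.compile(r"(\d+)([MXID])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '1M1X3M1I' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the regex skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    return ops


def consumed_lengths(cigar):
    """Return (pattern_len, text_len) consumed by the CIGAR.

    Assumption: I consumes text only and D consumes pattern only,
    i.e. the CIGAR describes how to turn the pattern into the text.
    """
    pattern_len = text_len = 0
    for n, op in parse_cigar(cigar):
        if op in "MX":
            pattern_len += n
            text_len += n
        elif op == "I":
            text_len += n
        else:  # op == "D"
            pattern_len += n
    return pattern_len, text_len


pattern = "AGCTAGTGTCAATGGCTACTTTTCAGGTCCT"
text = "AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT"
cigar = "1M1X3M1I5M2X8M3I1M1X9M"

# The counts agree with the ungapped sequence lengths from the example.
assert consumed_lengths(cigar) == (len(pattern), len(text))
```

If I and D were interpreted the SAM way (I consumes the query, D consumes the reference), the same CIGAR would imply a 35-character pattern and a 31-character text, which does not match the sequences.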

smarco commented 1 year ago

Hi,

This I can answer right away. WFA2-lib follows the convention in which the CIGAR describes how to transform the pattern/query into the text/database/reference (as in classic pattern-matching papers). The SAM CIGAR standard works the other way around (there, the reference is the important sequence). Leaving aside which convention is better (I think both are fine), if you want SAM-style CIGARs, just swap the pattern and text sequences when calling the WFA align function, and all the Ds will become Is (and vice versa).
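If swapping the arguments is inconvenient, the same effect on the I and D letters can be had by post-processing the CIGAR string. This is only a sketch: `swap_indels` is a hypothetical helper, not a WFA2-lib function, and it assumes the CIGAR contains nothing but run lengths and the letters M, X, I and D.

```python
# Translation table that exchanges the two gap letters and leaves
# digits, M and X untouched.
_SWAP_ID = str.maketrans("ID", "DI")


def swap_indels(cigar):
    """Exchange I and D in a CIGAR string (hypothetical helper)."""
    return cigar.translate(_SWAP_ID)


wfa_cigar = "1M1X3M1I5M2X8M3I1M1X9M"
sam_style = swap_indels(wfa_cigar)
assert sam_style == "1M1X3M1D5M2X8M3D1M1X9M"
```

Applying the function twice returns the original string, so it can be used to convert in either direction.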

Let me know if that helps.

marcelm commented 1 year ago

Thanks! I see. Would you consider adding a comment to the README to make this clear for others as well?

Swapping pattern and text is of course the simplest fix for this, and it is what I’m using at the moment.

smarco commented 1 year ago

Sure (sorry for the delay). Please have a look at the development branch and let me know if that is clearer.

Thanks,

marcelm commented 1 year ago

Thanks, that is clear enough!