torognes / vsearch

Versatile open-source tool for microbiome analysis

CIGAR strings differ between samout and other outputs #259

Open frederic-mahe opened 7 years ago

frederic-mahe commented 7 years ago
vsearch \
    --usearch_global <(printf '>query\nAAGGGGGGGGGCCC\n') \
    --db <(printf '>target\nAAGGGGAAAAGGGGCC\n') \
    --minseqlength 1 \
    --quiet \
    --id 0.1 \
    --userfields caln \
    --userout - \
    --samout - \
    --alnout -
Qry  1 + AAgggg---gggggCC 13
         ||||||    ||||||
Tgt  1 + AAGGGGAAAAGGGGCC 16

SAM:  6M3D7M1I
caln: 6M3I7MD

SAM's CIGAR strings require a count before each operation letter ("7M1I", where the caln field writes "7MD" with no count before the "D"), but the main difference is in the "point of view".

SAM's CIGAR strings encode the target modifications needed to equal the query, whereas CIGAR strings in other output formats encode the query modifications needed to equal the target.
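If that reading is right, converting a caln string into the SAM form should only require swapping I and D and writing out the implicit counts of 1. Below is a minimal sketch of that conversion. The function name `caln_to_sam` is hypothetical and not part of vsearch, and it assumes the string contains only M, I and D operations, as in the example above.

```python
import re

# One CIGAR operation: an optional count followed by an operation letter.
# Assumption: only M, I and D occur, as in the example in this issue.
_OP = re.compile(r"(\d*)([MID])")

# Changing the point of view swaps insertions and deletions.
_SWAP = {"M": "M", "I": "D", "D": "I"}


def caln_to_sam(caln: str) -> str:
    """Rewrite a caln-style CIGAR string in SAM style (hypothetical helper)."""
    ops = _OP.findall(caln)
    # Reject input that is not made up entirely of the operations handled here.
    if "".join(count + op for count, op in ops) != caln:
        raise ValueError(f"unsupported CIGAR string: {caln!r}")
    # A missing count means 1; SAM output always spells the count out.
    return "".join(f"{count or 1}{_SWAP[op]}" for count, op in ops)


print(caln_to_sam("6M3I7MD"))  # 6M3D7M1I
```

Applied to the caln value above, this gives the same string that vsearch reports in the SAM output.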

If this is confirmed, that should be indicated in the documentation.

torognes commented 7 years ago

This can be confirmed.

Here is the SAM format specification:

https://samtools.github.io/hts-specs/SAMv1.pdf

Other sources of information:

https://doi.org/10.1093/bioinformatics/btp352

http://www.drive5.com/usearch/manual/cigar.html

The spec indicates that the CIGAR string in the SAM format is in the direction from the reference (target) to the query. Deleted symbols are only found in the reference (target), while inserted symbols are only found in the query. The point of view may be different in other formats. I have made the output similar to USEARCH.
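One way to check the direction is to add up the lengths each operation consumes: in SAM, M and I consume query symbols, while M and D consume reference (target) symbols. The sketch below does this for the example in this issue. The function name `sam_cigar_lengths` is hypothetical, and only the M, I and D operations are handled (the SAM spec defines more).

```python
import re


def sam_cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (query length, target length) implied by a SAM CIGAR string.

    Hypothetical helper; handles only M, I and D.
    """
    query_len = 0
    target_len = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op in "MI":  # matches and insertions consume query symbols
            query_len += n
        if op in "MD":  # matches and deletions consume target symbols
            target_len += n
    return query_len, target_len


print(sam_cigar_lengths("6M3D7M1I"))  # (14, 16)
```

The result agrees with the sequences in the report: the query is 14 symbols long and the target is 16.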

The spec does not say, but in SAM files there always seems to be a '1' in front of operations that occur only once, whereas in other contexts this '1' is sometimes omitted.