maickrau / GraphAligner

MIT License

Clarification of CIGAR string use #19

Closed hgibling closed 3 years ago

hgibling commented 4 years ago

Hello,

Am I correct in interpreting the CIGAR string for the README example alignment as 4 matches, 1 mismatch, 2 matches, 1 mismatch, 1 insertion, 37 matches, etc.?

read 71 0 71 + >1>2>4 87 3 73 66 72 255 NM:i:6 dv:f:0.0833333 id:f:0.916667 cg:Z:4M1M2M1M1I37M1D5M1I5M1M13M

Generally M is used for either matches or mismatches, so I'm used to seeing CIGAR strings like 8M1I37M1D5M1I19M. Is there a reason = and X aren't used to distinguish sequence matches from mismatches instead? I'm thinking of cases like 1M1M1M1M1D where the placement of mismatches is ambiguous.
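To make the comparison concrete, here is a minimal Python sketch of what "compressing" such a CIGAR string would mean: adjacent operations of the same type are merged into one run. The function names `parse_cigar` and `compress_cigar` are hypothetical helpers written for this illustration; they are not part of GraphAligner.

```python
import re
from itertools import groupby

# One CIGAR operation: a run length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def compress_cigar(cigar):
    """Merge adjacent operations of the same type into a single run."""
    merged = []
    for op, run in groupby(parse_cigar(cigar), key=lambda pair: pair[1]):
        merged.append(f"{sum(n for n, _ in run)}{op}")
    return "".join(merged)

print(compress_cigar("4M1M2M1M1I37M1D5M1I5M1M13M"))  # 8M1I37M1D5M1I19M
```

Merging the README string this way gives the compact form I would normally expect, but it also shows the problem: once the runs are merged, the positions of the mismatches are no longer recoverable from the CIGAR alone.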

ekg commented 4 years ago

This might be a case of the cigar not being compressed.

Might I also suggest the use of the "cs" tag, as in minimap2? This would allow the query sequence to be reconstructed from the graph path and the cs tag. This is being implemented in vg. The behavior matches that of GAM's alignment description.
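As a rough illustration of that reconstruction, here is a Python sketch assuming minimap2's short-form cs syntax (`:N` for N identical bases, `*xy` for a substitution of reference base x by query base y, `+seq` for an insertion, `-seq` for a deletion). It takes the sequence spelled by the graph path, starting at the alignment's path start, as a plain string. The function name `query_from_cs` is hypothetical, and intron operations (`~`) are not handled.

```python
import re

# Short-form cs operations: match run, substitution, insertion, deletion.
CS_OP = re.compile(r":(\d+)|\*([acgtn])([acgtn])|\+([acgtn]+)|-([acgtn]+)")

def query_from_cs(path_seq, cs):
    """Rebuild the aligned query from the path sequence and a cs string."""
    out = []
    pos = 0      # current offset in path_seq
    parsed = 0   # how much of the cs string has been consumed
    for m in CS_OP.finditer(cs):
        if m.start() != parsed:
            raise ValueError(f"unparsable cs string at offset {parsed}")
        parsed = m.end()
        match_len, sub_ref, sub_qry, ins, dele = m.groups()
        if match_len is not None:
            n = int(match_len)
            out.append(path_seq[pos:pos + n])  # identical bases: copy from path
            pos += n
        elif sub_ref is not None:
            out.append(sub_qry.upper())        # substitution: take the query base
            pos += 1
        elif ins is not None:
            out.append(ins.upper())            # insertion: bases only in the query
        else:
            pos += len(dele)                   # deletion: skip path bases
    if parsed != len(cs):
        raise ValueError(f"unparsable cs string at offset {parsed}")
    return "".join(out)

print(query_from_cs("ACGTACGT", ":3*ta+gg:2-gt"))  # ACGAGGAC
```

The CIGAR string alone cannot do this, because it records neither the substituted nor the inserted bases.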

maickrau commented 4 years ago

Hi,

Am I correct in interpreting the CIGAR string for the README example alignment as 4 matches, 1 mismatch, 2 matches, 1 mismatch, 1 insertion, 37 matches, etc.?

You are correct.

Generally M is used for either matches or mismatches, so I'm used to seeing CIGAR strings like 8M1I37M1D5M1I19M. Is there a reason = and X aren't used to distinguish sequence matches from mismatches instead? I'm thinking of cases like 1M1M1M1M1D where the placement of mismatches is ambiguous.

There's no real reason; it's just an artifact of how the CIGAR string is built from the backtrace. I agree that using the unambiguous characters could be a better idea.

Might I also suggest the use of the "cs" tag, as in minimap2? This would allow the query sequence to be reconstructed from the graph path and the cs tag. This is being implemented in vg. The behavior matches that of GAM's alignment description.

That sounds interesting, I'll look into this.

maickrau commented 3 years ago

Starting with 7f1ed72, the CIGAR string uses =/X by default. The option "--cigar-match-mismatch" uses M instead and merges runs of matches/mismatches into one M.
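For anyone post-processing the output, here is a small Python sketch of the relationship between the two forms. It is an illustration only, not GraphAligner's implementation: the function names are hypothetical, and the =/X string is the README example rewritten by hand according to the interpretation confirmed earlier in this thread, not actual program output.

```python
import re
from itertools import groupby

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def to_m_cigar(cigar):
    """Rewrite = and X as M, then merge adjacent runs of the same operation."""
    ops = [(int(n), "M" if op in "=X" else op) for n, op in CIGAR_OP.findall(cigar)]
    return "".join(
        f"{sum(n for n, _ in run)}{op}"
        for op, run in groupby(ops, key=lambda pair: pair[1])
    )

def edit_distance(cigar):
    """Count mismatched, inserted and deleted bases in an =/X CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "XID")

def identity(cigar):
    """Matched bases divided by the total length of =, X, I and D runs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    matches = sum(n for n, op in ops if op == "=")
    block = sum(n for n, op in ops if op in "=XID")
    return matches / block

# Hand-written =/X form of the README example (an assumption, see above).
example = "4=1X2=1X1I37=1D5=1I5=1X13="
print(to_m_cigar(example))     # 8M1I37M1D5M1I19M
print(edit_distance(example))  # 6
```

With the =/X form, the edit distance of 6 and the identity of 66/72 agree with the NM:i:6 and id:f:0.916667 fields of the README line; neither can be computed from the merged M form.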