statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
89 stars 30 forks source link

CIGAR does not evaluate to the same length as SEQ #17

Open h-2 opened 8 years ago

h-2 commented 8 years ago

This warning is produced by validate even if the SEQ is set to *. The SAM spec explicitly states that it is ok for the SEQ to be * and that the lengths don't have to match:

If not a ‘*’, the length of the sequence must equal the sum of lengths of M/I/S/=/X operations in CIGAR

It even states that SEQ should be * for secondary alignments:

SEQ and QUAL of secondary alignments should be set to ‘*’ to reduce the file size.

Or am I misunderstanding something?

Thanks!

pjvandehaar commented 8 years ago

Thanks for pointing this out and finding the relevant parts of the documentation. I believe that you are correct. bamUtil won't warn if the CIGAR is *, but doesn't check SEQ's contents.

That validation is done in libStatGen, and I'm afraid that some programs using that library might make assumptions without knowing about *. I'll test things and report back.

mktrost commented 8 years ago

I had not considered secondary reads so when I put in the validation, I didn't see a situation where the cigar would have value, but there would be no sequence. You are correct, that validation does not seem to be valid, so I will try to fix that validation to only validate if SEQ is not '*'. I should be able to do this sometime this week. I'll let you know when complete.

mktrost commented 8 years ago

I updated libStatGen so that the cigar/seq length validation should only occur if the seq is not '*'. Please use the most recent libStatGen and let me know if this fixes the problem or if you continue to have issues.