pezmaster31 / bamtools

C++ API & command-line toolkit for working with BAM data
MIT License
417 stars 153 forks source link

read with single base and quality = 9 #100

Open grenaud opened 10 years ago

grenaud commented 10 years ago

I have a bam record with a single base "A" and a quality score of 9 which is a "*" in the ASCII table (33+9). However, I would expect to see a 42 but xxd -b tells me I have: 00010000 11111111 however, I would expect this: 00010000 00001001

Can this be fixed ?

pezmaster31 commented 10 years ago

Don't forget that BAM is not "simply@ binary, but compressed as well. If you're just using cmd line tools, run it through (samtools/tabix) 'bgzip -dc' and pipe into xxd.

grenaud commented 10 years ago

Hi Derek ! Thanks for your reply. Yes the message above is decommpressed and piped into xxd.

pezmaster31 commented 10 years ago

Ok, just wanted to check the easy stuff first. :)

Looked a bit into this, and it's basically due to the fact that '*' is a special char in SAM, indicating 'no data' for that field.

So for compatibility's sake: when BT is writing BAM records and sees there's a single quality char that's an asterisk, it sets the qualities string to all 1's (BAM way of indicating no data).

I guess I can ignore the SAM-compatible behavior, and write the actual value, if the bases field is length 1 .

grenaud commented 10 years ago

That would be great ! I understand when we read SAM, it is impossible to tell whether the quality field is empty but this is via C++ code and we can disentangle both cases. I noticed when another program started complaining that my other program was producing empty fields.

pezmaster31 commented 10 years ago

Sorry for the inconvenience. Guess the idea of a 1-base alignment never occurred to me when I wrote that little workaround. :)

grenaud commented 10 years ago

yeah we deal with ancient DNA and we can end up with really tiny sequences.