Properly convert missing qualities in BAM

rhpvorderman commented 1 month ago

Sometimes a quality sequence is omitted and it is filled with 0xff in the BAM record. Currently this is not properly handled by dnaio. This PR checks if a quality sequence is omitted and proceeds to fill up the qualities with ! which is the phred corresponding to 0.

rhpvorderman commented 1 month ago

This is a pretty nasty edge case. I am not quite sure about the correct behaviour here. Ideally qualities should be set to None. But that will cause problems with downstream applications that assume qualities are available. Such as cutadapt. Especially given that the rest of the BAM file may have qualities present. So I set quality to 0 now for all bases.

Unfortunately I do have a uBAM file with one (just one!) such a record that I want to process with cutadapt. I am not sure how to proceed here. Any opinions @marcelm ?

rhpvorderman commented 1 month ago

I root-caused the issue. How do you write a quality string of length 1 in SAM format?

H

How do you write a missing quality string for a sequence of length 1 in SAM format?

*

So how do you write a quality string for sequence of length 1 where the one base has exactly a Phred score of 9?

You should not. As per the spec. But if you did not know this it would be *. But this is illegal for sequences that have a known quality value and a length of 1.

BAM does not have this ambiguity. I assume my record was written in SAM initially and then converted to bam. Resulting in an error in sequali. Since dnaio uses the same sort of parsing I also want to fix the bug here.

marcelm commented 1 month ago

Interesting you found out what went wrong! Maybe this edge case even deserves a clarification in the SAM spec?

This is a bit tricky because setting the qualities to ! doesn’t feel quite right. Especially in the not unthinkable case where qualities are missing for all records (maybe someone converted a FASTA file to BAM). Also, maybe one may want to argue that it makes sense to instead pick the highest possible quality value (~).

I think Cutadapt would actually be able to cope with the qualities being None, except at the very end when it tries to write out the record to a FASTQ file (using dnaio of course). It may be cleaner to set the qualities to None during BAM parsing and postpone the decision of what to do with None values to the point when a FASTQ is actually written.

rhpvorderman commented 1 month ago

Fully agree on all of your points.

I will adapt the PR to change the qualities to None.

Also, maybe one may want to argue that it makes sense to instead pick the highest possible quality value (~).

In this particular scenario, a bug caused by a single-nucleotide read, it does not matter either way. A single-nucleotide read is useless regardless of the quality. I don't have to sequence a genome to be 100% sure there is an adenine base in it somewhere.

The other scenario is indeed from FASTA. Then it comes down to whether the sequences are trusted by the researcher or not. I tend to think that a researcher would not use the FASTA if sequences were not trusted. The sane thing would then indeed be to choose phred 93 to signify that these qualities are artificial and stop downstream tools from discarding the reads.

I will postpone any manipulation to FASTQ writer for now, as I will filter out the 1-base read anyway. So it should not be written. Most important is that it can be read and that DNAIO adhers to the BAM spec.

rhpvorderman commented 1 month ago

@marcelm Done!. Also found a possible cause for a bug where the length of the sequence was not checked before deferencing the qualities. In theory this could lead to an error when l_seq == 0. (In practice there should always be a RG:Z: tag so it should not lead to sugmentation faults, but nevertheless it is good to be correct lest there should be some edge cases that I could not think of).

marcelm commented 1 month ago

Thanks, agree also with you regarding that it doesn’t matter which quality in this particular case.

Feel free to merge! (After you’re done force-pushing ;-) )

rhpvorderman commented 1 month ago

Sorry, I noticed some mistakes including a now redundant memset and an incorrect changelog entry. That is now ok. If you don't object, I will make a bugfix release right away.

marcelm commented 1 month ago

Please go ahead with a new release!

rhpvorderman commented 1 month ago

Thank you for your quick reply! Will do it now!

marcelm / dnaio

Properly convert missing qualities in BAM #139