Add `--details` option for showing alignment details

marcelm commented 1 year ago

This adds a command-line option named --details. When provided, strobealign adds a couple of extra SAM tags to each record that contains an aligned read. At the moment, these two tags are added, but this can grow in the future:

na: Number of NAMs
re: Whether the NAMs where found normally (value: 0) or through rescue (value: 1) (Lowercase because tags containing lowercase letters are "reserved for end users" according to the SAM spec.)

I've been wanting to have something like this for a while in order to be able to better see how a certain alignment arose.

This is mainly intended for debugging and would hopefully allow us to more directly see why a read failed to align, why it aligned incorrectly or why anything else unexpected happened. We could also instruct users to re-run strobealign with that option when they report an enexpected alignment, which would maybe allow us to help them or identify the problem without having to re-run the dataset ourselves.

We could also use this (by parsing the tags with an extra script) to get more detailed statistics. For example, we could get the distribution of the number of NAMs or find regions where rescue was necessary particularly often.

Other tags we could add:

Whether a mate was rescued
Whether ungapped (Hamming) or gapped (SSW) alignment was done

For the moment, some function signatures look a bit more complicated because they have gained yet another parameter. However, for many functions, we can switch from passing Statistics and Details to only passing Details (which would need to gain the same attributes as the statistics). Then the functions don't know about general statistics anymore, but only fill in the particular read-specific details. Then after processing the read, the details are added to the statistics. (This is already done in this PR for the no. of rescued reads.)

This PR also contains some refactoring, some of which was necessary to get this to work, but some of which was only indirectly related.

Marking this as draft because I need to measure impact on performance (I expect it to be essentially zero).

ksahlin commented 1 year ago

I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was hamming or not.

I also had a look at the code changes and looks good. I was a little bit confused whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked and also the phiX dataset does not show any changes.

marcelm commented 1 year ago

I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

I think this PR is excellent, and I agree with what you write about adding whether a mate was rescued and whether it was hamming or not.

I added the mr tag, which indicates that "mate rescue" was done. I've adjusted the terminology also in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into hamming/SSW as well.

I also had a look at the code changes and looks good. I was a little bit confused whether the is_suppl -> is_primary change is also propagated to the PE reads (I saw only changes in the SE case), but I assume you have checked and also the phiX dataset does not show any changes.

It was indeed confusing: The paired-end code already had the is_primary logic; commit 0755994 just makes it consistent for both the single- and paired-end functions.

ksahlin commented 1 year ago

I forgot to mention: I actually used -d for this option and --details is just an alias. Let me know if you prefer not to use the (valuable) single-character option. BWA-MEM uses -d, where it means "off-diagonal X-dropoff".

Good point, since this will mainly be used for diagnostics I think we can save -d and use only --details.

I added the mr tag, which indicates that "mate rescue" was done. I've adjusted the terminology also in the logging output to distinguish "NAM rescue" and "mate rescue". I'll look into hamming/SSW as well.

Awesome!

marcelm commented 1 year ago

Ok, I removed the -d alias, added a ga tag that shows the no. of gapped alignments, and managed to get rid of the Statistics parameter for some functions (they only get Details now).

The impact on runtime when --details is not used is nonexistent, as expected.

ksahlin commented 1 year ago

okay, great. Happy to merge.

ksahlin / strobealign

Add `--details` option for showing alignment details #318