samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

Implementation of a SAMRecord comparator that matches samtools' queryname sort order. #1600

Open tfenne opened 2 years ago

tfenne commented 2 years ago

@nh13, @lbergelson I put this together mostly for fun - it's a comparator that will sort read names the same way samtools does. I'm not really sure what the best way to integrate this is in HTSJDK though. Thoughts?

codecov-commenter commented 2 years ago

Codecov Report

Merging #1600 (2260167) into master (a38c78d) will increase coverage by 0.014%. The diff coverage is 94.118%.

@@               Coverage Diff               @@
##              master     #1600       +/-   ##
===============================================
+ Coverage     69.838%   69.852%   +0.014%     
- Complexity      9653      9664       +11     
===============================================
  Files            703       704        +1     
  Lines          37647     37681       +34     
  Branches        6114      6121        +7     
===============================================
+ Hits           26292     26321       +29     
- Misses          8904      8906        +2     
- Partials        2451      2454        +3     
Impacted Files Coverage Δ
.../samtools/SAMRecordQueryNameNaturalComparator.java 94.118% <94.118%> (ø)
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 80.952% <0.000%> (-9.524%) :arrow_down:
src/main/java/htsjdk/io/AsyncWriterPool.java 72.222% <0.000%> (-1.389%) :arrow_down:
nh13 commented 2 years ago

Always best to be I second place: https://github.com/samtools/htsjdk/pull/1601

nh13 commented 2 years ago

Do you need to fall back on the first and second of pair flags as samtools does? https://github.com/samtools/samtools/blob/bdc5bb81c11f2ab25ea97351213cb87b33857c4d/bam_sort.c#L1798

tfenne commented 2 years ago

@nh13 I guess it depends on whether you want to match samtools 100% or implement the "natural" sort order which doesn't prescribe how ties are broken. I don't think I have a strong opinion so long as e.g. first of pair sorts before second of pair, and secondary/supplementary records sort after primary

tfenne commented 2 years ago

I think the ideal way to make this work, but which would be a PITA to retrofit would be something akin to fgbio's SamOrder, which defines more than the SO tag, and has capabilities for setting SO, SS etc. and for detecting the ordering based on more than the SO tag. But we had the benefit of writing that many years later, and after the sub-sort stuff landed in the spec.

lbergelson commented 2 years ago

Ah, the eternal problem https://github.com/samtools/htsjdk/issues/766. I was assuming that awkwardly working around this was the impetus for the last pr?

I think we have a few options none of which are trivial.
I like what you describe. The subsort field was invented to fix this problem so it makes sense that we would use it. Do you know if samtools is outputting the appropriate SS field in new versions? We never started doing so, but maybe they're more responsible... That would solve the problem going forward, and we could allow users to pick which variant order they want to use when writing new files. It wouldn't solve the problem of reading older sam files.

An alternative would be to try to make some sort of meta queryname sorter which tries to figure out the subsort when it's reading a file. That shouldn't be hard I would think, just try both sort orders and if one passes it's valid. I can't remember if there's any issue with stateful comparators but I think it wouldn't be a problem to have one that figures out which it is and then remembers it.

We should write the SS field in either case...

tfenne commented 2 years ago

@lbergelson I don't think samtools writes the SS field either for queryname sorting. Though we might be able to get @jmarshall or @nh13 to help with that.

I think the easiest first step would be to have HTSJDK accept either as being queryname sorted - HTSJDK already has that SAMSortOrderChecker which is stateful. I don't think it would be hard to extend that to allow for either/or checking for queryname sorting. And that way we could at least get to the point where HTSJDK (and picard etc.) would accept either samtools or HTSJDK queryname sorting (or, "natural" and "lexicographic") as valid. It's possible that would just fix ValidateSamFile in Picard, or it might require a little more tweaking.

Then the question becomes ... if it can accept either as valid, is there any reason not to switch to the samtools style as the default output for queryname?

lbergelson commented 2 years ago

The main reason not to switch the default immediately is that a lot of people used mixed pipeline where some of the tools are on older versions. Switching would break them until they update all parts of the pipeline.

Does samtools have trouble reading htsjdk queryname bams? I'm not sure the problem is symmetrical.

tfenne commented 2 years ago

I don't believe there are issues with samtools reading picard/htsjdk queryname sorted files, but it's not something I do regularly. I think samtools tends more towards trusting/assuming the user provided acceptable input and not validating.