jguhlin / minimap2-rs

Rust bindings to minimap2 library
Other
60 stars 13 forks source link

Very subtle differences from minimap2 #75

Open rob-p opened 3 weeks ago

rob-p commented 3 weeks ago

I'm just trying to track down the very subtle differences in aligning a dataset between the minimap2-rs bindings and what is output by the reference minimap2 implementation.

Mostly everything is the same, but I noticed some small differences (mostly in the number of supplementary alignments so far, so it's unlikely to affect anything important). However, it raises a broader issue about the degree to which reproducibility might be expected.

For example, one thing I noticed is that the current codebase has the ability to map a read against the index but (1) it doesn't take as input (and therefore doesn't consider) the quality values of the read. What would be required to have a map_with_qual function that also takes the quality string? Also I'm not sure it's related to my issue but (2) the map function also ignores the read name. However, the --seed parameter of minimap2 mentions this is used in tie breaking procedures:

--seed INT | Integer seed for randomizing equally best hits. Minimap2 hashes INT and read name when choosing between equally best hits. [11]

is this something that is done at the level of an ffi function currently exposed, and if so, what would be required to have another map_with_name_and_qual function that takes all 3 of these slices as input?

Anyway, these are very minor details, and so far, using minimap2-rs has been a breeze! Check out the dev branch of oarfish if you want to see how we're currently using it; specifically here.

--Rob

jguhlin commented 3 weeks ago

I don't think quality scores are used in minimap2 for mapping.

The mapping functions are here: https://github.com/lh3/minimap2/blob/8170693de39b667d11c8931d343c94a23a7690d2/minimap.h#L353

The second function is mm_map_frag which has qlens argument, but I believe that is query lengths, rather than quality lengths.

Happy to be wrong, but I don't see it.

As for passing in query name, I can fix that. I think I was just trying to minimize string copies everywhere, and conversion to/from CStr.

rob-p commented 3 weeks ago

Hi @jguhlin,

Thanks for the feedback on this. As I said, I’m not certain what the exact causes of the small discrepancies are. Essentially, what I observe is that out of 10M (simulated) long reads, both native minimap2 and minimap2-rs (with compatible settings) return the same number of mapped reads. The difference I observe is that standard minimap2 reports approximately 6 more supplementary alignments than similarly-configured minimap2-rs. This discrepancy is so small (and the assessed results using these mappings for quantification so similar), that I do not have any major concern. Yet, it’s strange to me that there should be any difference once the seed is set to be the same, as I thought the algorithm was deterministic (and, indeed, I observe exactly the same numbers under multiple runs of either standard minimap2 or using the minimap2-rs library - they consistently differ by this tiny amount, but are perfectly replicable over many runs).

I also agree with your design choice around the map function. Perhaps the best thing to do here would be to add map_with_name or some such, which would be a separate function you could call if you wanted to invoke the corresponding function in minimap2. The minimap2 manpage mentions that when certain chains are tied, the seed is hashed with the query name to do tie breaking. It’s a long shot, but perhaps this could explain the small differences we observe? The specific set of reads can be downloaded from this link if you’re curious what they are.

—Rob

rob-p commented 2 weeks ago

Hi @jguhlin,

Ok, so I've figured this out and it is indeed because of cases where there are chain ties and the hashing of the name breaks them in a way that makes certain alignments supplementary and others not. I've been able to achieve exact concordance with command-line minimap2. To do this, I've added a new function map_with_name to minimap2-rs that is the same as map but which also includes the query name. Right now, the relevant code is duplicated, but you may wish to refactor to minimize such duplication. I'll make a PR shortly and mention this issue. Once it's merged, feel free to close this! Also, though the Aligner clone issue still needs some love, I'd actually really appreciate a tagged release soon, as my makeshift solution for that works well right now and I'd like to tag a new release of oarfish with the raw read alignment capability.

Thanks! Rob