samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Embed ref2 unsorted #1558

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago
daviesrob commented 1 year ago

This seems OK in the case where you have no references at all, but I managed to break it by allowing it to find a subset of the references. This may be a different, but related, bug.

To reproduce, make a name-sorted file that has no @SQ UR: tags but does have @SQ M5:. Then find the reference for the first read it would try to load, and make an MD5 reference directory with just that one in it. Then try to convert the file to CRAM.

The problem is that getting into multi_seq mode also allocates c->refs_used[], which triggers the MD5 validation code here. The validation code fails when it can't load one of the used references.

It also looks like there are possible race conditions on fd->no_ref as after this PR it can change state in another thread.

jkbonfield commented 1 year ago

I'm trying to repeat this and cannot get it to fail. Do you have test data please?

Edit: well I can with threads, but if I recall you had a case of failure without threading? I'm working on fixing the trivial thread issues, caused by the new ability to change fd->no_ref from within a thread (previously impossible).

Ah sorry, I had a typo in my ref hash! duh. My guessed at fix cures it sure enough. I still need to tweak some threading cases though.

jkbonfield commented 1 year ago

I think I've fixed the new (or quite possibly old) problem you found, as well as fixed the threading issues caused by changing no_ref from within cram threads. That is particularly insideous as there's so many places it's looked at. In the end I took the same approach we did for some other fields, which is migrate them from fd into the container at the time of creation, and make everything underneath that (eg slices, records, aux fields, etc) all relate to the container variable which doesn't change from that point onwards. It may not be necessary everywhere, but it's safe (and failed attempts to fix this showed at least some of it was necessary).

Also fixed the CRAM decoder to be more robust so it doesn't access uninitialised md5sum buffers when we have no reference loaded. Such CRAM files are malformed, and only temporarily created by the earlier commit (now fixed), but improving the robustness of the decoder is good.

daviesrob commented 1 year ago

This looks better, albeit a bit chatty when running multi-threaded and it discovers it needs no-ref mode. I also see that the stray no_ref = c->no_ref; mentioned above is still there.

Could you squash this into its final form please?

jkbonfield commented 1 year ago

Chatty is deliberate, as it has an impact on the CRAM size. If someone accidentally forgets to specify the reference I'd rather they get a few warning messages, possibly more than once if threaded, than a silent degregadation in compression performance. I don't think it's worth putting lots of extra thread locks and a shared variable to track how many times we've reported the warnings, so I think the current warnings represent a good middle ground unless you can think of a simple fix.

I rebased and squashed it.