samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

Make AbstractVCFCodec thread-safe, allowing for multi-threaded VariantContext genotype decoding #1636

Open rcoleb opened 1 year ago

rcoleb commented 1 year ago

Description

Problem: Processing multiple VariantContext objects concurrently results in shared data due to the stateful nature of AbstractVCFCodec. Internal data structures (parts, genotypeParts, alleleMap, lineNo) are re-used to parse the LazyGenotypesContext contained within VariantContext objects; this re-use makes AbstractVCFCodec not thread-safe - calling LazyGenotypesContext.decode() in concurrent threads results in the genotypes of one VariantContext leaking over to other VariantContext objects and overwriting the genotypes there.

Solution: I have wrapped the internal state objects of AbstractVCFCodec in ThreadLocal objects. This will give each parsing thread its own copies of the internal state, removing cross-thread information leakage. This solution should not appreciably affect existing workflows, as single threaded applications will still maintain the resource re-use optimization. This solution also removes the need for expensive and rate-limiting synchronization.

I believe this fixes #1026, as well as the issue reported here.

This also resolves the // todo: make this thread safe? comment on line 69 of AbstractVCFCodec.

Things to think about before submitting:

rcoleb commented 1 year ago

I have pushed a commit that fixes the issue that caused the automatic tests to fail. No error was shown in the IDE due to String.format accepting Object arguments rather than more specifically-typed values. I have checked for other usages of the variable in question and this appears to be the only remaining usage.

lbergelson commented 1 year ago

@rcoleb Thank you for doing this. This seems like a good idea, it's very natural to want to decode these things in parallel and confusing to users that it doesn't work right.

I'll have to take a close look to make sure there aren't any wierd knock on effects, it's vaguely possible there are wierd expectations about String identity equality or something along those lines that might be disrupted by using multiple caches.

Do you know if there is any noticeable performance impact of using threadlocal access vs direct access? I assume it's non 0 but likely minimal compared to the extremely costly and wasteful vcf parsing.

We have a two very large outstanding pr's (#1581 and #1596) which are likely to conflict with this. I will probably hold off on trying to merge this until we get in #1581 (and possibly 1596) in order to reduce any rebase pain. I'm currently in the process of reviewing the first one so that should hopefully move along soon.

rcoleb commented 1 year ago

There will certainly be some performance cost, but it is likely minimal by my estimation: usage of ThreadLocals will inject a couple of additional method calls into the variable usage but because internally they're implemented as a map, not much more than that. And if the genotypes are even being decoded, it seems likely that the addition of method calls is minimal compared to whatever operations will be performed on the genotypes themselves.

As an aside, there is a possible solution for adding a "threadsafe" flag to VCFFileReader constructors, which could direct the AbstractVCFCodec to use thread-safe variables if requested, and direct-access otherwise, but that might be more pollution of the constructor that you'd prefer. Overall, I believe (and our internal testing supports) the performance impacts from ThreadLocal usage are much less than the performance gains from concurrent processing.

Also worth noting, I also don't know how HTSJDK performs with files that don't use AbstractVCFCodec (e.g. BCF files) and whether those would also benefit from thread-safe operations. We don't use those much, so I don't have the internal pressure to optimize a workflow around them.

Sounds good! Please of course take this in when it is appropriate (if at all) - I am just happy for the potential opportunity to contribute to a library that we use heavily in our lab. In the meantime we are synchronizing the genotype decoding, as even the performance impact of synchronization doesn't compare to the gains we are getting from parallel processing.