We're hitting some limitations of tsinfer SampleMatcher infrastructure with handling concurrency and limiting memory usage, and the simplest way to deal with this is to switch to using the lower-level CPython interfaces directly.
We create the data model using the _tsinfer.TreeSequenceBuilder. This is a singleton, shared by all the Matcher instances.
We can handle parallelism using concurrent.futures ThreadPoolExecutor, which will make it much simpler to keep track of what's going on
We can avoid conversion to tsinfer's SampleData format. Just passing the sample haplotype to find_path should be sufficient (but there may be gotchas in how the data model is set up)
It should be much easier to maintain compatibility with tsinfer, as we're using low-level components that change very rarely.
Rerunning matches with adaptive precision should be much easier - we won't have to batch things up like before, but can treat each sample independently.
We're hitting some limitations of tsinfer SampleMatcher infrastructure with handling concurrency and limiting memory usage, and the simplest way to deal with this is to switch to using the lower-level CPython interfaces directly.
We create the data model using the _tsinfer.TreeSequenceBuilder. This is a singleton, shared by all the Matcher instances.
We need to do some faffing with the data model get the actual state into this: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1957
We then create a Matcher instance using this, and the basic parameters: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1917
Note that in tsinfer the idea is that we reuse a Matcher instance to avoid mallocing memory all the time, but this may be a premature optimisation. In practise, it tends to make us use more memory I think, as each instance hits a high-water RAM usage. (It's basically just allocating the memory for a tree, so reuse really is a bit pointless: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/lib/ancestor_matcher.c#L104C1-L104C23)
The core then is that we call
find_path
then on this instance: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1851Doing this should have some significant benefits: