jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
4 stars 3 forks source link

Use lower-level tsinfer components #381

Open jeromekelleher opened 4 hours ago

jeromekelleher commented 4 hours ago

We're hitting some limitations of tsinfer SampleMatcher infrastructure with handling concurrency and limiting memory usage, and the simplest way to deal with this is to switch to using the lower-level CPython interfaces directly.

We create the data model using the _tsinfer.TreeSequenceBuilder. This is a singleton, shared by all the Matcher instances.

We need to do some faffing with the data model get the actual state into this: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1957

We then create a Matcher instance using this, and the basic parameters: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1917

Note that in tsinfer the idea is that we reuse a Matcher instance to avoid mallocing memory all the time, but this may be a premature optimisation. In practise, it tends to make us use more memory I think, as each instance hits a high-water RAM usage. (It's basically just allocating the memory for a tree, so reuse really is a bit pointless: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/lib/ancestor_matcher.c#L104C1-L104C23)

The core then is that we call find_path then on this instance: https://github.com/tskit-dev/tsinfer/blob/7f23758e21b9c454d8925aa1bb81bd42e5c003c8/tsinfer/inference.py#L1851

Doing this should have some significant benefits:

  1. We can handle parallelism using concurrent.futures ThreadPoolExecutor, which will make it much simpler to keep track of what's going on
  2. We can avoid conversion to tsinfer's SampleData format. Just passing the sample haplotype to find_path should be sufficient (but there may be gotchas in how the data model is set up)
  3. It should be much easier to maintain compatibility with tsinfer, as we're using low-level components that change very rarely.
  4. Rerunning matches with adaptive precision should be much easier - we won't have to batch things up like before, but can treat each sample independently.
jeromekelleher commented 4 hours ago

We should put this infrastructure in a new module, matching.py