tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0

Match chunks of samples #793

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

We want to be able to run the sample matching phase in parallel for large sets of samples.

A simple way to do this is to run slices of samples, storing the match results, and then add them back in later. Something like

tsinfer match-sample-slice <sample data> [standard options] <start> <end> <output-file>

where start and end are Python-style slice coordinates into the samples (individuals).
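To make the slice coordinates concrete, here is a minimal sketch of how a driver script might split the samples into half-open (start, end) pairs, one per job. The helper name `slice_bounds` and the idea of a fixed chunk size are assumptions for illustration only, not part of tsinfer.

```python
def slice_bounds(num_samples, chunk_size):
    """Return half-open (start, end) pairs that together cover range(num_samples).

    Hypothetical helper: the pairs follow Python slice semantics, so
    samples[start:end] selects exactly the samples handled by one job.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]
```

Each pair would then be passed as the start and end arguments of one match-sample-slice invocation. The last chunk is simply shorter when the chunk size does not divide the number of samples.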

The output file is a path to a file in which we write the results in (say) JSON format. Each result consists of a path (a list of (left, right, parent) tuples) and a set of mutations.
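One possible JSON layout, as a sketch only: the field names ("start", "end", "results", "path", "mutations") and the representation of a mutation as a (site, derived_state) pair are assumptions made here, not a format tsinfer defines.

```python
import json


def encode_slice(start, end, results):
    """Serialise the match results for samples[start:end] as a JSON string.

    Assumed shape of each result (hypothetical):
      {"path": [(left, right, parent), ...],
       "mutations": {(site, derived_state), ...}}
    """
    if len(results) != end - start:
        raise ValueError("expected one result per sample in the slice")
    payload = {
        "start": start,
        "end": end,
        "results": [
            {
                "path": [list(edge) for edge in result["path"]],
                # Sets are unordered, so sort to get stable output.
                "mutations": [list(m) for m in sorted(result["mutations"])],
            }
            for result in results
        ],
    }
    return json.dumps(payload, sort_keys=True)


def decode_slice(text):
    """Inverse of encode_slice: return (start, end, results)."""
    payload = json.loads(text)
    results = [
        {
            "path": [tuple(edge) for edge in r["path"]],
            "mutations": {tuple(m) for m in r["mutations"]},
        }
        for r in payload["results"]
    ]
    return payload["start"], payload["end"], results
```

Storing start and end alongside the results means the combining step does not have to rely on file names or argument order to know which samples a file covers.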

We also want a command to combine these paths back to create the final tree sequence, say

tsinfer combine-sample-slices <sample data> [standard options] slice_file1, ..., slice_file_n

I think this should be deterministic and give the same results as the standard approach.
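A rough sketch of why the combining step can be deterministic: if every slice records its own start and end, the results can be put back into sample order regardless of the order in which the slice files are listed. The function name `combine_slices` and the (start, end, results) tuple shape are hypothetical, and the real step would go on to build the tree sequence from the ordered results.

```python
def combine_slices(slices):
    """Merge per-slice match results into a single list ordered by sample index.

    `slices` is an iterable of (start, end, results) tuples (assumed shape),
    where results[i] belongs to sample start + i. Returns a list of
    (sample_index, result) pairs sorted by sample index.
    """
    combined = {}
    # Sort on the coordinates only, so the input order does not matter.
    for start, end, results in sorted(slices, key=lambda s: (s[0], s[1])):
        if len(results) != end - start:
            raise ValueError(f"slice [{start}, {end}) has {len(results)} results")
        for offset, result in enumerate(results):
            index = start + offset
            if index in combined:
                raise ValueError(f"sample {index} appears in more than one slice")
            combined[index] = result
    return sorted(combined.items())
```

Overlapping slices are treated as an error here, since two results for one sample would make the outcome depend on which one wins.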

jeromekelleher commented 1 year ago

It would be nice if tsinfer combine-sample-slices warned if the total number of samples provided is not the same as what's in the original sample data file, but it shouldn't be a hard requirement. We'll want to look at the trees we get for a subset of the samples.
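The "warn, don't fail" behaviour could look something like the sketch below. `check_coverage` is a hypothetical helper, and it assumes the slices do not overlap; it uses the standard library warnings module so that combining a subset still goes ahead.

```python
import warnings


def check_coverage(num_samples, slice_coords):
    """Warn (without raising) if the slices do not account for every sample.

    `slice_coords` is a list of half-open (start, end) pairs, assumed to be
    non-overlapping. Returns the number of samples the slices cover.
    """
    matched = sum(end - start for start, end in slice_coords)
    if matched != num_samples:
        warnings.warn(
            f"slices cover {matched} samples but the sample data has "
            f"{num_samples}; continuing with the subset provided",
            UserWarning,
            stacklevel=2,
        )
    return matched
```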

jeromekelleher commented 1 year ago

Example here of abstracting out the matching from final result building.