Open jeromekelleher opened 1 year ago
It would be nice if `tsinfer combine-sample-slices` warned if the total number of samples provided is not the same as what's in the original sample data file, but it shouldn't be a hard requirement. We'll want to look at the trees we get for a subset of the samples.
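A minimal sketch of the "warn, don't fail" check suggested here. This is not tsinfer code: `check_slice_coverage` and its arguments are hypothetical names, and it assumes the slices are non-overlapping `(start, end)` pairs.

```python
import warnings


def check_slice_coverage(slices, num_samples):
    """Warn, rather than raise, if the slices do not account for every sample.

    `slices` is a list of (start, end) python-style slice coordinates,
    assumed here to be non-overlapping. Returns the number of samples covered.
    """
    covered = sum(end - start for start, end in slices)
    if covered != num_samples:
        # A warning only: working with a subset of the samples is still allowed.
        warnings.warn(
            f"Slices cover {covered} samples but the sample data file "
            f"contains {num_samples}"
        )
    return covered
```

Using `warnings.warn` rather than raising keeps the subset case usable while still flagging a likely mistake.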
Example here of abstracting out the matching from final result building.
We want to be able to run the sample matching phase in parallel for large sets of samples.
A simple way to do this is to run slices of samples, storing the match results, and then add them back in later. Something like
where `start` and `end` are python-style slice coordinates into the samples (individuals). Output file is a path to a file we write the results in (say) JSON format. Each result is a path, list of `(left, right, parent)` tuples and set of mutations.

We also want a command to combine these paths back to create the final tree sequence, say
I think this should be deterministic and give the same results as the standard approach.
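To make the proposal concrete, here is a rough sketch of the slice-then-combine workflow in plain Python. It is only an illustration under stated assumptions: `match_sample`, `match_slice` and `combine_slices` are hypothetical names, not tsinfer functions, the JSON layout is invented for the example, and the matching step is a placeholder that returns a dummy path.

```python
import json
import tempfile
from pathlib import Path


def match_sample(sample_id):
    # Placeholder for the real matching step (hypothetical): returns one
    # (left, right, parent) edge as a list, plus an empty set of mutations.
    return {"sample": sample_id, "path": [[0, 10, 0]], "mutations": []}


def match_slice(start, end, output_file):
    # Match samples in the python-style slice [start, end) and store the
    # results as JSON so they can be added back in later.
    results = [match_sample(j) for j in range(start, end)]
    payload = {"start": start, "end": end, "results": results}
    Path(output_file).write_text(json.dumps(payload))


def combine_slices(slice_files):
    # Load every slice file and order by slice start, so the combined result
    # does not depend on the order in which the files are supplied.
    loaded = [json.loads(Path(f).read_text()) for f in slice_files]
    loaded.sort(key=lambda d: d["start"])
    combined = []
    for d in loaded:
        combined.extend(d["results"])
    return combined


def demo():
    # Run two slices out of order, then combine them.
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for start, end in [(2, 4), (0, 2)]:
            path = Path(tmp) / f"slice_{start}_{end}.json"
            match_slice(start, end, path)
            files.append(path)
        return combine_slices(files)
```

Sorting by slice start before merging is one simple way to get the same result whatever order the slices finish in. Each `match_slice` call is independent of the others, so the slices could be run as separate processes or cluster jobs.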