Dask and sgkit integration (major release)

tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.

GNU General Public License v3.0


Dask and sgkit integration (major release) #373

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

The plan for the next major release of tsinfer is to integrate with two upstream projects to improve tsinfer's scalability. There are two major parts to this:

1. Replace our custom data file formats with sgkit datasets. This will make the process of importing data into tsinfer far easier and reduce the maintenance burden for us in terms of providing functionality for manipulating input datasets.
2. Change our parallelism model from many threads on one machine to being fully distributed using Dask.

Both of these are significant changes, but together they will vastly improve the data import experience (which is currently pretty bad), improve scalability, and ultimately simplify the code, since we can rely on upstream tools for things such as managing progress bars.
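As a rough illustration of the second point, the sketch below shows the futures-style pattern that a distributed model tends to rely on: work is split into independent chunks and submitted to an executor. It uses the standard library's ThreadPoolExecutor as a stand-in so that it runs anywhere. The `process_chunk` and `run_chunks` functions and the toy chunks are hypothetical and are not tsinfer code. Dask's distributed `Client` offers a `submit` method in the same style, so a function shaped like `run_chunks` could in principle be handed a cluster client instead of a local pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    # Hypothetical stand-in for the real per-chunk work; here it just sums numbers.
    return sum(chunk)


def run_chunks(executor, chunks):
    # Submit every chunk up front, then collect results in submission order.
    # The executor only needs a concurrent.futures-style submit() method.
    futures = [executor.submit(process_chunk, chunk) for chunk in chunks]
    return [future.result() for future in futures]


# Toy input: three independent chunks of work (made up for illustration).
chunks = [[1, 2, 3], [4, 5], [6]]

with ThreadPoolExecutor(max_workers=2) as pool:
    totals = run_chunks(pool, chunks)

print(totals)
```

Keeping the per-chunk function free of shared state is what would allow the same code to move from threads on one machine to workers on many.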

cc. @hyanwong @awohns @benjeffery

hyanwong commented 3 years ago

We should ask the good people who develop sgkit to see if they can start implementing any additional functionality we might need to make a seamless transition from our SampleData format to native sgkit formats. For instance, how do we associate site times, ancestral states, and missing data with sites in sgkit format? And are there equivalents of the subset() functionality, etc.? I assume @jeromekelleher has a good handle on whether there is anything bespoke that would need to be added into sgkit. I suppose one way to find out is to see if it's possible to keep most of the formats.py test suite working after a move to sgkit.

jeromekelleher commented 3 years ago

I'll be working on this in the new year - I'm one of the sgkit maintainers.

hyanwong commented 3 years ago

Yep, I gathered as much. I guess we don't need an issue to keep track of what's required, then?

jeromekelleher commented 3 years ago

I don't know yet how the whole thing will fit together - I'll open issues for specific things as it becomes clearer once I start working on it. Basically, though, sgkit datasets are flexible in that you can add in extra variables as you see fit. So, we're going to take a much looser approach to what we require of input data, just making sure it's an sgkit dataset with the variables (aka columns) we need. We'll probably provide some tools for updating datasets to include the columns we need. The SampleData class will definitely be retired completely; I'm not sure what'll happen to the AncestorData class yet.
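A minimal sketch of the "looser approach" described above: rather than requiring a bespoke file format, check that a dataset carries the variables that are needed and report any that are missing. The `missing_variables` helper and the list of required names are assumptions made for illustration. `call_genotype`, `variant_position` and `variant_allele` follow sgkit's naming conventions, while `variant_ancestral_state` is a hypothetical name for an extra column. The check only relies on the `in` operator, so it works on a plain dict here and should work the same way on an xarray Dataset, which is what sgkit datasets are.

```python
from collections.abc import Mapping

# Assumed requirements, for illustration only. The last name is hypothetical.
REQUIRED_VARIABLES = (
    "call_genotype",
    "variant_position",
    "variant_allele",
    "variant_ancestral_state",
)


def missing_variables(dataset: Mapping, required=REQUIRED_VARIABLES) -> list:
    # Return the required variable names that the dataset does not contain,
    # in the order they are listed in `required`.
    return [name for name in required if name not in dataset]


# A toy dataset standing in for an sgkit dataset (a plain dict, made-up values).
toy = {
    "call_genotype": [[[0, 1]]],
    "variant_position": [10],
    "variant_allele": [["A", "T"]],
}

missing = missing_variables(toy)
print(missing)
```

A tool for "updating datasets to include the columns we need" could start from the list this returns and add each missing variable in turn.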

hyanwong commented 3 years ago

Just while I think of it, if we are suggesting using tree sequences as a compressed format, e.g. for UKB data, then it would be useful to store quality control data for each site in sgkit (which I'm sure is in the works), and then pass it through to the final tree sequence, so that we can e.g. output a QC-containing VCF file for a subset of the data. Come to think of it, we might even be able to use the QC scores for inference.