Open jeromekelleher opened 3 years ago
We should ask the good people who develop sgkit whether they can start implementing any additional functionality we might need to make a seamless transition from our SampleData format to native sgkit formats. For instance, how do we associate site times, ancestral states, and missing data with sites in sgkit format? And are there equivalents of the subset() functionality, etc.? I assume @jeromekelleher has a good handle on whether anything bespoke would need to be added to sgkit. I suppose one way is to see if it's possible to keep most of the formats.py test suite working after a move to sgkit.
I'll be working on this in the new year - I'm one of the sgkit maintainers.
Yep, I gathered as much. I guess we don't need an issue to keep track of what's required, then?
I don't know yet how the whole thing will fit together - I'll open issues for specific things as it becomes clearer when I start working on it. Basically, though, sgkit datasets are flexible in that you can add in extra variables as you see fit. So, we're going to take a much looser approach to what we require of input data, just making sure it's an sgkit dataset with the variables (aka columns) we need. We'll probably provide some tools for updating datasets to include the columns we need. The SampleData class will definitely be retired completely; I'm not sure what'll happen to the AncestorData class yet.
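To make the "just check for the variables we need" idea concrete, here is a minimal sketch rather than a planned API. The helper names (`missing_variables`, `check_dataset`), the list of required variables, and the `variant_ancestral_state` variable are all hypothetical; `call_genotype`, `variant_position` and `variant_allele` follow sgkit's usual naming. A plain dict stands in for the dataset so the snippet runs without sgkit installed; since sgkit datasets are xarray Datasets, which support the same `name in dataset` check, the helpers should work on those unchanged.

```python
from typing import Iterable, List, Mapping

# Hypothetical set of variables an inference step might require.
# The names follow sgkit conventions, but the selection is an assumption.
REQUIRED_VARIABLES = ("call_genotype", "variant_position", "variant_allele")


def missing_variables(
    dataset: Mapping, required: Iterable[str] = REQUIRED_VARIABLES
) -> List[str]:
    """Return the required variable names that are absent from the dataset."""
    return [name for name in required if name not in dataset]


def check_dataset(
    dataset: Mapping, required: Iterable[str] = REQUIRED_VARIABLES
) -> None:
    """Raise ValueError if any required variable is missing."""
    missing = missing_variables(dataset, required)
    if missing:
        raise ValueError(
            "Dataset is missing required variables: " + ", ".join(missing)
        )


# A plain dict standing in for a dataset with 2 variants and 2 diploid samples.
dataset = {
    "call_genotype": [[[0, 1], [1, 1]], [[0, 0], [0, 1]]],
    "variant_position": [100, 250],
    "variant_allele": [["A", "T"], ["G", "C"]],
}
check_dataset(dataset)  # passes silently: all required variables are present

# Extra per-site information is just another variable (hypothetical name).
dataset["variant_ancestral_state"] = ["A", "G"]
check_dataset(dataset, required=REQUIRED_VARIABLES + ("variant_ancestral_state",))
```

The point of the loose approach is that nothing beyond the required names is constrained, so users can carry whatever additional variables they like alongside them.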
Just while I think of it, if we are suggesting using tree sequences as a compressed format, e.g. for UKB data, then it would be useful to store quality control data for each site in sgkit (I'm sure this is in the works), and then pass it through to the final tree sequence, so that we can e.g. output a QC-containing VCF file for a subset of the data. Come to think of it, we might even be able to use the QC scores for inference.
The plan for the next major release of tsinfer is to integrate with two upstream projects to improve tsinfer's scalability. There are two major parts to this:
Both of these are significant changes, but together they will vastly improve the data import experience (which is currently pretty bad), improve scalability, and ultimately simplify the code as we can use upstream tools for a lot of things like managing progress bars, etc.
cc. @hyanwong @awohns @benjeffery