Open jeromekelleher opened 5 months ago
cc @hyanwong - as just discussed
Thanks. In my case this would basically be used as a drop-in replacement for tsinfer.SampleData.from_tree_sequence()
in a notebook, so I'm wondering about access via a python API? I'm keen to be able to do this for small instances from the REPL, so I can run from a notebook without dropping into the shell.
Yep, python API would likely be the primary interface
Great, I assume we can steal some of the code from https://github.com/tskit-dev/tsinfer/blob/9d8f9349c953da269d69b1b6a6352f10c7c5abb9/tsinfer/formats.py#L1526, although maybe we would be better starting from scratch. Either way, doing the ts->vcf->zarr gives us a reasonable way to test it, I assume.
I don't think it would share an awful lot of code to be honest, as the writing back-end is where the complexity is. Having said that, I expect it to be quite easy, the only tricky bit is to abstract some of the code used just for converting VCFs to being a bit more general (i.e., shared across the tskit, plink and VCF tools).
Just revisiting this, which will be useful for teaching. I would very much like this to be a pure python conversion (and therefore not to depend on cyvcf2, which is e.g. not present in pyodide installations. I don't think it needs VCF reading functionality to convert a tree sequence, right?
No, but it would involve us making vcf2zarr an optional feature of bio2zarr (i.e., not a required dependency). I think this would cause problems for lots of users, and is perhaps not worth optimising for the tskit-on-pyodide corner case.
Fair enough. If we solve https://github.com/tskit-dev/tsinfer/issues/924 then that gives another potential (and perhaps faster, in-memory) route to turn a tree sequence into something that tsinfer can read.
It would be useful for simulation and other applications to be able to efficiently convert tskit tree sequences to VCF Zarr. While this can currently be done with
tskit vcf
andvcf2zarr
it's not very efficient, and some information is lost in the VCF conversion.The CLI would look something like
we would include the standard options for multiple workers, etc. I don't think there's any need for distributed commands (but they could be added later, I guess).
Operationally, we could depend on tszip using the load function added in 0.2.3. Since tszip is basically tskit + zarr, there's no harm in adding support.
We should probably make this an optional dependency, though, since it is a relatively niche use-case?