sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
26 stars 7 forks source link

New tool: tskit2zarr #232

Open jeromekelleher opened 4 months ago

jeromekelleher commented 4 months ago

It would be useful for simulation and other applications to be able to efficiently convert tskit tree sequences to VCF Zarr. While this can currently be done with tskit vcf and vcf2zarr it's not very efficient, and some information is lost in the VCF conversion.

The CLI would look something like

tskit2zarr convert input.trees output.vcz

we would include the standard options for multiple workers, etc. I don't think there's any need for distributed commands (but they could be added later, I guess).

Operationally, we could depend on tszip using the load function added in 0.2.3. Since tszip is basically tskit + zarr, there's no harm in adding support.

We should probably make this an optional dependency, though, since it is a relatively niche use-case?

jeromekelleher commented 4 months ago

cc @hyanwong - as just discussed

hyanwong commented 2 months ago

Thanks. In my case this would basically be used as a drop-in replacement for tsinfer.SampleData.from_tree_sequence() in a notebook, so I'm wondering about access via a python API? I'm keen to be able to do this for small instances from the REPL, so I can run from a notebook without dropping into the shell.

jeromekelleher commented 2 months ago

Yep, python API would likely be the primary interface

hyanwong commented 2 months ago

Great, I assume we can steal some of the code from https://github.com/tskit-dev/tsinfer/blob/9d8f9349c953da269d69b1b6a6352f10c7c5abb9/tsinfer/formats.py#L1526, although maybe we would be better starting from scratch. Either way, doing the ts->vcf->zarr gives us a reasonable way to test it, I assume.

jeromekelleher commented 2 months ago

I don't think it would share an awful lot of code to be honest, as the writing back-end is where the complexity is. Having said that, I expect it to be quite easy, the only tricky bit is to abstract some of the code used just for converting VCFs to being a bit more general (i.e., shared across the tskit, plink and VCF tools).

hyanwong commented 1 month ago

Just revisiting this, which will be useful for teaching. I would very much like this to be a pure python conversion (and therefore not to depend on cyvcf2, which is e.g. not present in pyodide installations. I don't think it needs VCF reading functionality to convert a tree sequence, right?

jeromekelleher commented 1 month ago

No, but it would involve us making vcf2zarr an optional feature of bio2zarr (i.e., not a required dependency). I think this would cause problems for lots of users, and is perhaps not worth optimising for the tskit-on-pyodide corner case.

hyanwong commented 1 month ago

Fair enough. If we solve https://github.com/tskit-dev/tsinfer/issues/924 then that gives another potential (and perhaps faster, in-memory) route to turn a tree sequence into something that tsinfer can read.