sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

tskit example of sgkit Zarr for intermediate data #955

Open hammer opened 1 year ago

hammer commented 1 year ago

To be assigned to @benjeffery once he's a member of our org!

hammer commented 1 year ago

https://github.com/pystatgen/sgkit/issues/347 may be related

jeromekelleher commented 1 year ago

The point we're illustrating here is the power of open and extensible formats. Previously, we had to convert VCFs to our own Zarr formats, which was time-consuming and tedious. Now we can just add a few extra fields and bits of metadata to the sgkit dataset, which lets the user do QC directly and avoids the need for several copies of the data (beyond pulling the data out of VCF in the first place, but I'd imagine we'll have made the point about columnar binary storage well by then).