sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

Two-pass non-Dask VCF conversion #1185

Closed jeromekelleher closed 9 months ago

jeromekelleher commented 9 months ago

Very much WIP - not ready for review!

jeromekelleher commented 9 months ago

I've just added a basic PLINK conversion approach, which converts HAPNEST chr21 in about 20 minutes (6 workers, 8 encode threads per worker, peaking at about 40 GB of RAM per worker). It's chugging through chr2 in what looks like linearly scaling time, so something on the order of an hour. Watching with Linux perf, the vast majority of the time is spent on Blosc encoding and compressing the chunks.
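For readers unfamiliar with the pattern, here is a rough sketch of the "encode threads" idea: each chunk is compressed independently on a thread pool, with no task graph involved. This is not the code from this PR. It uses the standard library's zlib as a stand-in for Blosc so that it runs without extra dependencies, and the function names, chunk size, and payload are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut a byte buffer into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def encode_chunks(chunks: list[bytes], encode_threads: int = 8) -> list[bytes]:
    """Compress each chunk independently on a thread pool.

    zlib is only a stand-in for Blosc here. CPython's zlib module releases
    the GIL while compressing, so the threads can run in parallel.
    """
    with ThreadPoolExecutor(max_workers=encode_threads) as pool:
        # map() preserves input order: output chunk i matches input chunk i.
        return list(pool.map(zlib.compress, chunks))


def decode_chunks(encoded: list[bytes]) -> bytes:
    """Decompress the chunks and join them back into one buffer."""
    return b"".join(zlib.decompress(chunk) for chunk in encoded)


# Hypothetical genotype-like payload: 1,000,000 bytes of small integer codes.
raw = bytes([0, 1, 2, 1]) * 250_000
chunks = split_into_chunks(raw, chunk_size=65_536)
encoded = encode_chunks(chunks, encode_threads=8)
```

Because every chunk is encoded on its own, the work is easy to spread across threads, which is consistent with compression dominating the profile. The real figures will depend on the codec and chunk shape, which this sketch does not try to reproduce.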

In contrast, using the existing plink_to_zarr function, Dask seems to sit there thinking about the task graph for several minutes before doing anything useful (and emits an opaque and unhelpful warning for users who just want to convert their data). Looking at perf, time seems to be mostly spent in NumPy code, with Blosc encoding coming much further down (although it's using lz4, so it's not a like-for-like comparison).

I'll update when it finishes to give the overall timing.

jeromekelleher commented 9 months ago

Update: it failed after about an hour with a bunch of completely cryptic messages.

jeromekelleher commented 9 months ago

Closing as development has moved to https://github.com/jeromekelleher/bio2zarr