sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Add support for multiple files to vcfpartition #212

Closed jeromekelleher closed 2 months ago

jeromekelleher commented 3 months ago

Should be a straightforward update. The only question really is whether the --num-parts option should be interpreted as per-file, or overall. It seems fairly clear that we would want the total number of partitions (that's what dexplode-init does), but we'd need to document that these are not distributed evenly over the files (or implement it so that they are).

Will-Tyler commented 3 months ago

I'm interested in working on this task.

If the CLI interprets --num-parts as the total, it could mirror scan_vcfs's logic to divide the partitions evenly: max(1, target_num_partitions // len(paths)). Alternatively, I wonder whether the CLI should distribute the partitions in proportion to the size of each VCF file, so that larger VCFs are divided into more partitions.
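To make the even split concrete, here is a minimal sketch of the expression quoted above. The function name partitions_per_file and the file names are hypothetical and not part of bio2zarr. It also shows that the realised total can differ from the requested total because of the integer division and the floor of one partition per file.

```python
def partitions_per_file(target_num_partitions: int, num_files: int) -> int:
    """Even split: every file gets the same number of partitions, at least one.

    Hypothetical helper mirroring max(1, target_num_partitions // len(paths)).
    """
    if num_files < 1:
        raise ValueError("need at least one input file")
    return max(1, target_num_partitions // num_files)


# Hypothetical file names, for illustration only.
paths = ["a.vcf.gz", "b.vcf.gz", "c.vcf.gz"]

per_file = partitions_per_file(10, len(paths))
total = per_file * len(paths)
print(f"requested=10 per_file={per_file} realised_total={total}")

# With more files than requested partitions, each file still gets one,
# so the realised total exceeds the request.
print(partitions_per_file(2, 5) * 5)
```

Asking for 10 partitions over 3 files gives 3 per file and 9 in total, while asking for 2 partitions over 5 files gives 5 in total, so the documentation would need to describe --num-parts as a target rather than an exact count.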

jeromekelleher commented 2 months ago

Hi @Will-Tyler, thanks for the offer of contributing! :wave:

I think we should implement the simple solution of splitting evenly per input VCF in the first instance because otherwise we have to stat all the inputs, which would lead to significant latency if we have hundreds of files. We could implement a more sophisticated strategy later, but I think just following what scan_vcfs is doing for now would be a good start.
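For comparison, a size-proportional strategy might look like the sketch below. This is an assumption about one possible approach, not anything implemented in bio2zarr, and the names allocate_by_size and allocate_for_paths are hypothetical. The os.stat call per input is the latency cost mentioned above.

```python
import os


def allocate_by_size(sizes, target_num_partitions):
    """Give each file a share of partitions proportional to its size.

    Hypothetical helper: every file gets at least one partition, so the
    realised total may differ from target_num_partitions.
    """
    total_size = sum(sizes)
    if total_size == 0:
        # No size information to go on: fall back to one partition each.
        return [1 for _ in sizes]
    return [
        max(1, round(target_num_partitions * size / total_size))
        for size in sizes
    ]


def allocate_for_paths(paths, target_num_partitions):
    """Stat every input file, then allocate partitions by file size."""
    # One filesystem call per input; this is the step that gets slow
    # with hundreds of files, especially on network storage.
    sizes = [os.stat(path).st_size for path in paths]
    return allocate_by_size(sizes, target_num_partitions)
```

For example, allocate_by_size([100, 300, 600], 10) assigns 1, 3 and 6 partitions respectively. Compressed file size is only a rough proxy for the number of records, which is another reason the even split is a reasonable first step.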

Just FYI, I'm on leave for a week or so, so I might be a bit slow to respond.