Closed jeromekelleher closed 5 months ago
I'm interested in working on this task.
If the CLI interprets `--num-parts` as the total, the CLI could mirror `scan_vcfs`'s logic to divide the partitions evenly: `max(1, target_num_partitions // len(paths))`. I am wondering whether, alternatively, the CLI should distribute the partitions proportionally to the sizes of the VCF files, so that larger VCFs are divided into more partitions.
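A minimal sketch contrasting the two strategies (the function names, paths, and sizes here are made up for illustration; they are not the project's actual API):

```python
def even_split(target_num_partitions, paths):
    # Even split mirroring scan_vcfs: every file gets the same partition count.
    per_file = max(1, target_num_partitions // len(paths))
    return {p: per_file for p in paths}


def proportional_split(target_num_partitions, sizes):
    # sizes: mapping of path -> file size in bytes.
    # Larger files get proportionally more partitions; every file gets at least one.
    total = sum(sizes.values())
    return {
        p: max(1, round(target_num_partitions * size / total))
        for p, size in sizes.items()
    }


sizes = {"a.vcf.gz": 100, "b.vcf.gz": 300}
print(even_split(8, list(sizes)))    # {'a.vcf.gz': 4, 'b.vcf.gz': 4}
print(proportional_split(8, sizes))  # {'a.vcf.gz': 2, 'b.vcf.gz': 6}
```

The proportional variant would need the file sizes up front, which is the extra `stat` cost discussed below.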
Hi @Will-Tyler, thanks for the offer of contributing! :wave:
I think we should implement the simple solution of splitting evenly per input VCF in the first instance, because otherwise we have to `stat` all the inputs, which would lead to significant latency if we have hundreds of files. We could implement a more sophisticated strategy later, but I think just following what `scan_vcfs` is doing for now would be a good start.
Just FYI, I'm on leave for a week or so, so I might be a bit slow to respond.
Should be a straightforward update. The only question really is whether the `--num-parts` option should be interpreted as per-file, or overall. It seems fairly clear that we would want the total number of partitions (that's what `dexplode-init` does), but we'd need to document that these are not distributed evenly over the files (or implement it so that they are).
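One wrinkle worth documenting with the total interpretation: because the per-file count is an integer division, the resulting number of partitions can undershoot the requested total. A hypothetical sketch (the function name is invented for illustration):

```python
def partitions_per_file(num_parts_total, num_files):
    # Even split following scan_vcfs' formula; note the integer division
    # means num_files * result may be less than num_parts_total.
    return max(1, num_parts_total // num_files)


per_file = partitions_per_file(10, 3)
print(per_file, per_file * 3)  # 3 partitions per file, 9 partitions overall (not 10)
```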