sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

Integrity tests needed for dencode #138

Closed: jeromekelleher closed this issue 6 months ago

jeromekelleher commented 6 months ago

#136 added the distributed encode operation, but only with very basic tests that check things work in the nominal case.

We need some tests to be sure we do the right thing when things go wrong.

benjeffery commented 6 months ago

On this theme: currently, if a partition is missing due to a job failure, finalise will error out. Rerunning the failed job and then finalising again also errors, because the first finalise attempt already moved things around.

jeromekelleher commented 6 months ago

Should finalise try to check if everything is present first? This means looking at each array in each partition, so O(10000) directory check operations. This is probably simpler than trying to make finalise robust to making multiple passes, but is ultimately less robust I guess.

benjeffery commented 6 months ago

Checking that each partition is present and that there are no wip arrays should be the equivalent of O(partitions) ls operations, which I think is doable?
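A minimal sketch of that check, to make the cost concrete. The on-disk layout here is an assumption for illustration only: one directory per partition named `p0`, `p1`, ... under a working root, with in-progress arrays carrying a `wip.` prefix. Neither those names nor `find_incomplete_partitions` are taken from bio2zarr itself.

```python
from pathlib import Path


def find_incomplete_partitions(root, num_partitions):
    """Return (partition_index, reason) pairs for partitions not ready to finalise.

    Assumed (hypothetical) layout: root/p<j>/ holds the arrays for partition j,
    and an array that is still being written is named "wip.<array_name>".
    """
    root = Path(root)
    problems = []
    for j in range(num_partitions):
        part = root / f"p{j}"
        if not part.is_dir():
            problems.append((j, "missing"))
            continue
        # One directory listing per partition, regardless of the array count.
        wip = sorted(p.name for p in part.iterdir() if p.name.startswith("wip."))
        if wip:
            problems.append((j, "wip: " + ", ".join(wip)))
    return problems
```

Finalise could call something like this first and refuse to move anything if the returned list is non-empty, so a failed job can be rerun and finalise retried from an unchanged state.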

jeromekelleher commented 6 months ago

Ok, let's try that in the first instance

jeromekelleher commented 6 months ago

I'll code it up later

jeromekelleher commented 6 months ago

Ah - we can just write a "partition-done" file, or rename the partition directory to do this. So yeah, we can definitely do a reasonable job of checking in a reasonable time.
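The marker-file idea could look roughly like the sketch below. The file name `partition-done`, the `p<j>` directory names, and both function names are hypothetical and chosen only for illustration; this is not the code that was merged.

```python
from pathlib import Path

DONE_MARKER = "partition-done"  # hypothetical marker file name


def mark_partition_done(partition_dir):
    """Call as the very last step of encoding a partition."""
    Path(partition_dir, DONE_MARKER).touch()


def check_all_partitions_done(root, num_partitions):
    """Raise before finalise changes anything if a partition is unfinished.

    Assumes (hypothetically) that partition j lives in root/p<j>/.
    Costs one existence check per partition.
    """
    root = Path(root)
    missing = [
        j for j in range(num_partitions)
        if not (root / f"p{j}" / DONE_MARKER).exists()
    ]
    if missing:
        raise FileNotFoundError(f"Partitions not fully encoded: {missing}")
```

Because the check runs before any files are moved, a failure leaves the working directory as it was. The rename variant would work the same way, with the existence of the final directory name playing the role of the marker.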