Closed jeromekelleher closed 6 months ago
On this theme - currently, if a partition is missing due to job failure, finalise
will error out. Running the failed job and then finalising again also errors, because
the first finalise attempt already moved things around.
Should finalise try to check if everything is present first? This means looking at each array in each partition, so O(10000) directory check operations. This is probably simpler than trying to make finalise robust to making multiple passes, but is ultimately less robust I guess.
Checking that each partition is present and that there are no WIP
(work-in-progress) arrays should be the equivalent of O(partitions) ls
operations, which I think is doable?
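Something like this rough sketch is what I have in mind for that check. The directory layout (`p0`, `p1`, ... under a work directory), the `wip_` prefix and the function name are all assumptions for illustration, not what the code currently does:

```python
from pathlib import Path


def find_incomplete_partitions(work_dir, num_partitions):
    """Return (partition_index, reason) pairs for partitions that are not ready.

    Assumes a hypothetical layout of work_dir/p<index>/ with unfinished
    arrays stored under names starting with "wip_".
    """
    problems = []
    for index in range(num_partitions):
        partition_dir = Path(work_dir) / f"p{index}"
        if not partition_dir.is_dir():
            problems.append((index, "missing"))
            continue
        # One directory listing per partition: any leftover WIP entry means
        # the encode job for this partition did not finish cleanly.
        wip = sorted(
            entry.name
            for entry in partition_dir.iterdir()
            if entry.name.startswith("wip_")
        )
        if wip:
            problems.append((index, "wip: " + ", ".join(wip)))
    return problems
```

Finalise would call this first and refuse to move anything if the returned list is non-empty, so a failed job can be rerun and finalise retried safely.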
Ok, let's try that in the first instance
I'll code it up later
Ah - we can just write a "partition-done" file, or rename the partition directory to do this. So yeah, we can definitely do a reasonable job of checking in a reasonable time.
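A minimal sketch of the marker-file variant, assuming each partition lives in its own directory named `p<index>` and that the marker is an empty file called `partition-done` (both names, and all function names here, are hypothetical):

```python
from pathlib import Path

DONE_MARKER = "partition-done"  # hypothetical marker file name


def mark_partition_done(partition_dir):
    # Intended to be the very last step of a partition's encode job.
    (Path(partition_dir) / DONE_MARKER).touch()


def missing_partitions(work_dir, num_partitions):
    # One existence check per partition, so O(partitions) in total.
    work_dir = Path(work_dir)
    return [
        index
        for index in range(num_partitions)
        if not (work_dir / f"p{index}" / DONE_MARKER).exists()
    ]


def check_ready_to_finalise(work_dir, num_partitions):
    # Fail before anything is moved, so finalise can be retried later.
    missing = missing_partitions(work_dir, num_partitions)
    if missing:
        raise FileNotFoundError(f"Partitions not fully encoded: {missing}")
```

Because the marker is only written after all arrays in the partition are complete, a crashed job leaves no marker behind and the check catches it without inspecting individual arrays.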
136 added the distributed encode operation, but only with very basic tests to check things work in the nominal case.
Needs some tests to be sure we do the right thing when things go wrong.