rabernat opened this issue 3 years ago
It's pretty common for parallel execution engines to retry failed tasks some number of times, e.g., for Cloud Dataflow: "Note: The Dataflow service retries failed tasks up to 4 times in batch mode, and an unlimited number of times in streaming mode. In batch mode, your job will fail; in streaming, it may stall indefinitely."
That said, it might be a bad idea to include this in rechunker. We do have a `robust_getitem`
function in xarray, which we use when loading remote datasets over a network:
https://github.com/pydata/xarray/blob/4f414f2d5eb2e5a12fb8ae1012c5ac7aa43b6f0b/xarray/backends/common.py#L41
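For executors without built-in retries, the same pattern can be reimplemented directly. Here is a minimal pure-Python sketch of an exponential-backoff wrapper in the spirit of `robust_getitem`; the name, defaults, and signature are illustrative, not xarray's actual API:

```python
import time


def robust_call(func, *args, max_retries=6, initial_delay=0.5,
                catch=(IOError,), **kwargs):
    """Call ``func``, retrying with exponential backoff on transient errors.

    Illustrative sketch of the pattern behind xarray's robust_getitem;
    the parameter names and defaults here are made up for this example.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except catch:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the failure
            # back off exponentially before the next attempt
            time.sleep(initial_delay * 2 ** attempt)
```

The same wrapper could guard any flaky read from object storage, independent of which execution engine schedules the task.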
When working with cloud object store data, it's common to get random task failures. These can usually be overcome with retries. With dask, retries can be requested via `plan.execute(retries=n)`. But what about other executors? Should we incorporate retries into rechunker's plans? If so, what's the best way to do this?
If not, can we work around this at the Prefect level and inject retries into our flows?
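One executor-agnostic option would be for rechunker to wrap each task callable in a retry decorator before handing it to the executor, so every backend benefits without needing scheduler support. A hypothetical sketch (`with_retries` is not an existing rechunker helper):

```python
import functools
import time


def with_retries(func, n=3, delay=0.1):
    """Return a wrapped version of ``func`` that retries up to ``n`` times.

    Hypothetical helper -- rechunker does not currently expose anything
    like this; it illustrates one way a plan could bake retries into its
    tasks regardless of which executor runs them.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(n + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == n:
                    raise  # retries exhausted: propagate the failure
                time.sleep(delay)  # brief pause before trying again
    return wrapper
```

A plan builder could then apply `with_retries` to each stage function before constructing the task graph, leaving the executors themselves unchanged.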