pangeo-data / rechunker

Disk-to-disk chunk transformation for chunked arrays.
https://rechunker.readthedocs.io/
MIT License

Retries with prefect (and other executors) #51

Open · rabernat opened this issue 3 years ago

rabernat commented 3 years ago

When working with cloud object store data, it's common to get sporadic task failures. These can usually be overcome with retries. With the Dask executor, retries can be enabled via plan.execute(retries=n). But what about other executors?

Should we incorporate retries into rechunker's plans? If so, what's the best way to do this?

If not, can we work around this at the Prefect level and inject retries into our flows?

shoyer commented 3 years ago

It's pretty common for parallel execution engines to retry failed tasks some number of times, e.g., for Cloud Dataflow: "Note: The Dataflow service retries failed tasks up to 4 times in batch mode, and an unlimited number of times in streaming mode. In batch mode, your job will fail; in streaming, it may stall indefinitely."

That said, it might not be a bad idea to include this in rechunker. We have a robust_getitem function for exactly this in xarray, which we use when loading remote datasets over a network: https://github.com/pydata/xarray/blob/4f414f2d5eb2e5a12fb8ae1012c5ac7aa43b6f0b/xarray/backends/common.py#L41