pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

Support skipping + retry recipes for failure recovery (aka "skipsies") #670

Open abarciauskas-bgse opened 9 months ago

abarciauskas-bgse commented 9 months ago

I understand that a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job without hitting some sort of connection issue or other unexpected error when opening a file.

It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions and logging the failure, and then to run a retry version of the same recipe that tries to fill in those gaps.

cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies")
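
To make the proposal above concrete, here is a rough sketch of the skip-then-retry idea in plain Python. It is not the pangeo-forge-recipes API: `write_with_skips`, `flaky_open`, and the dict standing in for the target store are all hypothetical names made up for illustration, and each source is assumed to contribute a fixed-length chunk.

```python
import logging
import math
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger("skipsies")


def write_with_skips(
    sources: Sequence[str],
    open_source: Callable[[str], List[float]],
    expected_len: int,
    store: Dict[str, List[float]],
) -> List[str]:
    """Write each source into `store`; on failure write NaNs and record it.

    Returns the list of sources that failed, so a later pass can retry them.
    """
    failed = []
    for src in sources:
        try:
            data = open_source(src)
        except Exception as exc:  # deliberately broad for this sketch
            logger.warning("skipping %s: %s", src, exc)
            data = [math.nan] * expected_len  # placeholder for the gap
            failed.append(src)
        store[src] = data
    return failed


# Hypothetical flaky opener: "b.nc" fails on its first attempt only.
attempts: Dict[str, int] = {}


def flaky_open(src: str) -> List[float]:
    attempts[src] = attempts.get(src, 0) + 1
    if src == "b.nc" and attempts[src] == 1:
        raise ConnectionError("simulated timeout")
    return [1.0, 2.0, 3.0]


store: Dict[str, List[float]] = {}

# First pass: every source gets a slot, failures are filled with NaN.
failed = write_with_skips(["a.nc", "b.nc", "c.nc"], flaky_open, 3, store)

# Retry pass: same routine, but only over the logged failures.
still_failed = write_with_skips(failed, flaky_open, 3, store)
```

The point of the design is that the retry pass reuses the same routine on the recorded failures, so it only overwrites the NaN placeholders and leaves already-written chunks alone. In a real recipe the failure list would need to be persisted somewhere (a log or manifest) rather than held in memory.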

norlandrhagen commented 9 months ago

Julius Busecke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on Dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.

ranchodeluxe commented 9 months ago

Ha, I had a similar ticket I closed yesterday 😄

I like the NaN route as a last resort.

Later today I plan to crosswalk what Flink/Beam have for checkpointing (which is another way to solve this), but that depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues for an auth-fronted S3 bucket. I will also compare against a public bucket.