Open abarciauskas-bgse opened 9 months ago
Julius Busecke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on Dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.
Ha, I had a similar ticket I closed yesterday 😄
I like the NaN route as a last resort
Later today I plan to crosswalk what Flink/Beam have for checkpointing (which is another way to solve this), but it depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues against an auth-fronted S3 bucket. Will also compare with a public bucket.
I understand a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job and not face some sort of connection issue or other unexpected error from opening a file.
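Since most of these connection errors are transient, one mitigation (independent of any runner-level checkpointing) is a simple retry with exponential backoff around each file open. A minimal sketch, assuming a hypothetical `with_retries` wrapper that is not part of pangeo-forge-recipes:

```python
import random
import time

def with_retries(fn, key, attempts=4, base_delay=0.5):
    """Hypothetical wrapper: retry a flaky file-open a few times with
    jittered exponential backoff before letting the error propagate."""
    for attempt in range(attempts):
        try:
            return fn(key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure
            # backoff grows ~0.5s, 1s, 2s, ... with random jitter
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))

# Usage: a stand-in opener that fails twice, then succeeds.
calls = {"n": 0}

def flaky_open(key):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient network error")
    return "data"

result = with_retries(flaky_open, "s3://bucket/file.nc", base_delay=0.0)
```

This only helps with transient errors, though; files that fail deterministically still need the skip-and-fill approach below.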
It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions, logging the failure, and then running a retry version of the same recipe that tries to fill in those gaps.
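The skip-and-fill idea could look something like the sketch below. This is not pangeo-forge-recipes API, just an illustration: `open_or_nan` and `flaky_opener` are hypothetical names, and a real recipe would produce xarray datasets rather than bare arrays.

```python
import logging
import numpy as np

logger = logging.getLogger("skipsies")

def open_or_nan(opener, shape, failed_keys, key):
    """Hypothetical helper: try to open one source file; on failure, log
    it, record the key for a later retry pass, and return a NaN array
    of the expected shape so the write can proceed."""
    try:
        return opener(key)
    except Exception as exc:
        logger.warning("skipping %s: %s", key, exc)
        failed_keys.append(key)
        return np.full(shape, np.nan)

# Usage: simulate one good source file and one that raises.
def flaky_opener(key):
    if key == "bad":
        raise ConnectionError("simulated network failure")
    return np.ones((2, 2))

failed = []
good = open_or_nan(flaky_opener, (2, 2), failed, "good")
bad = open_or_nan(flaky_opener, (2, 2), failed, "bad")
# `failed` now lists exactly the keys a follow-up "retry recipe" would
# re-fetch, and `bad` holds NaN placeholders in the expected dimensions.
```

The retry pass would then iterate over `failed` and overwrite just those regions of the target store.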
cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies")