Some AIP runs are currently getting stuck, running for hours and then failing with timeouts. This failure mode may be caused by general load on Hail/GCP; it doesn't appear to be due to bad data, though these runtimes are unprecedented.
Proposed Changes
Implements checkpoint resumption: if the target checkpoint already exists, the run resumes from it instead of recomputing
Should be followed by a pipeline job that deletes the checkpoints if this stage succeeds
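
For reviewers, the two changes above can be sketched roughly as follows. This is a minimal local-filesystem illustration, not the code in this PR: `resume_or_compute`, `delete_checkpoints`, the `MARKER` file name, and the `compute`/`write`/`read` callables are all hypothetical names chosen for the example. The marker check is an assumed safeguard so that a half-written checkpoint is not mistaken for a complete one.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

# Hypothetical completion marker; assumed to be written only after a full write.
MARKER = "_SUCCESS"


def resume_or_compute(
    checkpoint: Path,
    compute: Callable[[], T],
    write: Callable[[T, Path], None],
    read: Callable[[Path], T],
) -> T:
    """Return the checkpointed result if it is complete, else compute and save it."""
    checkpoint = Path(checkpoint)
    if (checkpoint / MARKER).exists():
        # A complete checkpoint already exists: resume from it.
        return read(checkpoint)
    if checkpoint.exists():
        # No marker means a partial write; discard it rather than resume from it.
        shutil.rmtree(checkpoint)
    result = compute()
    checkpoint.mkdir(parents=True)
    write(result, checkpoint)
    # Mark the checkpoint as complete only once the write has returned.
    (checkpoint / MARKER).touch()
    return result


def delete_checkpoints(paths: Iterable[Path]) -> None:
    """Cleanup step, intended to run only after the stage has succeeded."""
    for path in paths:
        shutil.rmtree(path, ignore_errors=True)
```

In the real pipeline the `read` and `write` callables would be the Hail equivalents, and `delete_checkpoints` would be a separate downstream job that depends on this stage succeeding.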
Considerations
Resuming from checkpoints has caused some issues in production, with runs resuming from bad data. I'll need to keep this in mind, but so far these runs have failed for scheduling reasons, not data reasons.