populationgenomics / talos

Rare Disease variant reanalysis tool

MIT License


Implement checkpoint resumption #362

Closed MattWellie closed 8 months ago

MattWellie commented 8 months ago

Fixes

Some AIP runs are currently getting stuck, running for hours and then failing due to timeouts. This failure mode might be caused by general busyness of Hail/GCP, and doesn't appear to be due to bad data (though these runtimes are unprecedented).

Proposed Changes

Implements checkpoint resumption: if the target checkpoint already exists, the stage resumes from it instead of recomputing it
Should be followed by a pipeline job that deletes the checkpoints once this stage succeeds
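
For illustration, the resume-then-clean-up pattern described above could be sketched roughly as follows. This is not the code from this PR: it uses plain JSON files on a local filesystem so that it runs anywhere, and `load_or_compute`, `delete_checkpoint`, `data.json` and the `_SUCCESS` marker convention are all hypothetical names chosen for the example. In a Hail pipeline the read and write steps would presumably be something like `hl.read_matrix_table` and `MatrixTable.checkpoint`, but that mapping is an assumption.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, Tuple

# Hypothetical marker file, written only after the data is fully written.
SUCCESS_MARKER = "_SUCCESS"


def checkpoint_is_complete(checkpoint_dir: Path) -> bool:
    """A checkpoint only counts if its completion marker exists."""
    return (checkpoint_dir / SUCCESS_MARKER).is_file()


def load_or_compute(
    checkpoint_dir: Path, compute: Callable[[], dict]
) -> Tuple[dict, bool]:
    """Return (result, resumed). Reuse a complete checkpoint, else recompute."""
    data_file = checkpoint_dir / "data.json"
    if checkpoint_is_complete(checkpoint_dir):
        # Resume: skip the expensive computation entirely.
        return json.loads(data_file.read_text()), True

    # No usable checkpoint (missing or partial): compute, then write it.
    result = compute()
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    data_file.write_text(json.dumps(result))
    # The marker goes last, so an interrupted write is never mistaken
    # for a complete checkpoint on the next attempt.
    (checkpoint_dir / SUCCESS_MARKER).touch()
    return result, False


def delete_checkpoint(checkpoint_dir: Path) -> None:
    """Follow-up job: remove the checkpoint once the stage has succeeded."""
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
```

Requiring a completion marker before resuming is one way to reduce the risk of picking up a partially written checkpoint. It does not protect against a checkpoint that was written completely but from bad input, which is why the clean-up job after a successful stage still matters.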

Considerations

This has caused some issues in production, with runs resuming from bad data. I'll need to keep this in mind, but so far runs have not failed for data reasons, only for scheduling reasons.