payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Graceful recovery from PBS error #43

Open aidanheerdegen opened 9 years ago

aidanheerdegen commented 9 years ago

It would be nice to be able to specify PBS error codes from which payu could gracefully recover and resubmit the job.

An example would be

Exit Status: -14 (MotherSuperior->SisterMoms communication failure)

There is another one where PBS just times out, but I can't find an example of that.

Others I see, which I guess would need manual intervention:

Exit Status: 271 (Linux Signal 15) # walltime exceeded
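As a rough sketch of the idea, the two exit statuses quoted above could be sorted into "resubmit" and "needs intervention" sets. The names `RESUBMIT_EXIT_CODES`, `INTERVENTION_EXIT_CODES` and `classify_exit_status` are hypothetical and not part of payu; the sets contain only the codes mentioned in this comment.

```python
# Hypothetical sketch: classify a PBS exit status as recoverable or not.
# Only the two exit statuses quoted in this issue are listed.
RESUBMIT_EXIT_CODES = {-14}      # MotherSuperior->SisterMoms communication failure
INTERVENTION_EXIT_CODES = {271}  # Linux Signal 15, walltime exceeded


def classify_exit_status(exit_status):
    """Return 'resubmit', 'stop' or 'unknown' for a PBS exit status."""
    if exit_status in RESUBMIT_EXIT_CODES:
        return "resubmit"
    if exit_status in INTERVENTION_EXIT_CODES:
        return "stop"
    # Anything not listed is left for the user to decide.
    return "unknown"
```

Keeping unlisted codes as "unknown" means nothing is resubmitted automatically unless it has been explicitly listed.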

aidanheerdegen commented 4 years ago

Learning from the approach in #241, there might be a nice general way to do this.

Define a new config.yaml option: error

error:
    stop:
        - error string indicating run cannot continue
    resubmit:
        - error string
        - another error string
        - a third error string
This allows multiple resubmit error strings to be defined, each of which is searched for in self.stderr_fname. If any of them is found, sweep and resubmit the run. There could also be an option like stop: if any of its strings is found, the run halts regardless of whether any resubmit strings are present.
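A minimal sketch of how that lookup might work, assuming the error section of config.yaml has already been loaded into a plain dict and the contents of the stderr file have been read into a string. The function name `decide_action` and its return values are hypothetical, not existing payu API.

```python
# Hypothetical sketch: decide what to do with a failed run based on the
# proposed "error" config section and the text of the job's stderr file.
def decide_action(stderr_text, error_config):
    """Return 'stop', 'resubmit' or 'none'.

    error_config is assumed to look like:
        {"stop": ["..."], "resubmit": ["...", "..."]}
    Either key may be missing or empty.
    """
    stop_strings = error_config.get("stop") or []
    resubmit_strings = error_config.get("resubmit") or []

    # stop strings take priority over resubmit strings
    if any(s in stderr_text for s in stop_strings):
        return "stop"
    if any(s in stderr_text for s in resubmit_strings):
        return "resubmit"
    return "none"
```

Checking the stop strings first gives the "halt regardless" behaviour described above; a caller would sweep and resubmit only when the result is "resubmit".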

Any other options required?

aidanheerdegen commented 4 months ago

This seems to be a more prevalent problem, affecting lower core count jobs too:

https://forum.access-hive.org.au/t/automatic-resubmission-in-payu-access-esm1-5/2092