Open aidanheerdegen opened 9 years ago
Learning from the approach in #241, there might be a nice general way to do this.
Define a new `config.yaml` option, `error`:

    error:
      stop:
        - error string indicating run cannot continue
      resubmit:
        - error string
        - another error string
        - a third error string

This allows for defining multiple `resubmit` error strings to search for in `self.stderr_fname`. If any of those are found then `sweep` and `resubmit` the run. Could also have options like `stop`, which, if found, halt the run regardless of the presence of `resubmit` strings.
Any other options required?
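
A minimal sketch of how the matching could work, assuming the `error` option has already been loaded into a plain dict and the contents of the stderr file have been read into a string. The function name `classify_stderr` and the sample strings are hypothetical and not part of payu; the only behaviour taken from the proposal is that `stop` strings win over `resubmit` strings.

```python
def classify_stderr(stderr_text, error_config):
    """Return 'stop', 'resubmit', or None for the given stderr contents.

    error_config is assumed to look like the proposed `error` option:
    {'stop': [...], 'resubmit': [...]}. Either key may be missing.
    """
    # Check 'stop' first so it halts the run regardless of
    # whether any 'resubmit' strings are also present.
    for action in ("stop", "resubmit"):
        for pattern in error_config.get(action, []):
            if pattern in stderr_text:
                return action
    return None


# Hypothetical example strings, for illustration only
example_config = {
    "stop": ["model blew up"],
    "resubmit": ["communication failure", "node went down"],
}

print(classify_stderr("PBS: communication failure", example_config))
print(classify_stderr("model blew up after communication failure", example_config))
print(classify_stderr("run completed", example_config))
```

Plain substring matching keeps the config easy to write; regular expressions would be a possible extension if the error messages vary between runs.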
This seems to be a more prevalent problem, affecting lower core count jobs too:
https://forum.access-hive.org.au/t/automatic-resubmission-in-payu-access-esm1-5/2092
Would be nice to specify some PBS error codes from which we would gracefully recover and resubmit the jobs.
An example would be:
Exit Status: -14 (MotherSuperior->SisterMoms communication failure)
There is another one where PBS just times out, but I can't find an example of that.
Others I see which I guess would need intervention:
Exit Status: 271 (Linux Signal 15) # walltime exceeded
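
As a rough sketch of the exit-code idea, assuming the job's PBS output contains a line of the form `Exit Status: <code>` as in the examples above. The names `parse_exit_status`, `should_resubmit` and `RESUBMIT_CODES` are hypothetical, and the set of recoverable codes would presumably come from `config.yaml` rather than being hard-coded.

```python
import re

# Matches lines such as "Exit Status: -14 (...)" or "Exit Status: 271 (...)"
EXIT_STATUS_RE = re.compile(r"Exit Status:\s*(-?\d+)")

# Assumed default: only the communication failure quoted above is
# treated as recoverable. 271 (walltime exceeded) is left out on purpose.
RESUBMIT_CODES = {-14}


def parse_exit_status(pbs_output):
    """Return the first exit status found in the text as an int, or None."""
    match = EXIT_STATUS_RE.search(pbs_output)
    return int(match.group(1)) if match else None


def should_resubmit(pbs_output, resubmit_codes=RESUBMIT_CODES):
    """True if the exit status is one we would recover from by resubmitting."""
    return parse_exit_status(pbs_output) in resubmit_codes


print(should_resubmit("Exit Status: -14 (MotherSuperior->SisterMoms communication failure)"))
print(should_resubmit("Exit Status: 271 (Linux Signal 15)"))
```

Codes not listed, and output with no exit status line at all, fall through to "do not resubmit", so anything unrecognised still needs intervention.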