aidanheerdegen closed this 4 years ago
Copying from https://arccss.slack.com/archives/C9SEH9PDH/p1585693594004900
If possible it would be good to pass the error code and number of retries to the script, so it would attempt resubmission only for appropriate errors and only a finite number of times...
Should be able to do this with environment variables. I'll look into it.
The userscripts hook for an error script is now available on gadi under conda/analysis3-20.01.
There is an example MOM6 config in this repo
https://github.com/aidanheerdegen/mom6_error_test
which shows how graceful resubmission after defined errors can be achieved.
In config.yaml, define a script to run on error and a command to remove the resubmit counter file when a run is successful:
https://github.com/aidanheerdegen/mom6_error_test/blob/master/config.yaml#L28-L30
Copy the resub.sh script:
https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh
Alter the outfile variable to match your model (access-om2.err in your case):
https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L5
Make sure the error messages you want to gracefully recover from and resubmit are set correctly:
https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L5
The second of those ("DT found with multiple inconsistent definitions") is just for testing purposes and should be deleted/replaced.
Set the maximum number of resubmissions:
https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L7
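The steps above can be sketched roughly as follows. This is not the contents of resub.sh, only a minimal illustration under assumptions: the function names (error_is_recoverable, next_resubmit_count, maybe_resubmit), the countfile and max_resubmissions variables, the RESUB_DRY_RUN switch, the default limit of 3 and the example error message are all hypothetical. Only outfile, logfile, resubmit.count, resubmit.log, PAYU_PATH and PAYU_N_RUNS are names taken from this thread.

```shell
#!/bin/bash
# Sketch only: a hypothetical error hook, not the actual resub.sh.

outfile="${outfile:-access-om2.err}"          # model stderr file to scan
countfile="${countfile:-resubmit.count}"      # keeps the resubmit count between jobs
logfile="${logfile:-resubmit.log}"            # where resubmit decisions are recorded
max_resubmissions="${max_resubmissions:-3}"   # assumed default, choose your own

# Messages considered safe to retry (illustrative). Elements are separated
# by whitespace only; do not put commas between them.
errors=(
  "Segmentation fault: address not mapped to object"
)

# Succeed if any listed message occurs in the given file.
error_is_recoverable() {
  local file="$1" error
  for error in "${errors[@]}"; do
    # -F fixed string, -q quiet, -s ignore a missing file; quotes keep the
    # whole message as a single pattern.
    if grep -qsF -- "${error}" "${file}"; then
      return 0
    fi
  done
  return 1
}

# Read the stored count (0 if there is no file), add one, store and print it.
next_resubmit_count() {
  local count=0
  if [ -f "${countfile}" ]; then
    count=$(cat "${countfile}")
  fi
  count=$((count + 1))
  echo "${count}" > "${countfile}"
  echo "${count}"
}

# Resubmit only for a listed error and only up to the configured limit.
maybe_resubmit() {
  if ! error_is_recoverable "${outfile}"; then
    echo "No recoverable error found in ${outfile}: not resubmitting" >> "${logfile}"
    return 1
  fi
  local count
  count=$(next_resubmit_count)
  if [ "${count}" -gt "${max_resubmissions}" ]; then
    echo "Resubmit limit (${max_resubmissions}) reached: not resubmitting" >> "${logfile}"
    return 1
  fi
  echo "Resubmission ${count} of ${max_resubmissions}" >> "${logfile}"
  # Resubmit command as quoted in this thread; skipped when the hypothetical
  # RESUB_DRY_RUN switch is set.
  if [ -z "${RESUB_DRY_RUN:-}" ]; then
    "${PAYU_PATH}/payu" run -n "${PAYU_N_RUNS}" >> "${logfile}"
  fi
}
```

A real hook would finish by calling maybe_resubmit. The counter file is expected to be removed by the success command configured in config.yaml, so the limit applies to consecutive failures rather than to the whole experiment.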
Awesome, thanks @aidanheerdegen, this looks super handy.
Just wondering how that interacts with payu run -n? e.g. if I do payu run -n 10 and the 3rd run dies, will it do something equivalent to payu sweep; payu run -n 8 to give me 10 runs in total?
I'm probably wrong, but I can't see anywhere in the payu code that decrements PAYU_N_RUNS to account for successful runs, so it looks like this is more like payu sweep; payu run -n 10, giving 12 runs overall...?
https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L47-L48
So it resubmits payu with the current number of runs left (PAYU_N_RUNS):
${PAYU_PATH}/payu run -n ${PAYU_N_RUNS} >> ${logfile}
so it should work seamlessly with payu run -n nruns.
payu does decrement the run counter and repopulate PAYU_N_RUNS, just not directly. IIRC.
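To make the arithmetic in the question concrete, here is a small sketch. It assumes, as described above but not verified here, that PAYU_N_RUNS holds the number of runs still to do, including the one that failed. The variable names requested, failed_run, completed and remaining are hypothetical.

```shell
#!/bin/bash
requested=10   # as in: payu run -n 10
failed_run=3   # the 3rd run dies

completed=$((failed_run - 1))          # runs that finished before the failure
remaining=$((requested - completed))   # runs still to do, including the failed one

echo "resubmit with: payu run -n ${remaining}"
echo "successful runs if every resubmitted run completes: $((completed + remaining))"
```

Under that assumption the resubmission is payu run -n 8, and the experiment still ends with 10 successful runs rather than 12.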
excellent, that's perfect :-)
So here it decrements it:
and here is the resubmit which uses the new value
aha! thanks, I didn't look very carefully
Anyway, this is tricky to test, so happy for you to take it, try and let me know if you're having any issues. Also happy to hear if it works.
All resubmit info should collect in resubmit.log. So you can check if it has been triggered.
Turns out it wasn't straightforward to use environment variables to save the resubmit state, so that is done in a file resubmit.count.
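The thread does not say why environment variables were awkward here. One general shell behaviour that may be relevant (an assumption, not a statement about payu) is that a child process cannot change its parent's environment, whereas a file persists between processes. A small demonstration, using a temporary file as a stand-in for resubmit.count:

```shell
#!/bin/bash
count=0
export count

# A child process receives a copy of the environment; its changes are lost
# when it exits.
bash -c 'count=$((count + 1)); export count'
echo "after child process, count=${count}"

# A file keeps the updated value for whichever process reads it next.
statefile=$(mktemp)   # stand-in for resubmit.count
echo 0 > "${statefile}"
bash -c 'n=$(cat "$1"); echo $((n + 1)) > "$1"' _ "${statefile}"
echo "after child process, file holds $(cat "${statefile}")"
```

The first echo still reports 0, while the file holds 1.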
This looks like a very nice feature; we were just discussing the need for something like this. Looking forward to it!
Look forward no longer. It is there ... but hand-rolled. Now that I've made that script and thought about it, actually a general approach wouldn't be too difficult. I'll head over to #43 and rough out some ideas. Would be great to have more input. hint hint
I've added this to the ak-dev branch for all IAF configs.
@aidanheerdegen It seems the ${error} variable expansion in this line loses the enclosing quotes
https://github.com/aidanheerdegen/mom6_error_test/blob/c4ee9706d50f057c5af785b6395ba4b54fe87728/resub.sh#L19
so all but the first word gets treated as a file argument by grep, giving an output like this:
grep: fault:: No such file or directory
grep: address: No such file or directory
grep: not: No such file or directory
grep: mapped: No such file or directory
grep: to: No such file or directory
grep: object,: No such file or directory
"${error}" seems to work as a grep argument (line 19), so long as the trailing comma is removed from line 12:
https://github.com/aidanheerdegen/mom6_error_test/blob/c4ee9706d50f057c5af785b6395ba4b54fe87728/resub.sh#L12
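Both effects can be reproduced in isolation. This sketch uses a temporary file and an example message reconstructed from the grep output above, so the exact string in resub.sh may differ. The names errfile, with_comma, without_comma and unquoted_output are illustrative only.

```shell
#!/bin/bash
errfile=$(mktemp)
echo "Segmentation fault: address not mapped to object" > "${errfile}"

# Bash array elements are separated by whitespace. A comma after the closing
# quote becomes part of the element itself.
with_comma=("Segmentation fault: address not mapped to object", "another message")
without_comma=("Segmentation fault: address not mapped to object" "another message")
echo "first element with comma:    [${with_comma[0]}]"
echo "first element without comma: [${without_comma[0]}]"

error="${without_comma[0]}"

# Unquoted: the shell splits the message into words, so grep takes the first
# word as the pattern and treats the remaining words as file names.
unquoted_output=$(grep -c ${error} "${errfile}" 2>&1 || true)
echo "${unquoted_output}"

# Quoted: the whole message reaches grep as a single pattern.
grep -c "${error}" "${errfile}"
```

With the quotes in place, the element that still carries the trailing comma would no longer match the file, which is why the comma also has to go.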
Thanks for pointing that out @aekiss. I have fixed it in my repo also.
As an aside, I'm not sure the resubmit stuff is necessary at 1 deg as it doesn't seem to run into the same problems.
yeah but I thought it was harmless and would be there if ever needed...
Possibly a useful hook to have in any case, but could be used as a temporary fix until https://github.com/payu-org/payu/issues/43 is resolved.
Users can add their own failure user script, which can decide to resubmit under certain conditions. This is useful for the ACCESS-OM2-01 model, which is experiencing frequent random segfaults that do not recur on resubmission. See https://github.com/COSIMA/access-om2/issues/193.