payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Add user script to run on failure #241

Closed aidanheerdegen closed 4 years ago

aidanheerdegen commented 4 years ago

Possibly a useful hook to have in any case, but could be used as a temporary fix until https://github.com/payu-org/payu/issues/43 is resolved.

Users can add their own failure user script, which can decide to resubmit under certain conditions. This is useful for the ACCESS-OM2-01 model, which is experiencing frequent random segfaults that do not recur on resubmission. See https://github.com/COSIMA/access-om2/issues/193.

aekiss commented 4 years ago

Copying from https://arccss.slack.com/archives/C9SEH9PDH/p1585693594004900

If possible it would be good to pass the error code and number of retries to the script, so it would attempt resubmission only for appropriate errors and only a finite number of times...

aidanheerdegen commented 4 years ago

Should be able to do this with environment variables. I'll look into it.

aidanheerdegen commented 4 years ago

The userscripts hook for an error script is now available on gadi under conda/analysis3-20.01.

There is an example MOM6 config in this repo

https://github.com/aidanheerdegen/mom6_error_test

which shows how graceful resubmission after defined errors can be achieved.

In config.yaml define a script to run on error and a command to remove the resubmit counter file when a run is successful:

https://github.com/aidanheerdegen/mom6_error_test/blob/master/config.yaml#L28-L30

Copy the resub.sh script

https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh

Alter the outfile variable to match your model (access-om2.err in your case)

https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L5

Make sure the error messages you want to gracefully recover from and resubmit are set correctly:

https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L5

The second of those ("DT found with multiple inconsistent definitions") is just for testing purposes and should be deleted/replaced.
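For illustration, the error-matching step could look roughly like the sketch below. This is not the actual resub.sh: the function name `has_known_error` and the array name `errors` are made up here, and the first message is reconstructed from the grep output quoted later in this thread, so check it against your own model's error file.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the resub.sh from the linked repository.

# Messages that are considered safe to recover from by resubmitting.
# The first is an assumed wording of the segfault message; the second is
# the test-only message mentioned above and should be deleted/replaced.
errors=(
    "Segmentation fault: address not mapped to object"
    "DT found with multiple inconsistent definitions"
)

# Return 0 if any known message appears in the given file, 1 otherwise.
has_known_error() {
    local outfile="$1" error
    for error in "${errors[@]}"; do
        # -q: no output, -F: match the message as a fixed string, not a regex
        if grep -qF -- "${error}" "${outfile}"; then
            return 0
        fi
    done
    return 1
}
```

It would be called with the model's error file, e.g. `has_known_error access-om2.err && echo "recoverable"`.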

Set the maximum number of resubmissions

https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L7

aekiss commented 4 years ago

Awesome, thanks @aidanheerdegen, this looks super handy.

Just wondering how that interacts with payu run -n?

e.g. if I do payu run -n 10 and the 3rd run dies, will it do something equivalent to payu sweep; payu run -n 8 to give me 10 runs in total?

I'm probably wrong, but I can't see anywhere in the payu code that decrements PAYU_N_RUNS to account for successful runs, so it looks like this is more like payu sweep; payu run -n 10, giving 12 runs overall...? https://github.com/aidanheerdegen/mom6_error_test/blob/master/resub.sh#L47-L48

aidanheerdegen commented 4 years ago

So it resubmits payu with the current number of runs left (PAYU_N_RUNS)

${PAYU_PATH}/payu run -n ${PAYU_N_RUNS} >> ${logfile}

so it should work seamlessly with payu run -n nruns
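A slightly more defensive version of that resubmit line might look like the sketch below. Only the `payu run -n` command itself is taken from the script; the wrapper function `resubmit`, its default log file name and the check for unset variables are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the resubmit command shown above.
resubmit() {
    local logfile="${1:-resubmit.log}"
    # Do nothing if the variables the command relies on are missing.
    if [ -z "${PAYU_PATH:-}" ] || [ -z "${PAYU_N_RUNS:-}" ]; then
        echo "PAYU_PATH or PAYU_N_RUNS not set; not resubmitting" >> "${logfile}"
        return 1
    fi
    # Resubmit with the number of runs still remaining, logging all output.
    "${PAYU_PATH}/payu" run -n "${PAYU_N_RUNS}" >> "${logfile}" 2>&1
}
```

Quoting the variables guards against paths containing spaces, and capturing stderr as well keeps any payu error messages in the same log.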

aidanheerdegen commented 4 years ago

payu does decrement the run counter and repopulate PAYU_N_RUNS, just not directly. IIRC.

aekiss commented 4 years ago

excellent, that's perfect :-)

aidanheerdegen commented 4 years ago

So here it decrements it:

https://github.com/payu-org/payu/blob/11f27d24019332f8317cf62e49825a3955ebb4ae/payu/experiment.py#L694

and here is the resubmit which uses the new value

https://github.com/payu-org/payu/blob/11f27d24019332f8317cf62e49825a3955ebb4ae/payu/experiment.py#L886

aekiss commented 4 years ago

aha! thanks, I didn't look very carefully

aidanheerdegen commented 4 years ago

Anyway, this is tricky to test, so I'm happy for you to take it, try it, and let me know if you have any issues. Also happy to hear if it works.

All resubmit info should be collected in resubmit.log, so you can check there whether it has been triggered.

aidanheerdegen commented 4 years ago

Turns out it wasn't straightforward to use environment variables to save the resubmit state, so that is done in a file resubmit.count.
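As a rough sketch of the idea (not the code in resub.sh), a file-based counter could work like this. The function name `bump_resubmit_count` and the default limit of 3 are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a file-based resubmit counter.
# Returns 0 and records the new count if another resubmission is allowed,
# returns 1 (leaving the file untouched) once the limit has been reached.
bump_resubmit_count() {
    local countfile="${1:-resubmit.count}" max_resubs="${2:-3}" count=0
    # A missing file means no resubmissions have happened yet.
    if [ -f "${countfile}" ]; then
        count=$(cat "${countfile}")
    fi
    count=$((count + 1))
    if [ "${count}" -gt "${max_resubs}" ]; then
        return 1
    fi
    echo "${count}" > "${countfile}"
    return 0
}
```

Because the state lives in a file, it survives between separate job submissions, and deleting the file after a successful run resets the count.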

marshallward commented 4 years ago

This looks like a very nice feature; we were just discussing the need for something like this. Looking forward to it!

aidanheerdegen commented 4 years ago

Look forward no longer. It is there ... but hand-rolled. Now that I've made that script and thought about it, actually a general approach wouldn't be too difficult. I'll head over to #43 and rough out some ideas. Would be great to have more input. hint hint

aekiss commented 4 years ago

I've added this to the ak-dev branch for all IAF configs

aekiss commented 4 years ago

@aidanheerdegen It seems the ${error} variable expansion in this line loses the enclosing quotes https://github.com/aidanheerdegen/mom6_error_test/blob/c4ee9706d50f057c5af785b6395ba4b54fe87728/resub.sh#L19 so every word after the first is treated as a file argument by grep, giving output like this:

grep: fault:: No such file or directory
grep: address: No such file or directory
grep: not: No such file or directory
grep: mapped: No such file or directory
grep: to: No such file or directory
grep: object,: No such file or directory

aekiss commented 4 years ago

"${error}" seems to work as a grep argument (line 19), so long as the trailing comma is removed from line 12 https://github.com/aidanheerdegen/mom6_error_test/blob/c4ee9706d50f057c5af785b6395ba4b54fe87728/resub.sh#L12
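A small standalone demonstration of the word splitting involved, using a made-up helper `count_args` and an assumed wording of the error message (without the trailing comma):

```shell
#!/usr/bin/env bash
# Assumed wording of the message; adjust to match your model's output.
error="Segmentation fault: address not mapped to object"

# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

count_args ${error}      # unquoted: split on whitespace, prints 7
count_args "${error}"    # quoted: passed as a single argument, prints 1
```

With grep, the unquoted form therefore searches for "Segmentation" in files named "fault:", "address" and so on, whereas `grep "${error}" file` passes the whole message as the pattern.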

aidanheerdegen commented 4 years ago

Thanks for pointing that out @aekiss. I have fixed it in my repo as well.

As an aside, I'm not sure the resubmit stuff is necessary at 1 deg as it doesn't seem to run into the same problems.

aekiss commented 4 years ago

yeah but I thought it was harmless and would be there if ever needed...