error handling - Githubissues

radical-cybertools / ExTASY

MDEnsemble

Other


error handling #224

Closed vivek-bala closed 8 years ago

vivek-bala commented 8 years ago

From issue #223 :

The 1st CU unit.000000/STDERR contains an error: ModuleCmd_Load.c(226):ERROR:105: Unable to locate a modulefile for 'python' and so the CU should be failing, but according to enmd, the pre_loop step was completed successfully. There is some error handling missing here.
The next CUs, unit.000001 through 000008, are full of errors: "run.sh: line 22: grompp: command not found", "run.sh: line 23: mdrun: command not found", "cat: confout.gro: No such file or directory". However, according to enmd, these CUs are again seen as successful, so the error handling here needs to be improved.

vivek-bala commented 8 years ago

In the first CU, the particular module is not found and hence the error is written to STDERR. But execution goes forward with the default python loaded on login (which seems to be sufficient).

In the gromacs CUs, the output from gromacs is written to STDERR whether or not the execution is successful. So without analysing STDERR on the client side (some text wrangling would be required, possibly to pick up any error code generated or to search for "error"), I am not sure it's possible to distinguish between the two. Any ideas?

vivek-bala commented 8 years ago

Previously, I tried to stop execution when there was some content in STDERR, but since gromacs (and possibly other kernels) write their normal output to STDERR, it wouldn't be correct to stop execution just because there is "some" content in STDERR.

ibethune commented 8 years ago

OK, I think the most robust way to handle error cases is to test the return code from the application, rather than checking STDERR (either that it is non-empty, or doing application-specific grepping). If the return code is non-zero then the CU should fail.
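To illustrate the point, here is a minimal sketch (not ExTASY or RADICAL-Pilot code) of judging success by exit code alone. The helper name `run_and_check` and the two child commands are hypothetical; the children are small Python one-liners standing in for a tool that logs to STDERR but succeeds, and for one that genuinely fails.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run cmd and decide success from the exit code only (hypothetical helper).

    STDERR is captured and returned for logging, but it plays no part
    in the success decision.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.returncode, proc.stderr


# Stand-in for a tool that writes progress to STDERR yet exits with 0.
noisy_ok = [sys.executable, "-c",
            "import sys; sys.stderr.write('progress info\\n')"]

# Stand-in for a tool that writes to STDERR and exits with a non-zero code.
broken = [sys.executable, "-c",
          "import sys; sys.stderr.write('fatal\\n'); sys.exit(3)"]
```

Both commands leave content in STDERR, so a "non-empty STDERR" test cannot tell them apart, while the return code can.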

In the cases above, in the preloop CU we are doing the right thing - checking the exit code from spliter.py and exiting with that code.

In the gromacs CUs the radical_pilot_cu_launch_script.sh does the right thing but the run.py wrapper launches grompp, mdrun etc. without checking the error codes, so we should fix that.
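A rough sketch of what such a fix could look like, assuming the wrapper is a Python script that runs its steps in order. This is not the actual ExTASY wrapper: `run_steps` is a hypothetical name, and the use of 127 for a missing executable simply follows the usual shell convention for "command not found".

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order and stop at the first failure.

    Returns 0 if every step succeeds, otherwise the exit code of the
    first failing step, so the caller can pass it to sys.exit().
    """
    for cmd in steps:
        try:
            rc = subprocess.call(cmd)
        except FileNotFoundError:
            # The executable is not on PATH (the "command not found" case).
            # 127 mirrors the shell convention; this choice is an assumption.
            print("command not found: %s" % cmd[0], file=sys.stderr)
            return 127
        if rc != 0:
            # Do not run later steps on top of a failed one.
            return rc
    return 0


# Hypothetical use at the end of a wrapper (arguments omitted):
#   sys.exit(run_steps([["grompp"], ["mdrun"]]))
```

With this pattern a missing grompp would end the wrapper with a non-zero exit code, which the launch script can then report as a failed CU instead of carrying on to mdrun.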

vivek-bala commented 8 years ago

I'll be moving away from this wrapper method, so this shouldn't come up once that is done.

vivek-bala commented 8 years ago

Keep the wrapper method as an example. Add the 1sim/1CU method as an example as well. Depending on the number of simulations and the simulation length, the user can choose either method.