schism-dev / schism

Semi-implicit Cross-scale Hydroscience Integrated System Model (SCHISM)
http://ccrm.vims.edu/schismweb/
Apache License 2.0

Exit code is zero even when simulator fails #125

Open AugustoPeres opened 4 months ago

AugustoPeres commented 4 months ago

Hi there,

I have recently started using the schism simulator and noticed that the exit code is zero even when the simulator fails:

root@a28729b6320d:/Test_Convergence_Grid1# mpirun --np 3 /schism/build/bin/pschism  
 Must have at least 1 cmd argument: # of scribes to run, or -v for version.
 Must have at least 1 cmd argument: # of scribes to run, or -v for version.
 Must have at least 1 cmd argument: # of scribes to run, or -v for version.
root@a28729b6320d:/Test_Convergence_Grid1# echo $?
0

Is there any way to have the exit code reflect the fact that the simulation failed?

pmav99 commented 4 months ago

Yeah, we've also had problems with this. As a workaround, we parse the stdout and stderr output and apply some heuristics to determine whether there was an error after all.

For the record, handling this can be even more complicated because the error codes depend on the MPI implementation, too. For instance, on some tests we did with an older schism version (5.9):

openmpi + mpirun -n 8 schism -> error code 0
mpich + mpirun -n 8 schism -> error code 0 or 9 - about 50-50 between them

Now, this might be an issue with openmpi/mpich, but it could also stem from the way schism's MPI code has been implemented. I haven't really looked deeper into it.
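For illustration, a minimal sketch of this kind of workaround is given below. It is not the set of heuristics actually used above. The function names `looks_failed` and `run_checked` and the pattern list are assumptions made for the example; only the first pattern is taken from the output quoted in this issue. The wrapper treats any non-zero exit code as a failure and, because a zero exit code cannot be trusted here, also scans stdout and stderr for failure messages.

```python
import re
import subprocess
import sys

# Hypothetical patterns. Only the first one comes from the output quoted in
# this issue; the others are guesses and may produce false positives.
FAILURE_PATTERNS = [
    r"Must have at least 1 cmd argument",
    r"\bABORT\b",
    r"\berror\b",
]


def looks_failed(returncode, stdout, stderr, patterns=FAILURE_PATTERNS):
    """Return True if the run should be treated as failed."""
    # A non-zero exit code is always a failure.
    if returncode != 0:
        return True
    # A zero exit code is not trusted: scan the captured output as well.
    text = f"{stdout}\n{stderr}"
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def run_checked(cmd):
    """Run cmd, echo its output, and return an exit code that reflects failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    sys.stdout.write(proc.stdout)
    sys.stderr.write(proc.stderr)
    if looks_failed(proc.returncode, proc.stdout, proc.stderr):
        # Keep the original code if it was non-zero, otherwise report 1.
        return proc.returncode or 1
    return 0
```

A wrapper script could then end with `sys.exit(run_checked(["mpirun", "--np", "3", "/schism/build/bin/pschism"]))`, so that the calling shell sees a non-zero `$?` for the run shown at the top of this issue. The pattern list would need tuning against real output, since a broad pattern such as `error` can also match harmless log lines.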

AugustoPeres commented 4 months ago

@pmav99, thank you very much for your reply.

We will take a look at how to parse the stdout and stderr to detect failed simulations. Could you share a little bit more on the heuristics that you are using to catch failed simulations?

However, it would be great if this worked out of the box :)

josephzhang8 commented 4 months ago

The error says you need to specify # of scribe processes; see online manual.