Open computron opened 7 years ago
What happens if you set max_errors to a larger number and terminate_on_nonzero to False? I need to know why this happens. Is your max_errors set to 2?
The max_errors should be 5, and other jobs have completed after 3 errors. See, e.g.:
/projects/ps-matqm/prod_runs/block_2016-10-21-19-00-21-067631/launcher_2016-12-05-17-18-15-093034/launcher_2016-12-15-12-48-26-702546
for an example of a run on the same infrastructure that completed successfully after 3 errors.
In this case, I think it is stopping at 2 errors because the same error is repeated, and custodian is smart enough not to retry the same fix all 5 times.
I tried looking at the code, but for EDDRMM errors there is no "repeated" check, unlike for other errors. In fact, EDDRMM errors always result in a corrective action being returned. The vasp.out seems to be untouched, even though the INCAR ALGO has changed. If I had to speculate, the second time around VASP didn't run at all and exited immediately, which resulted in the
Note - to answer @xhqu1981 's question (which I somehow don't see here):
There are both a std_error.txt and a std_error.txt.gz. The former is empty. The latter looks like this:
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B0D71548A84 Unknown Unknown Unknown
libopen-pal.so.6 00002B0D71D6E45B Unknown Unknown Unknown
libmpi.so.1 00002B0D714CACB1 Unknown Unknown Unknown
libmpi.so.1 00002B0D715C90AE Unknown Unknown Unknown
libmpi.so.1 00002B0D715CF6D2 Unknown Unknown Unknown
libmpi.so.1 00002B0D714DFD6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B0D7122E4EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4DFAA94 Unknown Unknown Unknown
libopen-pal.so.6 00002B8CE562045B Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D7CCB1 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E7B0AE Unknown Unknown Unknown
libmpi.so.1 00002B8CE4E816D2 Unknown Unknown Unknown
libmpi.so.1 00002B8CE4D91D6F Unknown Unknown Unknown
libmpi_mpifh.so.2 00002B8CE4AE04EA Unknown Unknown Unknown
vasp.std 0000000000416628 Unknown Unknown Unknown
vasp.std 000000000056F71A Unknown Unknown Unknown
vasp.std 000000000057ABC9 Unknown Unknown Unknown
vasp.std 0000000000DD1B80 Unknown Unknown Unknown
vasp.std 0000000000E54EBB Unknown Unknown Unknown
vasp.std 000000000152BC27 Unknown Unknown Unknown
vasp.std 0000000000411FF6 Unknown Unknown Unknown
libc.so.6 000000353061ED5D Unknown Unknown Unknown
vasp.std 0000000000411EE9 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp.std 0000000001791329 Unknown Unknown Unknown
vasp.std 000000000178FBFE Unknown Unknown Unknown
vasp.std 0000000001715FA2 Unknown Unknown Unknown
vasp.std 00000000016C30B3 Unknown Unknown Unknown
vasp.std 00000000016C8D79 Unknown Unknown Unknown
libpthread.so.0 000000353120F7E0 Unknown Unknown Unknown
libmkl_avx.so 00002ACD81829C54 Unknown Unknown Unknown
libmkl_avx.so 00002ACD8184AC6A Unknown Unknown Unknown
libmkl_avx.so 00002ACD818230D4 Unknown Unknown Unknown
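A trace like the one above is easier to compare across runs once it is summarized. The following is a minimal sketch, assuming the gzipped file contains plain text in the same layout as shown here; the function name `summarize_trace` is hypothetical and not part of custodian or VASP. It counts the SIGTERM reports and tallies which image (binary or shared library) each frame belongs to.

```python
import gzip
from collections import Counter

# Marker copied from the trace above.
SIGTERM_MARKER = "forrtl: error (78): process killed (SIGTERM)"
HEX_DIGITS = set("0123456789abcdefABCDEF")


def summarize_trace(path):
    """Return (number of SIGTERM reports, Counter of image names seen in frames).

    Hypothetical helper: assumes frame lines look like
    "<image> <16-hex-digit PC> Unknown Unknown Unknown".
    """
    sigterm_count = 0
    images = Counter()
    with gzip.open(path, "rt", errors="replace") as fh:
        for raw in fh:
            line = raw.strip()
            if line.startswith(SIGTERM_MARKER):
                sigterm_count += 1
                continue
            parts = line.split()
            # Skip header lines such as "Image PC Routine Line Source".
            if len(parts) >= 2 and len(parts[1]) == 16 and set(parts[1]) <= HEX_DIGITS:
                images[parts[0]] += 1
    return sigterm_count, images
```

Calling `summarize_trace("std_error.txt.gz")` in the run directory would show at a glance how many processes reported SIGTERM and whether the frames sit mostly in `vasp.std`, the MPI libraries, or MKL.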
Thanks a lot, @computron. I was wondering whether this was similar to an issue I saw in my test. After carefully rereading your report, I noticed that your platform is Linux, which is not an OS where I would expect that issue. As a result, I withdrew the comment yesterday.
To avoid confusing other people, I am duplicating my comment here. I was asking whether std_err printed the line:
"srun: error: Unable to create job step: Job/step already completing or completed"
That line would be some evidence that VASP failed to launch.
@computron's current std_err.txt is empty, so I don't think std_err provides any evidence about the status of VASP in this situation. I am sorry this is not a helpful clue.
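For anyone who wants to run the same check on their own runs, here is a minimal sketch. It assumes the stderr file is plain text, optionally gzipped, and the helper name `find_launch_failure` is hypothetical. It distinguishes an empty or missing file (no evidence either way) from a file that was read and did or did not contain the srun line.

```python
import gzip
import os

# Prefix of the srun message quoted above.
SRUN_MARKER = "srun: error: Unable to create job step"


def find_launch_failure(path):
    """Return True if the srun marker is present, False if the file has
    content but no marker, and None if the file is missing or has zero bytes.

    Hypothetical helper. Note that a gzipped file with no content still has
    a non-zero size on disk, so it yields False rather than None.
    """
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as fh:
        return any(SRUN_MARKER in line for line in fh)
```

A `None` result corresponds to the situation described here: the file is empty, so it says nothing about whether VASP launched.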
System
Summary
Error message
The stack trace I get back is:
As you can see, it is difficult to tell from the above that max_errors was reached and that this is why we are exiting. You can figure it out, though, if you look at custodian.py line 323.
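To illustrate the kind of message that would make this obvious, here is a minimal sketch of a retry loop that names the exhausted error budget when it gives up. It is not custodian's actual code: `MaxErrorsReached`, `run_with_error_budget`, and the `attempt` callable are all hypothetical names used only for illustration.

```python
class MaxErrorsReached(RuntimeError):
    """Hypothetical exception; not part of custodian's API."""


def run_with_error_budget(attempt, max_errors):
    """Call attempt() until it succeeds or the error budget is spent.

    attempt is a hypothetical callable returning True on success and
    False when an error was detected (and presumably corrected).
    Returns the number of errors seen before success.
    """
    errors = 0
    while True:
        if attempt():
            return errors
        errors += 1
        if errors >= max_errors:
            # State the reason for exiting explicitly in the message.
            raise MaxErrorsReached(
                f"Stopped after {errors} errors (max_errors={max_errors})."
            )
```

With an exception like this, the last line of the stack trace would state that the limit was hit, rather than requiring a look at the source.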
Files
The run is located in : /projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609
Suggested solution (if known)