materialsproject / custodian

A simple, robust and flexible just-in-time job management framework in Python.
MIT License

non-ideal error message for max_errors hit #43

Open computron opened 7 years ago

computron commented 7 years ago

System

Summary

Error message

The stack trace I get back is:

Traceback (most recent call last):
  File "/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py", line 167, in run_task
    c.run()
  File "/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...

From the message above, it is hard to tell that max_errors was reached and that this is why the run exited. You can only work it out by looking at custodian.py, line 323.
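One way a message could make the cause explicit is sketched below. This is a hypothetical illustration, not custodian's actual code: the names `MaxErrorsReachedError` and `check_error_budget` are invented here, and only `max_errors` and `total_errors` come from the report above.

```python
class MaxErrorsReachedError(RuntimeError):
    """Hypothetical exception that names the limit that was hit."""

    def __init__(self, total_errors, max_errors, last_error):
        self.total_errors = total_errors
        self.max_errors = max_errors
        self.last_error = last_error
        # State the limit and the count explicitly so the reason for
        # exiting is visible without reading the source.
        super().__init__(
            "max_errors reached: {} of {} allowed errors used. "
            "Last error: {}".format(total_errors, max_errors, last_error)
        )


def check_error_budget(total_errors, max_errors, last_error):
    """Raise a descriptive error once the error count reaches the limit."""
    if total_errors >= max_errors:
        raise MaxErrorsReachedError(total_errors, max_errors, last_error)
```

Keeping the count, the limit and the last error as attributes would also let calling code (for example a workflow manager) inspect them without parsing the message string.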

Files

The run is located in : /projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609

Suggested solution (if known)

shyuep commented 7 years ago

What happens if you set max_errors to a larger number and terminate_on_nonzero to False? I need to know why this happens. Is your max_errors == 2?

computron commented 7 years ago

max_errors should be 5, and there are other jobs that completed after 3 errors. For example, see:

/projects/ps-matqm/prod_runs/block_2016-10-21-19-00-21-067631/launcher_2016-12-05-17-18-15-093034/launcher_2016-12-15-12-48-26-702546

for an example of a run on the same infrastructure that completed successfully after 3 errors.

In this case, I think it is stopping at 2 errors because the same error is repeated, and custodian is smart enough not to keep trying the same fix all 5 times.

shyuep commented 7 years ago

I looked at the code, but for EDDRMM errors there is no "repeated" check, unlike for other errors. In fact, EDDRMM errors always result in a corrective action being returned. The vasp.out seems to be untouched, even though the ALGO setting in the INCAR has changed. If I have to speculate, the second time round, VASP didn't run at all and immediately exited, which result in the
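The scenario speculated about above can be illustrated with a toy retry loop. This is only a simplified model written for this thread, not custodian's implementation: `run_with_retries` and its arguments are invented, and `max_errors` and `terminate_on_nonzero` simply borrow the names used in the comments above.

```python
def run_with_retries(return_codes, handler_fires, max_errors=5,
                     terminate_on_nonzero=True):
    """Toy retry loop; returns (outcome, total_errors).

    return_codes[i]  : exit code of attempt i
    handler_fires[i] : True if an error handler applied a fix after attempt i
    """
    total_errors = 0
    for code, fired in zip(return_codes, handler_fires):
        if fired:
            # A handler detected an error and applied a correction.
            total_errors += 1
            if total_errors >= max_errors:
                return "max_errors reached", total_errors
            continue  # retry the job with the corrected input
        if code != 0 and terminate_on_nonzero:
            # No handler recognised the failure, so the run stops here,
            # well before the error budget is used up.
            return "nonzero return code", total_errors
        return "finished", total_errors
    return "no attempts left", total_errors


# First attempt: a handler fires. Second attempt: the job exits at once
# with code 1 and no handler recognises anything.
print(run_with_retries([1, 1], [True, False]))
```

In this model the run stops with only one counted error, far below a limit of 5, which is at least consistent with a "1 errors reached ... Job return code is 1" message. Whether this is what actually happened in the reported run is not established here.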

computron commented 7 years ago

Note - to answer @xhqu1981 's question (which I somehow don't see here):

There is both a std_error.txt and std_error.txt.gz. The former is empty. The latter looks like below:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source
vasp.std           0000000001791329  Unknown  Unknown  Unknown
vasp.std           000000000178FBFE  Unknown  Unknown  Unknown
vasp.std           0000000001715FA2  Unknown  Unknown  Unknown
vasp.std           00000000016C30B3  Unknown  Unknown  Unknown
vasp.std           00000000016C8D79  Unknown  Unknown  Unknown
libpthread.so.0    000000353120F7E0  Unknown  Unknown  Unknown
libmpi.so.1        00002B0D71548A84  Unknown  Unknown  Unknown
libopen-pal.so.6   00002B0D71D6E45B  Unknown  Unknown  Unknown
libmpi.so.1        00002B0D714CACB1  Unknown  Unknown  Unknown
libmpi.so.1        00002B0D715C90AE  Unknown  Unknown  Unknown
libmpi.so.1        00002B0D715CF6D2  Unknown  Unknown  Unknown
libmpi.so.1        00002B0D714DFD6F  Unknown  Unknown  Unknown
libmpi_mpifh.so.2  00002B0D7122E4EA  Unknown  Unknown  Unknown
vasp.std           0000000000416628  Unknown  Unknown  Unknown
vasp.std           000000000056F71A  Unknown  Unknown  Unknown
vasp.std           000000000057ABC9  Unknown  Unknown  Unknown
vasp.std           0000000000DD1B80  Unknown  Unknown  Unknown
vasp.std           0000000000E54EBB  Unknown  Unknown  Unknown
vasp.std           000000000152BC27  Unknown  Unknown  Unknown
vasp.std           0000000000411FF6  Unknown  Unknown  Unknown
libc.so.6          000000353061ED5D  Unknown  Unknown  Unknown
vasp.std           0000000000411EE9  Unknown  Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source
vasp.std           0000000001791329  Unknown  Unknown  Unknown
vasp.std           000000000178FBFE  Unknown  Unknown  Unknown
vasp.std           0000000001715FA2  Unknown  Unknown  Unknown
vasp.std           00000000016C30B3  Unknown  Unknown  Unknown
vasp.std           00000000016C8D79  Unknown  Unknown  Unknown
libpthread.so.0    000000353120F7E0  Unknown  Unknown  Unknown
libmpi.so.1        00002B8CE4DFAA94  Unknown  Unknown  Unknown
libopen-pal.so.6   00002B8CE562045B  Unknown  Unknown  Unknown
libmpi.so.1        00002B8CE4D7CCB1  Unknown  Unknown  Unknown
libmpi.so.1        00002B8CE4E7B0AE  Unknown  Unknown  Unknown
libmpi.so.1        00002B8CE4E816D2  Unknown  Unknown  Unknown
libmpi.so.1        00002B8CE4D91D6F  Unknown  Unknown  Unknown
libmpi_mpifh.so.2  00002B8CE4AE04EA  Unknown  Unknown  Unknown
vasp.std           0000000000416628  Unknown  Unknown  Unknown
vasp.std           000000000056F71A  Unknown  Unknown  Unknown
vasp.std           000000000057ABC9  Unknown  Unknown  Unknown
vasp.std           0000000000DD1B80  Unknown  Unknown  Unknown
vasp.std           0000000000E54EBB  Unknown  Unknown  Unknown
vasp.std           000000000152BC27  Unknown  Unknown  Unknown
vasp.std           0000000000411FF6  Unknown  Unknown  Unknown
libc.so.6          000000353061ED5D  Unknown  Unknown  Unknown
vasp.std           0000000000411EE9  Unknown  Unknown  Unknown

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source
vasp.std           0000000001791329  Unknown  Unknown  Unknown
vasp.std           000000000178FBFE  Unknown  Unknown  Unknown
vasp.std           0000000001715FA2  Unknown  Unknown  Unknown
vasp.std           00000000016C30B3  Unknown  Unknown  Unknown
vasp.std           00000000016C8D79  Unknown  Unknown  Unknown
libpthread.so.0    000000353120F7E0  Unknown  Unknown  Unknown
libmkl_avx.so      00002ACD81829C54  Unknown  Unknown  Unknown
libmkl_avx.so      00002ACD8184AC6A  Unknown  Unknown  Unknown
libmkl_avx.so      00002ACD818230D4  Unknown  Unknown  Unknown

xhqu1981 commented 7 years ago

Thanks a lot, @computron. I was wondering whether this was similar to an issue I hit in my own test. After reading your report carefully again, I noticed that your platform is Linux, which is not the OS where that issue is expected. That is why I withdrew my comment yesterday.

xhqu1981 commented 7 years ago

To avoid confusing other people, I am repeating my comment here. I was asking whether std_err contained the line:

"srun: error: Unable to create job step: Job/step already completing or completed"

That line would be some evidence that VASP failed to launch.

@computron's current std_err.txt is empty, so I don't think std_err provides any evidence about the status of VASP in this situation. I am sorry that this is not a helpful clue.
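For anyone repeating this check on other runs, a small helper along the following lines could scan both the plain and the gzipped stderr file for the two messages quoted in this thread. It is only a sketch: `find_marker_lines` and `MARKERS` are names made up here, and the file names in the usage part are assumptions, since the thread refers to both std_error.txt and std_err.txt.

```python
import gzip
import os

# Substrings quoted in this thread; extend as needed.
MARKERS = (
    "process killed (SIGTERM)",
    "Unable to create job step",
)


def find_marker_lines(path, markers=MARKERS):
    """Return the lines of a plain or gzipped text file that contain a marker."""
    if not os.path.exists(path):
        return []
    # Pick the opener from the file extension; both accept text mode "rt".
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle
                if any(marker in line for marker in markers)]


if __name__ == "__main__":
    # Assumed file names; adjust to whatever the run directory contains.
    for name in ("std_err.txt", "std_err.txt.gz"):
        print(name, find_marker_lines(name))
```

An empty result for both files would match the situation described above, where stderr gives no hint about whether VASP launched.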