Closed erbas closed 3 years ago
Can you please provide specific inputs to reproduce the error? It is generally not possible to provide better information for 500 errors. They specifically mean "something completely unexpected went wrong on the server, so we have nothing more to say." Errors that can be reported in more detail are those below 500 (bad input, unauthorized, and the other 400s).
Can you please file specifics here for a repro so I can look into it? Thanks!
Yeah, it's hard to provide input as it's arising during an optimization run building an interpolated PES. I think my point would be that the user should never see a 500.
Anyway, here's another failure case I see quite a lot of...
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/qcengine/util.py", line 114, in compute_wrapper
    yield metadata
  File "/opt/conda/lib/python3.7/site-packages/qcengine/compute.py", line 91, in compute
    output_data = executor.compute(input_data, config)
  File "/opt/conda/lib/python3.7/site-packages/qcengine/programs/terachem_pbs.py", line 103, in compute
    return client.compute(input_model)
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 162, in compute
    while not self.check_job_complete():
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 255, in check_job_complete
    status = self._recv_msg(pb.STATUS)
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 792, in _recv_msg
    raise ServerError("Could not recv header: {}".format(msg), self)
tcpb.exceptions.ServerError: Could not recv header: [Errno 104] Connection reset by peer
Server Address: ('xs7-0004-terachem-1', 11111)
Could not open logfile
lolz. Yeah, 500s are never intended. This error usually means the TeraChem PBS server is not available: a worker looked for it and found nothing listening on the host/port it expected. This can happen when a batch of computations crashes the TeraChem servers and new jobs are picked up and attempted before the servers can be restarted. There was pretty much no error handling built into the TCPB interface, so you're seeing the same error handling that has been there since the beginning, which is not much. Over time I'm capturing error cases and working on the system to handle them gracefully, so we live with those errors for now. The error above originates in the TCPB package and would be most appropriately addressed there. Error handling there could, for example, retry connections a few times. Perhaps add it to that package? TCC transparently passes along errors raised by lower-level packages.
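To make the "retry connections a few times" idea concrete, here is a minimal sketch of a generic retry helper with exponential backoff. It is not TCPB code: the `retry` function and all of its parameters are hypothetical names chosen for illustration, and it retries on Python's built-in `ConnectionError` by default.

```python
import time


def retry(func, attempts=3, delay=0.5, backoff=2.0,
          retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func() and return its result, retrying on selected exceptions.

    Hypothetical helper, not part of tcpb. Waits `delay` seconds after the
    first failure and multiplies the wait by `backoff` after each retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error unchanged
            sleep(delay)
            delay *= backoff
```

Inside TCPB this kind of wrapper would probably sit around the connection step, with `retry_on` set to whatever exception that step raises (the traceback above shows `tcpb.exceptions.ServerError`). Wrapping a whole `compute` call is riskier, since a retry could resubmit a job the server has already started.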
If you would like to address this issue, check out the code here: https://github.com/mtzgroup/tcpb-client/blob/162cecde1652a267ec1e956f16a34dccd539f436/tcpb/tcpb.py#L102-L113. This is the connection code that raises the error. Happy to review a PR if you submit one there. Closing since this issue is related to TCPB. Feel free to log an issue there and create a PR!
This is the 500 I mentioned in the previous bug report. The user really needs more information.