mtzgroup / chemcloud-client

Python client for TeraChem Cloud
MIT License

Need more helpful error messages from API #28

Closed erbas closed 3 years ago

erbas commented 3 years ago

This is the 500 I mentioned in the previous bug report. The user really needs more information.

  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 446, in compute_gradient
    results = self.compute_blocking(geom, ('energy', 'gradient'), job_type=JobType.gradient, *args, **kwargs)
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 408, in compute_blocking
    results = [a for a in self.compute(geom, fields, job_type, *args, **kwargs)]
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 408, in <listcomp>
    results = [a for a in self.compute(geom, fields, job_type, *args, **kwargs)]
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 375, in compute
    raise result
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 260, in _process
    result = self.compute__(geom, job_type, workerid, *args, **kwargs)
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/tcc_new_engine.py", line 62, in compute__
    result = future_result.get()
  File "/Users/keiran/.local/lib/python3.8/site-packages/tccloud/models.py", line 119, in get
    while self.status not in _READY_STATES:
  File "/Users/keiran/.local/lib/python3.8/site-packages/tccloud/models.py", line 150, in status
    self.compute_status, self.result = self.client.result(self.to_task())
  File "/Users/keiran/.local/lib/python3.8/site-packages/tccloud/http_client.py", line 315, in result
    task_result = self._authenticated_request(
  File "/Users/keiran/.local/lib/python3.8/site-packages/tccloud/http_client.py", line 196, in _authenticated_request
    return self._request(
  File "/Users/keiran/.local/lib/python3.8/site-packages/tccloud/http_client.py", line 188, in _request
    response.raise_for_status()
  File "/Users/keiran/.local/lib/python3.8/site-packages/httpx/_models.py", line 1426, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: 500 Server Error: Internal Server Error for url: https://tccloud.mtzlab.com/api/v1/compute/result
For more information check: https://httpstatuses.com/500

coltonbh commented 3 years ago

Can you please provide specific inputs that reproduce the error? It is generally not possible to give better information for a 500; it specifically means "something completely unexpected went wrong on the server, so we have nothing more to say." The errors that can carry better reporting are the ones below 500 (bad input, unauthorized, all the 4xx codes, etc.).

Can you please post the specifics here for a repro so I can look into it? Thanks!
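
The split described above (a 4xx response can say what was wrong with the request, a 500 cannot) could be surfaced to users roughly like this. It is only a sketch: `describe_http_error` is a hypothetical helper, not part of tccloud or httpx, and the message wording is invented for illustration.

```python
from http import HTTPStatus


def describe_http_error(status_code: int, detail: str = "") -> str:
    """Build a user-facing message for a failed HTTP response.

    Hypothetical helper for illustration; not part of any existing client.
    """
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Status code is not one the standard library knows about.
        phrase = "Unknown Status"

    if 400 <= status_code < 500:
        # Client errors: the server understood the request and can explain the problem.
        hint = detail or "no further detail supplied by the server"
        return f"{status_code} {phrase}: request rejected ({hint})"
    if status_code >= 500:
        # Server errors: nothing specific is known, so ask for a reproduction instead.
        return (
            f"{status_code} {phrase}: unexpected server-side failure; "
            "please report the inputs that triggered it"
        )
    return f"{status_code} {phrase}"
```

A client could call this with `response.status_code` and whatever error detail the response body carries before raising its own exception.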

erbas commented 3 years ago

Yeah, it's hard to provide input, as the error arises during an optimization run that builds an interpolated PES. I think my point would be that the user should never see a 500.

Anyway, here's another failure case I see quite a lot of...

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/qcengine/util.py", line 114, in compute_wrapper
    yield metadata
  File "/opt/conda/lib/python3.7/site-packages/qcengine/compute.py", line 91, in compute
    output_data = executor.compute(input_data, config)
  File "/opt/conda/lib/python3.7/site-packages/qcengine/programs/terachem_pbs.py", line 103, in compute
    return client.compute(input_model)
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 162, in compute
    while not self.check_job_complete():
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 255, in check_job_complete
    status = self._recv_msg(pb.STATUS)
  File "/opt/conda/lib/python3.7/site-packages/tcpb/tcpb.py", line 792, in _recv_msg
    raise ServerError("Could not recv header: {}".format(msg), self)
tcpb.exceptions.ServerError: Could not recv header: [Errno 104] Connection reset by peer

Server Address: ('xs7-0004-terachem-1', 11111)
Could not open logfile

coltonbh commented 3 years ago

lolz. Yeah, 500s are never intended. The ServerError in your second traceback usually means the TeraChem PBS server is not available: a worker looked for it and found nothing listening on the host/port it expected. This can happen when a batch of computations crashes the TeraChem servers and new jobs are picked up and attempted before the servers can be restarted. There was pretty much no error handling built into the TCPB interface, so you're seeing the same error handling that has been there since the beginning, which is not much... Over time I'm capturing error cases and working on the system so it handles errors gracefully, so we live with those errors for now. The error above would be most appropriately addressed in the TCPB package, since that is where it originates. Error handling there could retry connections a few times, etc. Perhaps add it to that package? TCC transparently passes along errors raised by lower-level packages.

coltonbh commented 3 years ago

If you would like to address this issue, check out the code here: https://github.com/mtzgroup/tcpb-client/blob/162cecde1652a267ec1e956f16a34dccd539f436/tcpb/tcpb.py#L102-L113. This is the connection code that raises the error. I'm happy to review a PR if you submit one there. Closing, since this issue is related to TCPB. Feel free to log an issue there and create a PR!
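
For anyone picking this up, the "retry connections a few times" idea could look something like the sketch below. It assumes a plain TCP connection opened with the standard library's `socket.create_connection`. It does not reproduce the linked TCPB code, and `connect_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, delay=1.0, backoff=2.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying with a growing delay on OSError.

    Hypothetical helper; `connect` and `sleep` are injectable so the
    retry logic can be exercised without a real server.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect((host, port))
        except OSError as exc:
            # OSError covers refused and reset connections as well as timeouts.
            last_error = exc
            if attempt < attempts:
                sleep(delay)
                delay *= backoff  # wait longer before each further attempt

    raise ConnectionError(
        f"could not connect to {host}:{port} after {attempts} attempts: {last_error}"
    ) from last_error
```

With the address from the traceback above, a call would be `connect_with_retry('xs7-0004-terachem-1', 11111)`. Retrying only helps if the server comes back within the retry window, so the final error message still names the host, port, and last underlying failure.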