mtzgroup / chemcloud-client

Python client for TeraChem Cloud
MIT License

More graceful handling of terachem failures #27

Closed erbas closed 3 years ago

erbas commented 3 years ago

It is quite common for a TeraChem calculation to fail. The TCCloud API needs to give the user more helpful information than an HTTP 500 so that they can modify their input accordingly.

erbas commented 3 years ago

This stack trace might be helpful:

  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 446, in compute_gradient
    results = self.compute_blocking(geom, ('energy', 'gradient'), job_type=JobType.gradient, *args, **kwargs)
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 408, in compute_blocking
    results = [a for a in self.compute(geom, fields, job_type, *args, **kwargs)]
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 408, in <listcomp>
    results = [a for a in self.compute(geom, fields, job_type, *args, **kwargs)]
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 364, in compute
    raise result
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/base.py", line 260, in _process
    result = self.compute__(geom, job_type, workerid, *args, **kwargs)
  File "/Users/keiran/repos/nanoreactor2/nanoreactor/engine/tcc_new_engine.py", line 66, in compute__
    nanoreactor_output['gradient'] = result.return_result
AttributeError: 'FailedOperation' object has no attribute 'return_result'

coltonbh commented 3 years ago

You are accessing a property that does not exist on the object you got back. This is not a TeraChem failure. This is user error ;)

Before trying to access a return result, ensure your computation was successful:

if result.success:
    # access your return result, do whatever
    gradient = result.return_result
else:
    # handle your failed operation
    pass

Note this pattern is covered in the docs here: https://mtzgroup.github.io/tccloud/tutorial/geometry_optimization/

Would be nice to add to the Compute docs section too.
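
Applied to a call site like the `compute__` method in the stack trace, the check might look like the minimal sketch below. The `SimpleNamespace` objects are stand-ins for whatever the client returns, and `extract_gradient` is a hypothetical helper name. Only the `success` and `return_result` attributes come from this thread.

```python
from types import SimpleNamespace


def extract_gradient(result):
    """Return the gradient from a successful result, or raise a clear error."""
    if result.success:
        return result.return_result
    # A failed operation has no return_result, so raise something descriptive
    # instead of letting an AttributeError surface later.
    raise RuntimeError(f"Computation failed: {result!r}")


# Stand-ins for real results (assumption: only these attributes are modelled)
ok = SimpleNamespace(success=True, return_result=[[0.0, 0.0, 0.1]])
failed = SimpleNamespace(success=False)

print(extract_gradient(ok))
try:
    extract_gradient(failed)
except RuntimeError as exc:
    print(exc)
```

Raising at the point of the check keeps the failure close to its cause, which is easier to debug than an `AttributeError` several frames away.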

coltonbh commented 3 years ago

You'll note the API returned an object for you to work with, the FailedOperation object (as per the last line of your stack trace). Check the docs to see how to understand your failure. These are not 500 errors.

https://mtzgroup.github.io/tccloud/code_reference/FailedOperation/
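
As a rough sketch of what inspecting such an object could look like, the example below uses stand-in dataclasses rather than the real client classes. The field names (`error`, `error_type`, `error_message`) are assumptions modelled on a QCElemental-style layout, and the values are made up for illustration. Check the reference linked above for the actual attributes.

```python
from dataclasses import dataclass


@dataclass
class ComputeError:
    """Stand-in for the error payload (field names are assumptions)."""
    error_type: str
    error_message: str


@dataclass
class FailedOperation:
    """Stand-in for the client's FailedOperation (not the real class)."""
    error: ComputeError
    success: bool = False


def describe_failure(failed: FailedOperation) -> str:
    """Build a one-line, human-readable summary of a failed operation."""
    return f"{failed.error.error_type}: {failed.error.error_message}"


# Made-up example values, for illustration only
failed = FailedOperation(ComputeError("input_error", "basis set not recognised"))
print(describe_failure(failed))
```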

erbas commented 3 years ago

Yes, I see the logic. This is really a bug report against the TCCNewEngine class. Still, the API needs to return the last n lines of the TeraChem log file so the user has a chance of understanding what caused the failure.

coltonbh commented 3 years ago

Since TeraChem does not return log files via the TCPB interface, expect this feature to be some time off: it would require changes at the TeraChem level, then at the TCPB interface level. I agree it would be great to have 👍 If you have experience at this level of TeraChem and can add the ability for the TCPB interface to return TeraChem log files, I'll gladly add it to TCC once complete.

coltonbh commented 3 years ago

A short-term hack I use, since TeraChem doesn't return log files, is to send the same computation to psi4 and get the full log file back for diagnostics. Not ideal, but an option.

erbas commented 3 years ago

The old API and client did this. I think the API server was grabbing the last 100 lines of the log file, using the path information that the TCPB server returned.
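
The tail-of-log behaviour described here could be sketched roughly as follows. This is not the old server's code. `tail_lines` is a hypothetical helper, the default of 100 mirrors the number recalled above, and the temporary file merely stands in for a TeraChem log whose path the server would have to know.

```python
import tempfile
from collections import deque
from pathlib import Path


def tail_lines(log_path, n=100):
    """Return the last n lines of a text file, reading it as a stream."""
    with open(log_path, "r", errors="replace") as handle:
        # A bounded deque keeps only the final n lines as the file is read
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Demo on a throwaway file standing in for a TeraChem log
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "tc.out"
    log.write_text("\n".join(f"line {i}" for i in range(250)) + "\n")
    tail = tail_lines(log, n=100)

print(len(tail), tail[0], tail[-1])
```

The bounded `deque` avoids holding the whole log in memory, which matters if the log files are large.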

psi4 output will be irrelevant for understanding why TeraChem crashed.