globus / globus-compute

Globus Compute: High Performance Function Serving for Science
https://www.globus.org/compute
Apache License 2.0

parsl provider error messages are lost #679

Open benclifford opened 2 years ago

benclifford commented 2 years ago

Describe the bug
This is based on a report in the #help Slack channel.

When the Slurm provider fails to scale out, the code that is supposed to report that failure to the user itself fails in several ways:

1) There seems to be a static type error in the exception-handling code that runs when scale_out fails: while constructing a more specific exception it references self.config, and the Interchange indeed has no config attribute (a sketch of this follows the list below).

Submission of command to scale_out failed
2022-01-27 14:33:58.605 funcx_endpoint.strategies.simple:43 [ERROR] Caught error in strategize : 'Interchange' object has no attribute 'config'
Traceback (most recent call last):
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 41, in strategize
    self._strategize(*args, **kwargs)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 143, in _strategize
    self.interchange.scale_out(excess_blocks)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/executors/high_throughput/interchange.py", line 1151, in scale_out
    self.config.provider.label,
AttributeError: 'Interchange' object has no attribute 'config'
2) The parsl layer logs an error to e.g. parsl.providers.slurm, but the endpoint admin was unable to find the relevant log message - maybe it should appear around the same place as the above report? The relevant parsl log line is:
            logger.error("Retcode:%s STDOUT:%s STDERR:%s", retcode, stdout.strip(), stderr.strip())
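For illustration, here is a minimal sketch of how the error in point 1 gets lost. This is not the real funcx_endpoint code: the class layout, provider stand-in, and attribute names other than config are assumptions. The point is that the exception path dereferences self.config, which the Interchange never sets, so an AttributeError replaces the message about the failed scale-out.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("interchange-sketch")


class FakeProvider:
    """Stand-in provider whose submit() fails, like sbatch on quota exhaustion."""

    label = "slurm"

    def submit(self, *args, **kwargs):
        return None  # parsl providers return None when submission fails


class Interchange:
    def __init__(self, provider):
        self.provider = provider  # note: no self.config attribute is ever set

    def scale_out(self, blocks):
        block_id = self.provider.submit("worker launch command", blocks)
        if block_id is None:
            # Buggy path (as in the traceback above): self.config does not exist,
            # so this line raises AttributeError and hides the real failure:
            #   logger.error("Submission to %s failed", self.config.provider.label)
            # A fix references an attribute the object actually has:
            logger.error("Submission to provider %s failed", self.provider.label)
        return block_id


Interchange(FakeProvider()).scale_out(1)
```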

To Reproduce
Get the endpoint to try to scale out with a broken provider or provider configuration.

Expected behavior
The errors coming from parsl.providers should lead the user towards fixing the problem (in the example user's case, quota exhaustion reported by sbatch) rather than being hidden.

Environment
Slurm; other component versions unknown.

benclifford commented 2 years ago

I've recreated this in my dev environment by replacing the submit call for my relevant local provider (kube.py) with return None, which emulates the Slurm failure for the purposes of this buggy exception report (point 1 in the issue).
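A hedged sketch of that emulation, assuming parsl is installed; LocalProvider stands in here for the KubernetesProvider actually edited in kube.py, since parsl providers signal a failed submission by returning None from submit():

```python
from parsl.providers import LocalProvider


class AlwaysFailingProvider(LocalProvider):
    """Stand-in for the kube.py edit: every submission 'fails'."""

    def submit(self, *args, **kwargs):
        # Returning None is how parsl providers report a failed submission,
        # e.g. sbatch exiting non-zero because of quota exhaustion.
        return None
```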

For point 2, I have discussed internally with @sirosen how to get parsl (and other) error messages into the endpoint logs.
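One possible approach, sketched here only with the standard library and illustrative handler/path/format choices (not the actual endpoint implementation), is to attach the endpoint's log handler to the parsl logger tree so that messages such as the Retcode/STDOUT/STDERR line above land in the endpoint log:

```python
import logging

# Handler and format are illustrative; the real endpoint configures its own.
endpoint_handler = logging.FileHandler("endpoint.log")
endpoint_handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s:%(lineno)d [%(levelname)s] %(message)s")
)

# Route everything from the parsl logger tree (including parsl.providers.slurm)
# into the same file the endpoint admin is already looking at.
parsl_logger = logging.getLogger("parsl")
parsl_logger.setLevel(logging.WARNING)
parsl_logger.addHandler(endpoint_handler)
```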

benclifford commented 5 months ago

The 2nd part of this might have been fixed by @rjmello in 7b221920a9367b8629be4a454fb2ec81f2af2932.

rjmello commented 5 months ago

> The 2nd part of this might have been fixed by @rjmello 7b22192

Correct; I'd expect the logs to show up now.