Open benclifford opened 2 years ago
I've recreated this in my dev environment by replacing the submit call for my relevant local provider (kube.py) with `return None`, which emulates the slurm failure for the purposes of this buggy exception report - point 1 in the issue.
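Roughly, the change looks like this - a minimal sketch assuming parsl's provider API; the class name is made up and the exact submit() signature may differ from what kube.py really has:

```python
# Minimal sketch of the reproduction: a provider whose submit() returns None
# instead of a job id, emulating the failed slurm scale-out.
from parsl.providers import KubernetesProvider


class BrokenSubmitProvider(KubernetesProvider):
    def submit(self, cmd_string, tasks_per_node, job_name="parsl"):
        # No job id comes back, so the executor's scale-out error handling
        # path (point 1 in the issue) gets exercised.
        return None
```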
For point 2, I have discussed internally with @sirosen about logging parsl (and more) error messages to the endpoint logs.
The second part of this might have been fixed by @rjmello in 7b221920a9367b8629be4a454fb2ec81f2af2932.
Describe the bug
This is based on a report in the #help slack channel.
When the slurm provider fails to scale out, the code that is supposed to report that to the user fails in potentially several ways:
1) This seems to be a static type error in the exception handling code for scale_out failing, when constructing a more specific exception - the interchange indeed has no `config` attribute (an illustrative sketch of this pattern is included below).

2) The relevant parsl log line comes from `parsl.providers.slurm`, but the endpoint admin was unable to find the relevant log message - maybe it should appear around the same place as the above report?

To Reproduce
Get the endpoint to try to scale out with a broken provider/provider configuration.
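Purely as an illustration of point 1 - this is not the actual interchange code; the class, exception, and attribute names here are invented to show the failure pattern:

```python
# Hypothetical sketch only: Interchange, ScaleOutFailed and the attributes
# below are made up, not copied from the real code.
class ScaleOutFailed(Exception):
    pass


class Interchange:
    def __init__(self, provider):
        # Note: no self.config is ever set, which mirrors
        # "interchange indeed has no config" above.
        self.provider = provider

    def scale_out(self):
        try:
            job_id = self.provider.submit("worker launch command", 1)
            if job_id is None:
                raise ScaleOutFailed("provider returned no job id")
        except Exception as e:
            # Bug pattern: building the "more specific" exception touches
            # self.config, which does not exist, so an AttributeError is
            # raised here and masks the provider error we wanted to report.
            raise ScaleOutFailed(
                f"scale out failed for provider {self.config.provider}"
            ) from e
```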
Expected behavior
The errors coming from parsl.providers should lead the user towards fixing the problem (in the example user's case, a quota exhaustion reported by `sbatch`) rather than being hidden.

Environment
slurm; other component versions unknown