Many errors are not obvious, or are even difficult to discover.
On the agent side, .err files may be empty, while ERROR level log messages appear in the corresponding .log files.
Errors on the execution side may not produce errors or exceptions in the "client side" script that invoked RP.
Failures in bootstrapping or other details outside of a Task may cause abrupt termination of the session with no indication to the caller of what went wrong. The script simply completes early with very little output.
Some output seems to be completely lost. A malformed Slurm batch script, for instance, does not seem to result in a non-zero exit code being recorded anywhere, and the error that you would normally see on the command line (an invalid resource request, for example) does not seem to appear in any logs on the client or agent side.
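As a stopgap for the empty .err files point, here is a minimal sketch of a helper that walks a directory tree and reports ERROR lines found in .log files whose sibling .err file is missing or empty. It uses only the Python standard library and is not part of RP. The function name `find_hidden_errors`, the assumption that each `X.log` has a sibling `X.err`, and the plain substring match on `ERROR` are all illustrative guesses rather than a description of RP's actual sandbox layout or log format.

```python
from pathlib import Path


def find_hidden_errors(sandbox):
    """Return (log_path, line) pairs for ERROR lines in *.log files
    whose sibling *.err file is missing or empty.

    Assumption (hypothetical layout): 'X.log' and 'X.err' sit side by side.
    """
    hits = []
    for log_path in sorted(Path(sandbox).rglob("*.log")):
        err_path = log_path.with_suffix(".err")
        if err_path.exists() and err_path.stat().st_size > 0:
            continue  # the .err file already reports something
        with log_path.open(errors="replace") as fh:
            for line in fh:
                # Assumption: the log level appears literally as 'ERROR'.
                if "ERROR" in line:
                    hits.append((log_path, line.rstrip()))
    return hits


# Example use (the path is a placeholder, not a real RP location):
# for path, line in find_hidden_errors("path/to/session_sandbox"):
#     print(f"{path}: {line}")
```

This only surfaces messages that were logged somewhere. It does not help with output that is never captured, such as the batch submission error described above.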
I agree. I had similar issues with the bootstrapper and felt the same frustration. Could you open a ticket in RP about this so that we can start a discussion outside the scope of this PR?
_Originally posted by @mturilli in https://github.com/radical-cybertools/radical.pilot/pull/2855#discussion_r1144742284_