Closed Icemole closed 3 months ago
For reference, related is the recent change in PR #191 on how to handle TimeoutExpired.
We use the gateway option and there's a TimeoutError here.
I think you mean TimeoutExpired?
I already asked in #191 whether catching TimeoutExpired is really correct, or whether we just should pass on this exception, exactly as you do now in this PR. In #191, there was no answer to that. See https://github.com/rwth-i6/sisyphus/pull/191#discussion_r1642474520 and https://github.com/rwth-i6/sisyphus/pull/191#discussion_r1644426773.
As I wrote before, I wonder, why do we actually do this change? Why not keep the original behavior, i.e. just not handle TimeoutExpired in system_call?
Thanks for the comments Albert. I didn't see @michelwi's and your comments. Indeed, this was the source of an issue.
I see that the return value is correctly handled below. I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.
I.e. you update this PR here, to remove the try:/except TimeoutExpired: in system_call?
Yes! I did that just now :)
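A minimal sketch of the arrangement agreed on in this thread, with no try/except inside system_call and the TimeoutExpired handled by the caller. The signature and return tuple of system_call here are assumptions for illustration, and query_queue is a hypothetical caller, not the actual sisyphus code:

```python
import subprocess
import sys


def system_call(command, timeout):
    """No try/except here: subprocess.TimeoutExpired propagates to the caller."""
    proc = subprocess.run(command, capture_output=True, timeout=timeout)
    return proc.stdout.splitlines(), proc.stderr.splitlines(), proc.returncode


def query_queue(command, timeout):
    """Hypothetical caller: returns None when the queue state could not be read."""
    try:
        out, err, retval = system_call(command, timeout)
    except subprocess.TimeoutExpired:
        # The timeout is handled outside system_call, where the caller
        # can decide to skip this update and retry later.
        return None
    if retval != 0:
        # A failed command must not be mistaken for an empty queue.
        return None
    return [line.decode() for line in out]


# Stand-in for a slow `ssh <gateway> squeue` call (assumption for the demo).
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(query_queue(slow, timeout=0.2))
```

With this split, the caller sees both failure modes (timeout and non-zero return code) and can treat them as "state unknown" rather than as "nothing queued".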
This was the root cause of several jobs with the same features being submitted, even though they had already been submitted:
1. We use the gateway option and there's a TimeoutError here.
2. ssh gw-02 squeue finishes with error -1. Sisyphus then reaches here safely. It hasn't crashed with a TimeoutError because we haven't propagated it.
3. No TimeoutError is raised after calling self.system_call() because it's already been addressed inside self.system_call() and it hasn't been propagated. Therefore, the process ignores the except block here.
4. If retval == -1, sisyphus doesn't care either!
In my view, the problem comes from not addressing errors that could have been obtained from the return values. Therefore, this PR correctly addresses return codes different from zero, and leaves the work of addressing the exceptions of the subprocesses to self.system_call().
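The failure mode described above can be reproduced in a small, self-contained sketch. Everything here is an assumption for illustration: system_call is a simplified stand-in that swallows the timeout and returns -1, and queued_jobs_unchecked / queued_jobs_checked are hypothetical callers, not the sisyphus engine code:

```python
import subprocess
import sys


def system_call(command, timeout):
    """Stand-in helper: swallows TimeoutExpired and reports it as retval -1."""
    try:
        proc = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return [], [], -1
    return proc.stdout.splitlines(), proc.stderr.splitlines(), proc.returncode


def queued_jobs_unchecked(command, timeout):
    """Problematic pattern: the except block never runs and retval is ignored."""
    try:
        out, err, retval = system_call(command, timeout)
    except subprocess.TimeoutExpired:
        return None  # never reached: system_call already handled the exception
    # An empty output is indistinguishable from "no jobs are queued".
    return [line.decode() for line in out]


def queued_jobs_checked(command, timeout):
    """Checked pattern: a non-zero return code means the queue state is unknown."""
    out, err, retval = system_call(command, timeout)
    if retval != 0:
        return None
    return [line.decode() for line in out]


# Stand-in for a queue query that hangs (assumption for the demo).
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(queued_jobs_unchecked(slow, timeout=0.2))  # prints []
print(queued_jobs_checked(slow, timeout=0.2))    # prints None
```

In the unchecked variant the timed-out query looks like an empty queue, which is how already-submitted jobs could end up being submitted again; checking the return code makes the failure visible.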