rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0
45 stars 24 forks source link

Catch `subprocess.TimeoutError` once #196

Closed Icemole closed 3 months ago

Icemole commented 3 months ago

This was the root cause of several jobs with the same features being submitted, even though they had already been submitted:

  1. We use the gateway option and there's a TimeoutError here.
  2. From the point of view of sisyphus, the ssh gw-02 squeue finishes with error -1. Sisyphus then reaches here safely. It hasn't crashed with a TimeoutError because we haven't propagated it.
  3. No TimeoutError is raised after calling self.system_call() because it's already been addressed inside self.system_call() and it hasn't been propagated. Therefore, the process ignores the except block here. If retval == -1, sisyphus doesn't care either!

In my view, the problem comes from not addressing errors that could have been obtained from the return values. Therefore, this PR correctly addresses return codes different from zero, and leaves the work of addressing the exceptions of the subprocesses to self.system_call().

albertz commented 3 months ago

For reference, related is the recent change in PR #191 on how to handle TimeoutExpired.

albertz commented 3 months ago

We use the gateway option and there's a TimeoutError here.

I think you mean TimeoutExpired?

albertz commented 3 months ago

I already asked in #191 whether catching TimeoutExpired is really correct, or whether we just should pass on this exception, exactly as you do now in this PR. In #191, there was no answer to that. See https://github.com/rwth-i6/sisyphus/pull/191#discussion_r1642474520 and https://github.com/rwth-i6/sisyphus/pull/191#discussion_r1644426773.

albertz commented 3 months ago

As I wrote before, I wonder, why do we actually do this change? Why not keep the original behavior, i.e. just not handle TimeoutExpired in system_call?

Icemole commented 3 months ago

Thanks for the comments Albert. I didn't see @michelwi's and your comments. Indeed, this was the source of an issue.

I see that the return value is correctly handled below. I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

albertz commented 3 months ago

I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

I.e. you update this PR here, to remove the try:/except TimeoutExpired: in system_call?

Icemole commented 3 months ago

I will therefore leave the code as it was, catching the TimeoutExpired outside system_call.

I.e. you update this PR here, to remove the try:/except TimeoutExpired: in system_call?

Yes! I did that just now :)