pkopta opened this issue 3 years ago
I think this was partially addressed in 63888cbb, but I'm running into some problems with it.
First, if a job fails to start because the description is invalid, then `manager.Manager` will move it into the `FAILED` state rather than submitting it to the `Executor`. That makes sense, but in this case no `NO_JOBS` event is generated even if the invalid job is the last one, which can cause `api.manager.Manager.wait4all()` to wait for such a message forever.
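A minimal sketch of the hang, using a plain `queue.Queue` as a stand-in for the real event channel (names and structure here are illustrative, not the actual QCG-PilotJob code): the invalid job produces only a FAILED status, so the consumer never sees the `NO_JOBS` marker it is waiting for. A timeout is used below to make the hang observable:

```python
import queue

events = queue.Queue()
# The invalid job is moved straight to FAILED before submission...
events.put(("status", "job1", "FAILED"))
# ...and no ("NO_JOBS",) event is ever enqueued.

def wait4all(events, timeout=0.1):
    """Drain events until NO_JOBS; the timeout stands in for blocking forever."""
    while True:
        try:
            ev = events.get(timeout=timeout)
        except queue.Empty:
            return "hung"  # in the real client this would just block
        if ev[0] == "NO_JOBS":
            return "finished"

print(wait4all(events))  # -> hung
```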
Second, there's a race condition in `api.manager.Manager.wait4all()`. If the last job was valid and finished, and its status and NO_JOBS messages have been queued, then the `AllJobsFinished` check in `wait4all()` will return True and we return from the function immediately. However, it's possible that the JST/JFI/NO_JOBS messages have not yet been received. As a result, on the next call to `wait4all()`, if a job is running then `AllJobsFinished` will return False, but then those previous messages will be received by the poller, and `wait4all()` returns while there are still running jobs.
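The race can be reduced to a stale event left in the queue (again an illustrative sketch, not the real implementation): a `NO_JOBS` message from the already-finished batch makes the next `wait4all()` return even though a freshly submitted job is still running.

```python
import queue

events = queue.Queue()
events.put(("NO_JOBS",))   # stale event from the batch that already finished

running_jobs = {"job2"}    # a new job was just submitted and is still running

def wait4all(events):
    while True:
        ev = events.get()
        if ev[0] == "NO_JOBS":
            return         # returns here although running_jobs is non-empty

wait4all(events)
assert running_jobs        # wait4all() returned with a job still running
```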
This turned out to be a bit tricky to fix but I think I have something that works. See the explanation in #152.
An even better solution would be to replace the synchronous `AllJobsFinished` status request with a request that, on the server side, checks whether there are any active jobs and, if not, posts a NO_JOBS message to the event queue. The client side could then simply call that and process events until it runs into a NO_JOBS one. But that requires a change in the protocol, and it would mean requiring a Poller rather than having it be optional, so I stopped short of that.
Currently, if a client submits 1k jobs, it has to ask about 1k job statuses. Instead, QCG-PJ should return information on whether all submitted jobs have finished.
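The difference between the two approaches can be sketched as follows (a toy model, not the QCG-PJ API): instead of 1000 per-job status round-trips, the manager tracks the unfinished set and answers a single aggregate query.

```python
class Manager:
    """Toy manager tracking only which jobs are still unfinished."""
    def __init__(self, jobs):
        self._unfinished = set(jobs)

    def job_finished(self, job_id):
        self._unfinished.discard(job_id)

    def all_finished(self):
        # One O(1) check instead of one status request per submitted job.
        return not self._unfinished

m = Manager(f"job{i}" for i in range(1000))
for i in range(1000):
    m.job_finished(f"job{i}")
print(m.all_finished())  # -> True
```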