psnc-qcg / QCG-PilotJob

The QCG Pilot Job service for execution of many computing tasks inside one allocation
Apache License 2.0

more optimal wait for finish of all jobs #95

Open pkopta opened 3 years ago

pkopta commented 3 years ago

Currently, if a client submits 1k jobs, it has to ask for 1k job statuses. Instead, QCG-PJ should return a single piece of information saying whether all submitted jobs have finished.
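To make the cost difference concrete, here is a minimal sketch that counts requests in both approaches. `ToyService`, `job_status()`, `all_finished()` and the state names are hypothetical stand-ins for illustration, not the QCG-PilotJob API.

```python
# Hypothetical toy service; not QCG-PilotJob code.
FINAL_STATES = {"FINISHED", "FAILED"}  # assumed names for terminal states


class ToyService:
    def __init__(self, states):
        self.states = dict(states)  # job name -> state
        self.requests = 0           # number of round trips made by the client

    def job_status(self, name):
        """One request per job."""
        self.requests += 1
        return self.states[name]

    def all_finished(self):
        """One aggregate request for the whole batch."""
        self.requests += 1
        return all(s in FINAL_STATES for s in self.states.values())


def all_done_per_job(service, names):
    # The current pattern: ask about every job separately.
    return all(service.job_status(n) in FINAL_STATES for n in names)


names = [f"job{i}" for i in range(1000)]
service = ToyService({n: "FINISHED" for n in names})

per_job_result = all_done_per_job(service, names)
per_job_requests = service.requests       # 1000 round trips

service.requests = 0
aggregate_result = service.all_finished()
aggregate_requests = service.requests     # 1 round trip
```

Both calls give the same answer here; only the number of round trips differs.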

LourensVeen commented 2 years ago

I think this was partially addressed in 63888cbb, but I'm running into some problems with it.

First, if a job fails to start because the description is invalid, then manager.Manager will move it into the FAILED state, rather than submitting it to the Executor. That makes sense, but in this case no NO_JOBS event is generated even if the invalid job is the last one, which can cause api.manager.Manager.wait4all() to wait for such a message forever.
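A toy model of this first problem, under stated assumptions: `ToyManager`, the `"exec"` validity rule and the `JOB_FAILED`/`JOB_FINISHED` event names are all invented for illustration and do not reflect the real manager.Manager code. The flag only shows one conceivable remedy (checking for an empty job set on the validation-failure path too), not necessarily what #152 does.

```python
from queue import Queue


class ToyManager:
    """Hypothetical stand-in for the server-side manager; not QCG-PilotJob code."""

    def __init__(self, emit_on_failed_validation):
        self.events = Queue()
        self.active = 0
        self.emit_on_failed_validation = emit_on_failed_validation

    def submit(self, description):
        if "exec" not in description:          # assumed validity rule
            # Invalid job: goes straight to FAILED, never reaches the executor.
            self.events.put("JOB_FAILED")
            if self.emit_on_failed_validation and self.active == 0:
                self.events.put("NO_JOBS")     # possible remedy (assumption)
            return
        self.active += 1

    def finish_one(self):
        self.active -= 1
        self.events.put("JOB_FINISHED")
        if self.active == 0:
            self.events.put("NO_JOBS")


def drain(manager):
    """Collect every event currently queued."""
    out = []
    while not manager.events.empty():
        out.append(manager.events.get())
    return out


buggy = ToyManager(emit_on_failed_validation=False)
buggy.submit({})                 # invalid description
buggy_events = drain(buggy)      # no NO_JOBS: a waiter would block forever

patched = ToyManager(emit_on_failed_validation=True)
patched.submit({})
patched_events = drain(patched)  # NO_JOBS follows the failure
```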

Second, there's a race condition in api.manager.Manager.wait4all(). If the last job was valid and has finished, and its status and NO_JOBS messages have been queued, then the AllJobsFinished check in api.manager.Manager.wait4all() will return True and we return from the function immediately. However, it's possible that the JST/JFI/NO_JOBS messages have not yet been received by the poller at that point. As a result, if a job is running on the next call to wait4all(), AllJobsFinished will return False, but then those earlier messages will be received by the poller and wait4all() returns while there are still running jobs.
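The sequence can be replayed deterministically in a toy model. Everything here is hypothetical (`ToyWorld`, `wait4all_naive()`, the `JOB_FINISHED` event name); the `in_flight` deque merely stands in for messages that were published but not yet read by the poller.

```python
from collections import deque


class ToyWorld:
    """Toy model of one server plus one client; not QCG-PilotJob code."""

    def __init__(self):
        self.active = 0            # jobs the server still considers running
        self.in_flight = deque()   # events published but not yet read

    def submit(self):
        self.active += 1

    def finish_one(self):
        self.active -= 1
        self.in_flight.append("JOB_FINISHED")
        if self.active == 0:
            self.in_flight.append("NO_JOBS")

    def wait4all_naive(self):
        # Step 1: synchronous status check (stands in for AllJobsFinished).
        if self.active == 0:
            return "status check"  # returns without reading in_flight
        # Step 2: otherwise consume events until a NO_JOBS shows up.
        while self.in_flight:
            if self.in_flight.popleft() == "NO_JOBS":
                return "NO_JOBS event"
        return "would block"


world = ToyWorld()
world.submit()
world.finish_one()                 # queues JOB_FINISHED and NO_JOBS
first = world.wait4all_naive()     # returns via the status check

world.submit()                     # a new job is now running
second = world.wait4all_naive()    # consumes the stale NO_JOBS
still_running = world.active       # 1: we returned too early
```

The second call returns on a NO_JOBS event that belongs to the first batch, while one job is still active.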

This turned out to be a bit tricky to fix but I think I have something that works. See the explanation in #152.

An even better solution would be to replace the synchronous status request that checks AllJobsFinished with a request that, on the server side, checks whether there are any active jobs and, if there are none, posts a NO_JOBS message to the event queue. The client side can then simply send that request and process events until it runs into a NO_JOBS one. But that requires a change in the protocol, and it would make the Poller mandatory rather than optional, so I stopped short of that.
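A rough sketch of how that request-then-drain idea could look, in a toy model only. `ToyServer`, `notify_when_idle()` and the `JOB_FINISHED` event name are hypothetical. The sketch also adds an assumption that is not in the proposal above: NO_JOBS is posted only in response to a request (immediately if idle, otherwise when the last job finishes), so that each wait call sees exactly one NO_JOBS and no stale one is left behind.

```python
import threading
from queue import Queue


class ToyServer:
    """Hypothetical server; not the QCG-PilotJob protocol."""

    def __init__(self):
        self.events = Queue()
        self.active = 0
        self.armed = False            # a client asked to be told when idle
        self.lock = threading.Lock()  # keeps check-and-post atomic

    def submit(self):
        with self.lock:
            self.active += 1

    def finish_one(self):
        with self.lock:
            self.active -= 1
            self.events.put("JOB_FINISHED")
            if self.active == 0 and self.armed:
                self.armed = False
                self.events.put("NO_JOBS")

    def notify_when_idle(self):
        """The new request: post NO_JOBS now if idle, else when idle."""
        with self.lock:
            if self.active == 0:
                self.events.put("NO_JOBS")
            else:
                self.armed = True


def wait4all(server, timeout=5.0):
    """Send the request, then process events until NO_JOBS arrives."""
    server.notify_when_idle()
    seen = []
    while True:
        # Raises queue.Empty if nothing arrives within the timeout.
        event = server.events.get(timeout=timeout)
        seen.append(event)
        if event == "NO_JOBS":
            return seen
```

In this sketch the client never needs a separate status check, which is why an event channel (the Poller, in the real client) becomes mandatory.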