slipstream / SlipStreamJobEngine

SlipStream distributed job engine
Apache License 2.0

Improve the stability of the JobEngine service #35

Closed schaubl closed 3 years ago

schaubl commented 6 years ago

To do that, a couple of things should be done:

a) Maximize the job success rate. Almost all failed jobs are collect_virtual_machines jobs, and they almost always fail because of wrong or outdated credentials. The exception is NuvlaBox, where jobs fail because a NuvlaBox is occasionally not connected. A solution would be to suspend these jobs for the affected credential for some time. The suspension time could increase each time the next try fails, and in the end the credential could be disabled or removed. An implementation could look at all previous jobs for the current credential and, if all of them are FAILED, take the actions mentioned above.

b) Take action against clouds/credentials whose jobs take too long to complete. For example, we can run these jobs less often, or we can block them for some time. As a last resort we could disable or remove the credential.

c) ... ?
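
Items a) and b) above could be sketched roughly as follows. This is only an illustration, not JobEngine code: the function names, the state strings other than FAILED, and all thresholds and delays (10 minutes base delay, 5 failures before disabling, 30 seconds as the "slow" limit) are assumptions made up for the example.

```python
from datetime import timedelta

FAILED = "FAILED"  # job state named in the issue; other state strings are assumed


def decide_action(job_states, base_delay=timedelta(minutes=10), max_failures=5):
    """Item a): decide what to do with a credential from its job history.

    job_states is the list of states of all previous jobs for one credential.
    Returns a tuple (action, delay), where action is "run", "suspend" or
    "disable". The defaults are hypothetical values.
    """
    # Only act when there is a history and every previous job has failed.
    if not job_states or any(state != FAILED for state in job_states):
        return ("run", timedelta(0))
    failures = len(job_states)
    if failures >= max_failures:
        # Too many failures in a row: give up on this credential.
        return ("disable", None)
    # Double the suspension with each additional failure.
    return ("suspend", base_delay * 2 ** (failures - 1))


def next_interval(duration_s, base_interval_s=60.0,
                  slow_threshold_s=30.0, max_interval_s=3600.0):
    """Item b): run jobs less often for clouds that are slow to answer.

    duration_s is how long the last job took. Jobs under the threshold keep
    the base interval; slower ones are spaced out proportionally, up to a cap.
    """
    if duration_s <= slow_threshold_s:
        return base_interval_s
    factor = duration_s / slow_threshold_s
    return min(base_interval_s * factor, max_interval_s)


if __name__ == "__main__":
    print(decide_action(["FAILED", "FAILED"]))   # suspended, delay doubled once
    print(decide_action(["SUCCESS", "FAILED"]))  # not all failed, keep running
    print(next_interval(60.0))                   # slow job, longer interval
```

Checking that all previous jobs failed follows the wording of item a). A real implementation would more likely look only at the most recent jobs, so that one old success does not hide a credential that has since stopped working.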

The issue about adding an "enabled" attribute on the credential is available here: slipstream/SlipStreamServer#1467

schaubl commented 3 years ago

Doesn't apply anymore