More error handling. The new method has its drawbacks too: I found some edge cases and quirks.
Problems were mostly related to scenarios like this:
The job is already "removed". For example, when I set the TTL very short for testing, the dir gets moved to __finished_jobs__ even though we don't technically kill the job. Then, when a result actually comes in, e.g. from the FETCH_INPUT => READ_INPUT step, it tries to update the state and log in a dir that has already been moved. I've added some try...except blocks to improve the situation, but it's still error-prone until we have a way to actually stop the job.
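To illustrate the kind of guard I mean, here is a minimal sketch, not the actual code in this PR. The function name `record_result` and the file names `state.json` and `job.log` are made up for the example; only `__finished_jobs__` comes from the real layout. It assumes the job dir is simply gone from its original path once the TTL expires.

```python
import json
from pathlib import Path

FINISHED_DIR_NAME = "__finished_jobs__"  # where expired job dirs get moved


def record_result(jobs_root: Path, job_id: str, new_state: str, message: str) -> bool:
    """Update state and log for a job; return False if its dir was moved away.

    Hypothetical helper: the file names below are assumptions for illustration.
    """
    job_dir = jobs_root / job_id
    try:
        # Both writes raise FileNotFoundError if job_dir no longer exists,
        # so a late result cannot recreate or corrupt a moved job.
        (job_dir / "state.json").write_text(json.dumps({"state": new_state}))
        with (job_dir / "job.log").open("a") as log:
            log.write(message + "\n")
        return True
    except FileNotFoundError:
        # TTL expired while the job was still running: drop the late result.
        return False
```

The caller can check the return value and skip any follow-up work for that job. This only hides the symptom: the job keeps running after its dir is moved, so a real stop mechanism is still needed.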
Please take a look at these changes and see if they work for you. Feel free to make any changes needed to improve stability.