Closed: tomwhite closed this 6 months ago
Hi @tomwhite, thanks for reporting.
Note that the issue of a missing `worker_end_tstamp` has recently been fixed here.
How do job monitoring and `TimeoutError` work?
Functions raise a `TimeoutError` by themselves 5 seconds before reaching the `execution_timeout`.
The job monitor raises a `TimeoutError` if a function exceeds the `execution_timeout` and no status object is found in the storage backend. In this scenario, Lithops generates a fake `call_status`. You can view this logic here. This prevents Lithops from waiting indefinitely for functions that have been terminated by the cloud provider for no apparent reason, or for functions that take far longer to finish than other functions from the same map.
Hi @JosepSampe, thanks for the explanation! I'll try out the fix and see if it helps in this case.
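For readers following along, the job-monitor behavior described above can be sketched roughly as follows. This is not the Lithops implementation: `check_call`, the `stored_statuses` dict standing in for the storage backend, and the `exception`/`synthetic` field names are all hypothetical, chosen only to illustrate the idea of generating a placeholder status once the deadline has passed.

```python
import time


def check_call(call_id, invoked_at, execution_timeout, stored_statuses, now=None):
    """Return a status dict for one call, or None if the monitor should keep waiting.

    Hypothetical helper: `stored_statuses` stands in for the storage backend.
    """
    now = time.time() if now is None else now

    status = stored_statuses.get(call_id)
    if status is not None:
        # The worker wrote its own status object; use it as-is.
        return status

    if now - invoked_at > execution_timeout:
        # Deadline passed and nothing in storage: build a placeholder
        # status so the monitor does not wait forever on this call.
        return {
            "call_id": call_id,
            "exception": True,
            "error": f"TimeoutError: call exceeded {execution_timeout}s",
            "synthetic": True,  # hypothetical marker for a generated status
        }

    # Still within the timeout and no status yet: keep polling.
    return None
```

In this sketch a placeholder status carries only the fields the monitor itself can know, so any code that consumes statuses has to cope with worker-side fields being absent.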
Not sure if it is also related to the issue you have experienced, but you can see a basic "retry failed invocations" example here that should work properly. (I don't know if you are using a more sophisticated mechanism in Cubed to retry a task using the same `FunctionExecutor()`.)
> Note that the issue of a missing `worker_end_tstamp` has recently been fixed here.
That seems to fix the issue - thanks!
> (I don't know if you are using a more sophisticated mechanism in Cubed to retry a task using the same `FunctionExecutor()`.)
I've opened #1289.
I am seeing the following error which looks like a race condition:
It's triggered by running a large Cubed workload on Lithops, but I don't think it's Cubed-specific. It looks like the following is what's happening:
Here's a part of the Lithops logs showing these events. (I have patched Lithops to not raise the exception, but instead print "unknown" when the `worker_end_tstamp` key is missing.)
Notice the last three lines, where first the job monitor finishes, then there is a status update from call 04260 with the missing key.
I'm not sure how to reproduce this on a small example. The workaround in https://github.com/lithops-cloud/lithops/compare/master...tomwhite:lithops:missing-worker-exec-time helps the job run to completion, but doesn't fix the underlying problem.
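As a rough illustration of that kind of workaround (not the actual patch in the branch above), the idea is to read the timestamp defensively instead of indexing the status dict directly. The helper name `worker_exec_time` and the `worker_start_tstamp` key are assumptions made for this sketch; only `worker_end_tstamp` and the "unknown" fallback come from the description above.

```python
def worker_exec_time(call_status):
    """Return the worker execution time in seconds, or "unknown".

    Hypothetical helper: tolerates a status dict that lacks one or both
    timestamps instead of raising KeyError.
    """
    start = call_status.get("worker_start_tstamp")  # assumed key name
    end = call_status.get("worker_end_tstamp")
    if start is None or end is None:
        return "unknown"
    return round(end - start, 2)


# A complete status yields a number; an incomplete one yields "unknown".
print(worker_exec_time({"worker_start_tstamp": 1.0, "worker_end_tstamp": 3.5}))
print(worker_exec_time({"worker_start_tstamp": 1.0}))
```

This only hides the symptom: a status update that arrives without the key is still processed, so the ordering problem between the job monitor and late status updates remains.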