Closed japerry911 closed 3 months ago
Encountered this again this morning, posting a logs image for reference.
Hi, can you download the logs (there is a button for that in the UI) so we have the stack trace? Alternatively, you can take a screenshot with the log level set to TRACE, but a file is easier to investigate.
kestra-execution-20240709102701-7n3O45KUDakQpyEvMIi5Eq-6WBiFnpB5TqvlWBVTWb8uk.log
Attached are the downloaded logs for the affected task.
Thanks @japerry911, I'll see if we can add a retry at this line.
Hi @japerry911, I implemented a simple retry (3x, separated by 10s). I'm preparing a backport for 0.17, so it will be in the next release.
Thank you @loicmathieu, that's perfect! We really appreciate it 🚀
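For readers following along, here is a minimal sketch of what a "3x, separated by 10s" retry around a polling call could look like. It is not the actual Kestra change: `call_with_retry`, `DeadlineExceeded`, and the `sleep` parameter are hypothetical names used for illustration, and "3x" is assumed here to mean up to three attempts in total.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical stand-in for a transient DEADLINE_EXCEEDED API error."""


def call_with_retry(fn, attempts=3, delay_seconds=10.0,
                    retry_on=(DeadlineExceeded,), sleep=time.sleep):
    """Call fn(), retrying on the given exception types.

    Makes at most `attempts` calls, waiting `delay_seconds` between them.
    The last failure is re-raised once the attempts are exhausted.
    `sleep` is injectable so the delay can be skipped in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay_seconds)
```

A polling loop would wrap only the status call, for example `call_with_retry(lambda: get_job_status(job_id))`, where `get_job_status` is again a placeholder for whatever performs the GetJob request. Errors outside `retry_on` propagate immediately, so genuine failures are not hidden.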
I am convinced that the error below comes from the Batch Task Runner receiving a DEADLINE_EXCEEDED error and not retrying the GetJob API call (the API page in GCP shows one error at the same time this came up; see image).
Is it possible to check if there are retries for Task Runner when it is polling, and if there are, that DEADLINE_EXCEEDED is retried?
This is the first time this has happened (though it happened at the very end of one of our long jobs), so I figured it would be an easy patch to prevent it from happening again.
Let me know if you need any more detail. Thank you, team.
Environment