kestra-io / plugin-gcp

Apache License 2.0

GCP Batch Task Runner does not retry DEADLINE_EXCEEDED but instead fails #418

Closed. japerry911 closed this issue 3 months ago.

japerry911 commented 4 months ago

I am convinced that the error below comes from the Batch Task Runner receiving a DEADLINE_EXCEEDED error from the GetJob API endpoint and not retrying the call. The API page in GCP shows one error at the same time this came up (see image).

[Screenshot: API page in the GCP console]

2024-07-03 03:07:54.883 io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 59.999868834s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=batch.googleapis.com/<ip>:443]]]
2024-07-03 03:07:54.883 DEADLINE_EXCEEDED: deadline exceeded after 59.999868834s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[remote_addr=batch.googleapis.com/<ip>:443]]]

Could you check whether the Task Runner retries its API calls while polling the job and, if it does, whether DEADLINE_EXCEEDED is among the errors it retries?
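To make the question concrete, here is a minimal sketch of a polling loop that treats DEADLINE_EXCEEDED on a single status call as transient instead of failing the task. Everything in it (`JobClient`, `getJobState`, `RpcException`, the `Code` and `JobState` enums, the `scripted` fake client) is a hypothetical stand-in, not the Kestra or Google Cloud Batch client API; real code would inspect the `io.grpc.StatusRuntimeException` shown in the log above.

```java
import java.time.Duration;

/**
 * Sketch of a job-status polling loop that tolerates DEADLINE_EXCEEDED.
 * All types are hypothetical stand-ins, not the Kestra or GCP client API.
 */
public class PollingSketch {

    // Hypothetical subset of gRPC status codes.
    enum Code { DEADLINE_EXCEEDED, PERMISSION_DENIED }

    // Hypothetical stand-in for io.grpc.StatusRuntimeException.
    static class RpcException extends RuntimeException {
        final Code code;

        RpcException(Code code) {
            super(code.name());
            this.code = code;
        }
    }

    // Hypothetical job states; SUCCEEDED and FAILED are terminal.
    enum JobState { QUEUED, RUNNING, SUCCEEDED, FAILED }

    // Hypothetical client wrapping a GetJob-style status call.
    interface JobClient {
        JobState getJobState(String jobName);
    }

    /**
     * Polls until the job reaches a terminal state. DEADLINE_EXCEEDED on a
     * status call is skipped unless it happens more than
     * maxConsecutiveTimeouts times in a row; any other error is rethrown.
     */
    static JobState waitForJob(JobClient client, String jobName,
                               Duration pollInterval, int maxConsecutiveTimeouts) {
        int consecutiveTimeouts = 0;
        while (true) {
            try {
                JobState state = client.getJobState(jobName);
                consecutiveTimeouts = 0; // a successful call resets the counter
                if (state == JobState.SUCCEEDED || state == JobState.FAILED) {
                    return state;
                }
            } catch (RpcException e) {
                if (e.code != Code.DEADLINE_EXCEEDED) {
                    throw e; // not a timeout: fail immediately
                }
                if (++consecutiveTimeouts > maxConsecutiveTimeouts) {
                    throw e; // the API looks persistently unreachable
                }
            }
            sleep(pollInterval);
        }
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while polling", ie);
        }
    }

    // Fake client for demos: replays JobState values and throws for Code values.
    static JobClient scripted(Object... responses) {
        int[] next = {0};
        return jobName -> {
            Object r = responses[next[0]++];
            if (r instanceof Code) {
                throw new RpcException((Code) r);
            }
            return (JobState) r;
        };
    }

    public static void main(String[] args) {
        JobClient fake = scripted(JobState.RUNNING, Code.DEADLINE_EXCEEDED, JobState.SUCCEEDED);
        System.out.println(waitForJob(fake, "jobs/example", Duration.ZERO, 3)); // SUCCEEDED
    }
}
```

Capping consecutive timeouts is a design choice in this sketch: a single slow status call near the end of a long job no longer fails the task, while an API that stays unreachable still surfaces as an error instead of hanging the poll forever.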

This is the first time this has happened, but it happened at the very end of one of our long jobs. I figured it would be an easy patch to prevent it from happening again.

Let me know if you need any more detail. Thank you, team.

Environment

japerry911 commented 3 months ago
[Screenshot 2024-07-09 at 8.10.37 AM: task logs]

Encountered this again this morning; posting a screenshot of the logs for reference.

loicmathieu commented 3 months ago

Hi, can you download the logs (there is a button for that in the UI) so we have the stack trace? Alternatively, you can take a screenshot with the log level set to TRACE, but a file is easier to investigate.

japerry911 commented 3 months ago

kestra-execution-20240709102701-7n3O45KUDakQpyEvMIi5Eq-6WBiFnpB5TqvlWBVTWb8uk.log

Attached are the downloaded logs for the affected task.

loicmathieu commented 3 months ago

Thanks @japerry911, I'll see if we can add a retry at this line.

loicmathieu commented 3 months ago

Hi @japerry911, I implemented a simple retry (3 times, 10 seconds apart). I'm preparing a backport to 0.17, so it will be in the next release.
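A fixed-delay retry like the one described can be sketched as a small helper. This is an illustration under assumptions, not the code merged into plugin-gcp: `FixedDelayRetry.call` and its parameters are hypothetical, and reading "3 times" as three attempts in total is a guess. Whether an error is retried is left to a predicate; with grpc-java on the classpath, one could pass something like `e -> Status.fromThrowable(e).getCode() == Status.Code.DEADLINE_EXCEEDED`.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

/**
 * Fixed-delay retry helper. An illustrative sketch, not the code merged into
 * plugin-gcp; names and parameters are hypothetical.
 */
public final class FixedDelayRetry {

    private FixedDelayRetry() {}

    /**
     * Invokes action up to maxAttempts times, waiting delay between attempts.
     * A RuntimeException is retried only if isRetryable accepts it; otherwise,
     * or once the attempts are used up, it is rethrown to the caller.
     */
    public static <T> T call(Supplier<T> action, int maxAttempts, Duration delay,
                             Predicate<RuntimeException> isRetryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !isRetryable.test(e)) {
                    throw e; // last attempt used, or error is not worth retrying
                }
                sleep(delay);
            }
        }
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted between retries", ie);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical GetJob-style call that times out twice, then succeeds.
        String state = call(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("DEADLINE_EXCEEDED");
            }
            return "SUCCEEDED";
        }, 3, Duration.ofMillis(100), // delay shortened from 10 s for the demo
           e -> "DEADLINE_EXCEEDED".equals(e.getMessage()));
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

A fixed delay keeps the behavior easy to reason about for a status poll; exponential backoff is the usual alternative when many clients might retry against the same endpoint at once.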

japerry911 commented 3 months ago

Thank you, @loicmathieu, that's perfect! We really appreciate it 🚀