Open programmerq opened 9 months ago
Another instance of this same "context deadline exceeded" happened, but the event-handler didn't exit. It simply stopped handling events and required manual intervention.
This was with 14.2.3 connected to a 14.3.3 cloud backend.
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO Error ingesting Audit Log err:rpc error: code = Unknown desc = context deadline exceeded event-handler/events_job.go:96
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO Watcher event loop failed error:[
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERROR REPORT:
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: Stack Trace:
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEvents
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:293 main.(*TeleportEventsWatcher).Events.func1
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: User Message: rpc error: code = Unknown desc = context deadline exceeded] event-handler/events_job.go:81
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO Failed to acquire semaphore error:[context canceled] event-handler/session_events_job.go:91
Jan 25 10:50:04 forwarder teleport-event-handler[78991]: INFO Attempting graceful shutdown... lib/signals.go:46
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: INFO Graceful shutdown failed. Trying fast shutdown... lib/signals.go:48
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: ERRO Terminating with fatal error [1]... error:[
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: ERROR REPORT:
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: Stack Trace:
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEvents
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:293 main.(*TeleportEventsWatcher).Events.func1
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: User Message: rpc error: code = Unknown desc = context deadline exceeded] lib/bail.go:28
Might be mitigated by @tigrato's https://github.com/gravitational/teleport/pull/42089
Current Behavior
If the event watcher doesn't receive a response for long enough, the event handler will log a fatal "context deadline exceeded" error and exit.
Expected Behavior
A "context deadline exceeded" error message and stack trace is not a strong indicator of the underlying cause for the issue. A better error should be displayed, and "context deadline exceeded" should not appear in user-facing application logs.
In this particular case, the event handler experienced this error on a Teleport cloud backend that appears to have had the Athena query take a bit too long. These errors appear to happen if the query takes longer than a minute.
A better behavior for this particular condition would be for the event handler to identify the cause for the context deadline exceeded. If it is a case where the backend just took a bit longer than ideal, (and not some other cause like network connectivity), it should wait a bit and retry.
Not every teleport-event-handler instance has automatic restarts available. In environments where it is set to automatically restart, a wait and retry is essentially what happens from the point of view of the auth service. The event handler could just take care of that instead of doing a fatal panic and exiting. It might not even warrant more than a DEBUG level log that it is doing a single retry.