gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0
17.6k stars 1.76k forks source link

Event handler should retry with slow backend response #42422

Open programmerq opened 9 months ago

programmerq commented 9 months ago

Current Behavior

If the event watcher doesn't receive a response for long enough, the event handler will log a fatal "context deadline exceeded" error and exit.

Jan 23 23:51:15 myhost3 teleport-event-handler[1199050]: ERRO   Session ingestion failed err:<nil> id:3fad420a-2bb4-bcb8-89d6-68973b2a1cdb index:0 event-handler/session_events_job.go:121
Jan 23 23:52:58 myhost3 teleport-event-handler[1199050]: ERRO   Session ingestion failed err:<nil> id:0969deeb-b0c4-d9b9-6fab-f80414c166fa index:0 event-handler/session_events_job.go:121
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: ERRO   Error ingesting Audit Log err:rpc error: code = Unknown desc = context deadline exceeded event-handler/events_job.go:96
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: ERRO   Watcher event loop failed error:[
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: ERROR REPORT:
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: Stack Trace:
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEve>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:262 main.(*TeleportEventsWatcher).Events>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: User Message: rpc error: code = Unknown desc = context deadline exceeded] event-handler/events_job.go:81
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: ERRO   Terminating with fatal error [1]... error:[
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: ERROR REPORT:
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: Stack Trace:
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEve>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:262 main.(*TeleportEventsWatcher).Events>
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]:         /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 24 00:58:55 myhost3 teleport-event-handler[1199050]: User Message: rpc error: code = Unknown desc = context deadline exceeded] lib/bail.go:28

Expected Behavior

A "context deadline exceeded" error message and stack trace is not a strong indicator of the underlying cause for the issue. A better error should be displayed, and "context deadline exceeded" should not appear in user-facing application logs.

In this particular case, the event handler experienced this error on a Teleport cloud backend that appears to have had the Athena query take a bit too long. These errors appear to happen if the query takes longer than a minute.

A better behavior for this particular condition would be for the event handler to identify the cause for the context deadline exceeded. If it is a case where the backend just took a bit longer than ideal, (and not some other cause like network connectivity), it should wait a bit and retry.

Not every teleport-event-handler instance has automatic restarts available. In environments where it is set to automatically restart, a wait and retry is essentially what happens from the point of view of the auth service. The event handler could just take care of that instead of doing a fatal panic and exiting. It might not even warrant more than a DEBUG level log that it is doing a single retry.

programmerq commented 9 months ago

Another instance of this same "context deadline exceeded" happened, but the event-handler didn't exit. It simply stopped handling events and required manual intervention.

This was with 14.2.3 connected to a 14.3.3 cloud backend.

Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO   Error ingesting Audit Log err:rpc error: code = Unknown desc = context deadline exceeded event-handler/events_job.go:96
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO   Watcher event loop failed error:[
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERROR REPORT:
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: Stack Trace:
Jan 24 16:28:36 forwarder teleport-event-handler[78991]:         /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
Jan 24 16:28:36 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEvents
Jan 24 16:28:36 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 24 16:28:36 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:293 main.(*TeleportEventsWatcher).Events.func1
Jan 24 16:28:36 forwarder teleport-event-handler[78991]:         /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: User Message: rpc error: code = Unknown desc = context deadline exceeded] event-handler/events_job.go:81
Jan 24 16:28:36 forwarder teleport-event-handler[78991]: ERRO   Failed to acquire semaphore error:[context canceled] event-handler/session_events_job.go:91
Jan 25 10:50:04 forwarder teleport-event-handler[78991]: INFO   Attempting graceful shutdown... lib/signals.go:46
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: INFO   Graceful shutdown failed. Trying fast shutdown... lib/signals.go:48
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: ERRO   Terminating with fatal error [1]... error:[
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: ERROR REPORT:
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: Original Error: *status.Error rpc error: code = Unknown desc = context deadline exceeded
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: Stack Trace:
Jan 25 10:50:09 forwarder teleport-event-handler[78991]:         /go/pkg/mod/github.com/gravitational/teleport/api@v0.0.0-20231214223850-22e50b45420e/client/client.go:2208 github.com/gravitational/teleport/api/client.(*Client).SearchUnstructuredEvents
Jan 25 10:50:09 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:225 main.(*TeleportEventsWatcher).getEvents
Jan 25 10:50:09 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:153 main.(*TeleportEventsWatcher).fetch
Jan 25 10:50:09 forwarder teleport-event-handler[78991]:         /go/src/github.com/gravitational/teleport-plugins/event-handler/teleport_events_watcher.go:293 main.(*TeleportEventsWatcher).Events.func1
Jan 25 10:50:09 forwarder teleport-event-handler[78991]:         /usr/local/go/src/runtime/asm_amd64.s:1650 runtime.goexit
Jan 25 10:50:09 forwarder teleport-event-handler[78991]: User Message: rpc error: code = Unknown desc = context deadline exceeded] lib/bail.go:28
hugoShaka commented 5 months ago

Might be mitigated by @tigrato's https://github.com/gravitational/teleport/pull/42089