max-rocket-internet / k8s-event-logger

Watches k8s cluster events and logs them to stdout in JSON
GNU General Public License v3.0

RFC: Implement a follow mode based on resource version #34

Closed: mmckeen closed this issue 11 months ago

mmckeen commented 1 year ago

Right now my understanding is that this will drop events across logger pod restarts, since the informer only accepts new API events; it will also fail to log historical events from before the pod started up.

Right now, my understanding is that this will log duplicate events across pod restarts.

One way we could solve this is to make use of a persisted ResourceVersion (probably stored in a ConfigMap) and start watching events from that version when the pod starts up and has access to it, or to log all events present in the cache to make sure historical events have been logged.
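
Roughly, the persistence half could look something like this with client-go (the ConfigMap name, data key, and helper names below are just placeholders, not anything that exists in this project):

```go
// Sketch: load/save the last-seen ResourceVersion in a ConfigMap.
// The ConfigMap name "k8s-event-logger-state" and key "lastResourceVersion"
// are hypothetical.
package state

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	cmName = "k8s-event-logger-state"
	cmKey  = "lastResourceVersion"
)

// LoadResourceVersion returns the persisted ResourceVersion, or "" if the
// ConfigMap does not exist yet (i.e. fall back to the current behaviour).
func LoadResourceVersion(ctx context.Context, cs kubernetes.Interface, ns string) (string, error) {
	cm, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, cmName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return cm.Data[cmKey], nil
}

// SaveResourceVersion upserts the ConfigMap with the latest ResourceVersion
// seen by the logger.
func SaveResourceVersion(ctx context.Context, cs kubernetes.Interface, ns, rv string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: cmName, Namespace: ns},
		Data:       map[string]string{cmKey: rv},
	}
	_, err := cs.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
	if apierrors.IsNotFound(err) {
		_, err = cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	}
	return err
}
```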

For exceptionally active clusters, I expect this to massively reduce duplicate logged events, save on log volume, and make debugging from logs easier due to less noise.

nathanmcgarvey-modopayments commented 1 year ago

Since events age off fairly rapidly anyway, I'm not sure that historical event logging will ever be 100% guaranteed (e.g. the k8s default TTL is 1 hour: https://github.com/kubernetes/kubernetes/blob/7581ae812327fc8218204f678143a6f116cad931/pkg/controlplane/apiserver/options/options.go#L104).

Also, I haven't dug into the internals, but I actually get a few historical events on pod startup in my local development cluster for some reason.

It might be possible to use https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions with cache.NewFilteredListWatchFromClient() instead of the current cache.NewListWatchFromClient(), with ResourceVersionMatch set to ResourceVersionMatchNotOlderThan and the resource version stored somewhere persistent, as you suggested.
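
A minimal sketch of that idea, assuming the resource version has already been loaded from wherever it's persisted (persistedRV below is a placeholder, and a real implementation would need more care around relists):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Assumption: this value was previously persisted (e.g. read back from a
	// ConfigMap). An empty string would mean "no state yet, start fresh".
	persistedRV := "123456"

	// Filtered variant of NewListWatchFromClient: the options modifier lets us
	// resume the initial list from the persisted ResourceVersion.
	lw := cache.NewFilteredListWatchFromClient(
		clientset.CoreV1().RESTClient(),
		"events",
		metav1.NamespaceAll,
		func(options *metav1.ListOptions) {
			// Only touch list requests: resourceVersionMatch is not accepted on
			// watch requests, and the reflector manages ResourceVersion for its
			// own watches and relists. (A real implementation would probably
			// apply this only to the very first list.)
			if options.Watch || persistedRV == "" {
				return
			}
			options.ResourceVersion = persistedRV
			options.ResourceVersionMatch = metav1.ResourceVersionMatchNotOlderThan
		},
	)

	// Log each event seen by the informer; a real logger would emit JSON and
	// periodically persist the latest ResourceVersion.
	_, controller := cache.NewInformer(lw, &corev1.Event{}, 0, cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			e := obj.(*corev1.Event)
			fmt.Printf("%s/%s: %s\n", e.Namespace, e.Name, e.Message)
		},
	})
	controller.Run(make(chan struct{}))
}
```

Note the guard in the options modifier: as far as I can tell it is applied to both the list and the watch calls, and the API server rejects resourceVersionMatch on watch requests, so only the initial list should be adjusted.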

Out of curiosity, what pod restarts would cause a new pod not to be started before the old one is terminated (aside from someone manually running kubectl delete pod <>)? I was under the impression that rollout restarts and pod movement between nodes default to starting the new pod first, then terminating the old one once the new one is running.

Edit: Since this is an RFC: I chose this project because of its one-source-file, nearly-dependency-free simplicity, with a built-in Helm chart that is also minimal and easily inspectable. Adding things like ConfigMaps for state synchronization starts making things notably more complex (and harder to test/verify). I personally (in my use case) don't mind missing a couple of events here and there, since pod restarts are nearly zero, and restarts are so quick (thanks to the fast startup time and tiny image) that very little is missed. I don't have any strong preferences either way, but I lean toward simplicity and predictability over covering corner cases for collecting events that are already ephemeral and informative rather than full audit logging.

mmckeen commented 1 year ago

I think you're right about the past events remaining around (I need to do my own testing).

Given that, the main motivation for the use of watching based on ResourceVersionMatchNotOlderThan would be to limit the duplicate logging of events as much as possible.

I see your point about dropped events not being a problem, so the main motivation for this RFC would then be to ensure that we aren't duplicating logs across pod restarts beyond the last persisted ResourceVersion.

Worth a discussion with @max-rocket-internet about taking on a ConfigMap dependency, but personally that feels worth it.

I'm quite interested in how many duplicate logs we'd expect to save by using ResourceVersion versus just relying on the default 1-hour event retention. My guess would be that in extremely active clusters it could be quite significant.

max-rocket-internet commented 1 year ago

I personally (in my use case) don't mind missing a couple of events here and there, since pod restarts are nearly zero

I feel more in this direction. This project is so simple and that's the main attraction IMO. Also, in my experience, pod restarts are quite rare.

Maybe let's leave this open and see if others have strong feelings about missed/duplicated events from pod restarts, and whether the significant increase in complexity is worth it 🙂

max-rocket-internet commented 1 year ago

I think you're right about the past events remaining around (I need to do my own testing).

Also, I haven't dug into the internals, but I actually get a few historical events on pod startup in my local development cluster for some reason.

In my experience, these areas around events have always been a bit "flaky"; this issue, for example.

max-rocket-internet commented 11 months ago

I'm gonna close this due to inactivity