sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Remove eventd, implement its business logic in agentd #1347

Closed: echlebek closed this issue 5 years ago

echlebek commented 6 years ago

As events scale, eventd becomes unable to keep up.

Eventd is a fixed-size pipeline that doesn't scale as we add events to the system. As the number of events increases, the likelihood of an event being handled in a timely manner decreases.

Another problem is that because eventd dispatches events from a single channel to multiple goroutines, events can be processed out of order, depending on what the Go scheduler does. This can cause issues with check history, which is constructed from prior events.

Since eventd's handler performs I/O, agentd will need to perform that I/O out-of-band. That is, agentd should not block incoming events while it is processing I/O. Agentd will therefore need to spawn goroutines, or manage a pool of goroutines, to do this.

Events should be linearized per-entity. That is, events should be processed in the order they are received. Metrics events can and should be processed separately from check events, however. Since metrics require less processing than checks, and can be far more frequent, metrics should be processed in a separate goroutine from checks.
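One way to sketch this in Go is to hash each entity onto a fixed worker goroutine, so that one entity's check events are always handled by the same goroutine in arrival order, while metrics take a separate channel. This is just an illustration of the linearization idea above; the `Event` shape and function names are hypothetical, not sensu-go's actual types:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Event is a minimal stand-in for a Sensu event (hypothetical shape).
type Event struct {
	Entity   string
	Check    string
	IsMetric bool
}

// workerFor hashes an entity name onto one of n worker goroutines.
// Routing all of an entity's check events to the same worker preserves
// per-entity ordering without a single global channel.
func workerFor(entity string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(entity))
	return int(h.Sum32()) % n
}

// dispatch fans events out to per-worker channels. Metrics events go to a
// dedicated channel so cheap, high-frequency metrics never queue behind
// check-result I/O.
func dispatch(events <-chan Event, workers []chan Event, metrics chan<- Event) {
	for e := range events {
		if e.IsMetric {
			metrics <- e
			continue
		}
		workers[workerFor(e.Entity, len(workers))] <- e
	}
}

func main() {
	fmt.Println(workerFor("web-01", 4) == workerFor("web-01", 4)) // stable routing
}
```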

grepory commented 5 years ago

There are a handful of places where we currently route things to eventd. For the sake of discussion, "the pipeline" below is the filter-mutate-handle workflow as found in pipelined.

- apid
- agentd
- keepalived

In the case of agentd, the form of the event dictates what happens when an event is handled. If the event has a check named keepalive, then the event is first passed through keepalived (resetting a timer), then through eventd, then through the pipelined.

If the event has a check with any other name, it goes through eventd, and then through the pipelined.

If the event has no check, but has metrics, then it is passed directly to the pipelined (via eventd).
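The three routing cases above can be captured in a small sketch; the `Event` fields and the returned component names are illustrative stand-ins for the real types:

```go
package main

import "fmt"

// Event mirrors the routing-relevant fields of a Sensu event (hypothetical).
type Event struct {
	CheckName  string // empty when the event carries no check
	HasMetrics bool
}

// route returns the components an event passes through, per the description
// above: keepalive checks go through keepalived first, other checks go
// straight to eventd, and check-less metrics events reach pipelined via
// eventd without being stored.
func route(e Event) []string {
	switch {
	case e.CheckName == "keepalive":
		return []string{"keepalived", "eventd", "pipelined"}
	case e.CheckName != "":
		return []string{"eventd", "pipelined"}
	case e.HasMetrics:
		return []string{"eventd", "pipelined"} // eventd only forwards; no storage
	default:
		return nil // neither a check nor metrics: nothing to do
	}
}

func main() {
	fmt.Println(route(Event{CheckName: "keepalive"}))
}
```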

Check results

When handling check results, i.e. events with checks, the previous version of the event is fetched from Etcd. This requires network I/O (if the event is handled on a raft non-leader) or disk I/O (in certain cases on the raft leader) to retrieve the previous state of the event and update the check's history field. Retrieving the previous version of the event is also required for flap detection. So all check results are necessarily I/O bound, which prevents us from passing check results to the pipeline before I/O completes.

We also incur a penalty hitting etcd because of check silencing; this is the most expensive part of the event-handling process in eventd. It makes at a minimum 2 read requests to Etcd. The first is a call to store.GetSilencedEntriesBySubscription, which returns all silences created for the entity's subscription (at most the number of checks the entity is subscribed to). We then look for silence entries for every subscription with which the check is associated, for a total of:

2 + (# of subscriptions associated with the check)

I have filed a separate issue for this egregiousness. See #2289.
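Taking the formula at face value, the per-event read count can be expressed as a tiny helper. The breakdown into a previous-event fetch plus silence lookups is my reading of the two paragraphs above, not the actual store code:

```go
package main

import "fmt"

// etcdReadsForEvent estimates the etcd reads needed to handle one check
// event under the scheme described above: one read for the previous event
// (history and flap detection), one call along the lines of
// GetSilencedEntriesBySubscription for the entity, and one silence lookup
// per subscription the check is associated with. This breakdown is an
// assumption made for illustration.
func etcdReadsForEvent(checkSubscriptions int) int {
	const previousEvent = 1
	const entitySilences = 1
	return previousEvent + entitySilences + checkSubscriptions
}

func main() {
	fmt.Println(etcdReadsForEvent(3)) // 2 + 3 = 5 reads
}
```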

Metrics

In the case of metrics, I agree that we should bypass event storage altogether and go directly to pipelined. We already do this--though the goroutine handling all events also handles events that aren't stored in Etcd. However, storage isn't the only limiting factor on how quickly we can handle an event. If the event is mutated and handled by plugins, we have to fork-exec twice. Likewise, if the event is handled by one or more extensions, we incur a network penalty (though this connection should be to localhost and therefore sub-millisecond). Regardless, that connection may time out, leaving us vulnerable to queue depth increases.

Keepalives

If an event arrives with a check named "keepalive," it is a keepalive event. In agent sessions, we also tag these events so that they're handled differently from the rest, instead of inspecting the contents of the event. We do not treat events sent to the API the same way--I believe this is a bug (see #2288). Keepalives are first fed through keepalived, which is a somewhat expensive process as well. It requires:

1. a get to see if we need to register the entity,
2. a call to etcd to send a keepalive, keeping our session alive,
3. getting a lease and sending a put,
4. attempting to delete a failing keepalive event, and
5. updating the entity's last-seen timestamp.

All told, handling a single keepalive event requires (I think) 5 round-trips to Etcd. (#2290 has been filed to see if we can improve performance here.)

Once the keepalive has been handled, it's then sent to eventd and receives the same treatment that normal events do.
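The round-trips enumerated above can be named in an interface sketch. The method names here are invented for illustration and do not match sensu-go's actual store API:

```go
package main

import "fmt"

// keepaliveStore names the five store operations counted above as etcd
// round-trips; method names are hypothetical, not the real sensu-go store.
type keepaliveStore interface {
	GetEntity(name string) (bool, error)      // 1: do we need to register?
	RefreshKeepalive(name string) error       // 2: keep the session alive
	PutWithLease(name string) error           // 3: get a lease and send a put
	DeleteFailingKeepalive(name string) error // 4: clear any failing keepalive event
	UpdateLastSeen(name string) error         // 5: entity last-seen timestamp
}

// handleKeepalive performs the round-trips in order and returns how many
// store calls were made, matching the count in the discussion above.
func handleKeepalive(s keepaliveStore, entity string) (int, error) {
	calls := 0
	if _, err := s.GetEntity(entity); err != nil {
		return calls, err
	}
	calls++
	for _, op := range []func(string) error{
		s.RefreshKeepalive, s.PutWithLease, s.DeleteFailingKeepalive, s.UpdateLastSeen,
	} {
		if err := op(entity); err != nil {
			return calls, err
		}
		calls++
	}
	return calls, nil
}

// fakeStore counts calls so we can confirm the round-trip total.
type fakeStore struct{ n int }

func (f *fakeStore) GetEntity(string) (bool, error)      { f.n++; return true, nil }
func (f *fakeStore) RefreshKeepalive(string) error       { f.n++; return nil }
func (f *fakeStore) PutWithLease(string) error           { f.n++; return nil }
func (f *fakeStore) DeleteFailingKeepalive(string) error { f.n++; return nil }
func (f *fakeStore) UpdateLastSeen(string) error         { f.n++; return nil }

func main() {
	calls, _ := handleKeepalive(&fakeStore{}, "web-01")
	fmt.Println(calls) // 5
}
```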

Linearizability

We unfortunately cannot simply spin off a goroutine for every event the agent sends us. This does not guarantee linearizability, as the order in which goroutines are spawned is not an indicator of the order in which they will write to Etcd. The only way to guarantee linearizability is to have a single goroutine handling all events as they come in. Ultimately, I do not believe that linearizability is a guarantee we can make. We allow the submission of events through multiple channels, and there's no guarantee that the sender emits events in the correct order. When you also take into account the API, and that multiple backends could receive events via the API, there is no possible way to guarantee a total ordering of events sent to Sensu.

At best, we can guarantee that events sent via the agent web socket are linearizable, but that will require that we synchronously write to etcd as we receive check results. We can handle metrics in a separate goroutine, but we are still limited to a pool of goroutines for pipeline execution, and a single goroutine for handling check results. This does not guarantee a total ordering of events, but we can document the operational constraint that if you submit check results for the same entity via both the agent web socket and the API, we make no guarantee about the order of operations. I think that's reasonable.