sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Keepalived must resolve keepalive alerts #254

Closed: grepory closed this issue 7 years ago

grepory commented 7 years ago

When a keepalive is missed, Keepalived creates a "keepalive" Event and sends it to messaging.TopicEvent. When it begins receiving keepalives again, it currently does nothing except stop sending these events.

Keepalived should track the state of keepalives. In each monitor's loop, when we enter the select case triggered by a missed keepalive (the timer firing), we must:

In the case that an agent takes too long to reconnect, and we're currently emitting keepalive events for it, a new sensu-backend must be aware of this state.

We currently do not have a mechanism for tracking this. How we approach this perhaps merits discussion, but my initial idea is:

It's likely we should simply be passing Keepalive events through Eventd (via TopicEventRaw) instead of pushing them directly to Pipelined (via TopicEvent). That way, the newly created monitor could first query Etcd, see if a keepalive event exists for this entity, and then emit a resolution event if it's currently failing.
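As a rough sketch of that idea (the Event, EventStore, and publish names below are hypothetical stand-ins, not the actual sensu-go types or the eventd/etcd write path):

```go
package keepalived

import "context"

// Event is a simplified stand-in for a Sensu keepalive check event.
type Event struct {
	Entity string
	Check  string
	Status uint32 // 0 = passing, non-zero = failing
}

// EventStore is a hypothetical read interface over the events stored in etcd.
type EventStore interface {
	// GetEvent returns nil, nil when no event exists for the entity/check pair.
	GetEvent(ctx context.Context, entity, check string) (*Event, error)
}

// startMonitor runs when an agent (re)connects. Before monitoring, it checks
// the store for a pre-existing failing keepalive and, if one exists, emits a
// passing event (through Eventd rather than straight to Pipelined) so the
// alert is resolved.
func startMonitor(ctx context.Context, store EventStore, publish func(*Event), entity string) error {
	existing, err := store.GetEvent(ctx, entity, "keepalive")
	if err != nil {
		return err
	}
	if existing != nil && existing.Status != 0 {
		publish(&Event{Entity: entity, Check: "keepalive", Status: 0})
	}
	// ... start the keepalive timer loop for this entity ...
	return nil
}
```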

grepory commented 7 years ago

Well. This was fun:

KeepaliveMonitor is a managed timer that is reset whenever the monitor observes a Keepalive event via the Update() function.
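A minimal sketch of the managed-timer idea, assuming a simple struct wrapping time.Timer; everything here other than the Update() name is an assumption rather than the actual implementation:

```go
package keepalived

import (
	"sync"
	"time"
)

// KeepaliveMonitor wraps a timer that fires when an agent's keepalive
// deadline passes without a keepalive being observed.
type KeepaliveMonitor struct {
	mu      sync.Mutex
	timeout time.Duration
	timer   *time.Timer
}

// NewKeepaliveMonitor starts the timer; expired is called if no keepalive
// arrives within timeout.
func NewKeepaliveMonitor(timeout time.Duration, expired func()) *KeepaliveMonitor {
	m := &KeepaliveMonitor{timeout: timeout}
	m.timer = time.AfterFunc(timeout, expired)
	return m
}

// Update is called for every keepalive event observed for the entity; it
// pushes the deadline out by the configured timeout.
func (m *KeepaliveMonitor) Update() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.timer.Stop() {
		m.timer.Reset(m.timeout)
	}
}
```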

When a monitor receives a keepalive event at t0, it records, and writes to etcd, the time t1 at which the keepalive will expire. At t1, our timer expires, causing a failing keepalive check to be emitted. When the timer expires, we must ensure that the recorded expiration time is the time we expect. If the time has changed, regardless of whether it has moved forward, another process has updated that timestamp and we are no longer monitoring the keepalives for this agent. At that point, the monitor should exit so that it can be collected on the next sweep by keepalived.

There is a possibility that another server's clock is skewed by exactly -(KeepaliveTimeout) relative to the server that wrote t1 to etcd, and that the agent reconnects to that skewed server at precisely the moment the timer is due to expire. In that case, the value stored in etcd would be identical to the one we had in memory, so it is not enough to know that the stored time matches what we have in memory. Instead, we simply store the ID of the backend responsible for monitoring this particular entity at the time the monitor is created. If, when we go to alert, the backend ID has changed, we exit immediately.
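A sketch of that ownership check; the OwnerStore interface and the idea of a per-entity owner key are assumptions about storage, not the real sensu-go schema:

```go
package keepalived

import "context"

// OwnerStore is a hypothetical view of the per-entity ownership record in etcd.
type OwnerStore interface {
	// GetOwner returns the backend ID currently recorded for the entity.
	GetOwner(ctx context.Context, entity string) (string, error)
	// SetOwner records this backend as the entity's monitor owner.
	SetOwner(ctx context.Context, entity, backendID string) error
}

// onTimerExpired decides whether this backend is still responsible for the
// entity. Comparing backend IDs avoids the clock-skew hazard of comparing
// expiration timestamps.
func onTimerExpired(ctx context.Context, store OwnerStore, entity, myBackendID string, alert func() error) error {
	owner, err := store.GetOwner(ctx, entity)
	if err != nil {
		return err
	}
	if owner != myBackendID {
		// Another backend has taken over monitoring; exit so keepalived can
		// collect this monitor on its next sweep.
		return nil
	}
	return alert()
}
```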

There are a number of things that can happen here.

Agent disconnects, never reconnects.

When the timer expires and we verify that the agent has not connected elsewhere (see above), we emit a failing keepalive check for the entity, which gets written to etcd.
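For illustration, emitting that failing check might look like the following, reusing the hypothetical Event shape from the earlier sketch; storeEvent and publish stand in for the etcd write path and the message bus:

```go
package keepalived

import "context"

// emitFailingKeepalive records a failing keepalive check event and hands it
// to the pipeline for alerting.
func emitFailingKeepalive(
	ctx context.Context,
	storeEvent func(context.Context, *Event) error,
	publish func(*Event),
	entity string,
) error {
	failing := &Event{Entity: entity, Check: "keepalive", Status: 1}
	if err := storeEvent(ctx, failing); err != nil {
		return err
	}
	publish(failing)
	return nil
}
```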

It's possible that immediately after we check that we are still monitoring this entity in Etcd, another backend will begin monitoring the agent. Therefore, we need to synchronize via etcd. The two backends should race for a lock for the keepalive key. If the new backend gets the lock first, then when the timer expires, we will see that the new backend is monitoring the keepalive and exit. If no other backend begins monitoring the entity, we will get the lock and alert. If we get the lock first, then we alert and the new backend, once it gets the lock, will resolve the alert immediately.
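A sketch of the lock race using etcd's client-side concurrency primitives; the key prefix is an assumption, and the import paths shown are the current etcd client paths, which may differ from what sensu-go vendored at the time:

```go
package keepalived

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// withKeepaliveLock serializes the "alert" and "take over monitoring" paths
// for a single entity so that only one backend acts on its keepalive at a time.
func withKeepaliveLock(ctx context.Context, cli *clientv3.Client, entity string, fn func(context.Context) error) error {
	session, err := concurrency.NewSession(cli)
	if err != nil {
		return err
	}
	defer session.Close()

	// Hypothetical key layout for the per-entity keepalive lock.
	mu := concurrency.NewMutex(session, "/sensu/locks/keepalives/"+entity)
	if err := mu.Lock(ctx); err != nil {
		return err
	}
	defer mu.Unlock(ctx)

	// Inside the lock: re-check ownership in etcd, then either alert (the old
	// backend) or resolve and take over (the new backend).
	return fn(ctx)
}
```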

Agent disconnects and reconnects before keepalive times out--no failing keepalive event is sent.

In the case that the same backend process begins monitoring:

The Monitor will not know that anything has happened, because its state is independent of the agent's connection state. However, given that an Agent's associated Entity may change between connections (specifically with regard to keepalive configuration, since a restarted agent can have a different configuration), we must ensure that our keepalive timeouts are updated when the agent reconnects. So, whenever we receive a keepalive, we need to do timer maintenance (update the local entity pointer from the event, reset the timer duration, etc.).
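A sketch of that timer maintenance; the entityMonitor type and the Entity fields below are assumptions, not the real sensu-go structures:

```go
package keepalived

import (
	"sync"
	"time"
)

// Entity is a simplified stand-in for an agent's entity record.
type Entity struct {
	ID               string
	KeepaliveTimeout time.Duration
}

type entityMonitor struct {
	mu      sync.Mutex
	entity  *Entity
	timeout time.Duration
	timer   *time.Timer
}

// handleKeepalive refreshes the monitor's view of the entity and resets the
// timer using whatever timeout the (possibly restarted) agent now advertises.
func (m *entityMonitor) handleKeepalive(entity *Entity) {
	m.mu.Lock()
	defer m.mu.Unlock()

	m.entity = entity                   // pick up config changes made across reconnects
	m.timeout = entity.KeepaliveTimeout // the timeout itself may have changed
	if m.timer.Stop() {
		m.timer.Reset(m.timeout)
	}
}
```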

In the case that a new backend process begins monitoring:

We retrieve the entity's associated "keepalive" check event. If the agent reconnected in time, we should either have a nil event (agent has never had a keepalive timeout) or a passing event (agent reconnected before the other process emitted an alert). Either way, we know not to do anything and proceed to monitor.

Agent disconnects and reconnects after keepalive times out--failing keepalive event has been sent.

When a new monitor starts, it must first check Etcd to see if there is a failing keepalive check event. If the event exists and is failing, then we need to resolve the event.
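Combining this case with the previous one, the startup check a new monitor might perform could look like the sketch below, reusing the hypothetical Event and EventStore shapes from the first sketch:

```go
package keepalived

import "context"

// checkExistingKeepalive inspects the stored keepalive event for an entity
// when a new monitor starts, and resolves it if it is failing.
func checkExistingKeepalive(ctx context.Context, store EventStore, publish func(*Event), entity string) error {
	event, err := store.GetEvent(ctx, entity, "keepalive")
	if err != nil {
		return err
	}
	switch {
	case event == nil:
		// Agent has never had a keepalive timeout; nothing to do.
	case event.Status == 0:
		// Agent reconnected before anyone alerted; nothing to do.
	default:
		// A failing keepalive was recorded; emit a passing event to resolve it.
		publish(&Event{Entity: entity, Check: "keepalive", Status: 0})
	}
	return nil
}
```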

grepory commented 7 years ago

While I'm thinking about it, this proposal doesn't mention what happens when a sensu-backend process shuts down. Keepalived must be able to recover its state after a restart. This is particularly important for a backend that is emitting keepalive alerts for an agent when it shuts down. If the agent does not reconnect to another backend before the new sensu-backend process starts, the new process must resume sending keepalive events after the restart.

We need to take that into consideration and revisit the proposed storage.
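For discussion, a rough sketch of how a restarted backend might rebuild keepalived state from the store; ListKeepaliveEvents and the startMonitor callback are assumptions about exactly the storage this issue says still needs to be worked out, and the Event shape is reused from the first sketch:

```go
package keepalived

import "context"

// KeepaliveStore is a hypothetical interface for listing the last known
// keepalive event per entity.
type KeepaliveStore interface {
	ListKeepaliveEvents(ctx context.Context) ([]*Event, error)
}

// resumeMonitors recreates a monitor for every known entity after a restart.
// If the stored event is failing and the agent has not reconnected, the
// monitor keeps emitting keepalive alerts as it did before the restart.
func resumeMonitors(ctx context.Context, store KeepaliveStore, startMonitor func(entity string, failing bool)) error {
	events, err := store.ListKeepaliveEvents(ctx)
	if err != nil {
		return err
	}
	for _, e := range events {
		startMonitor(e.Entity, e.Status != 0)
	}
	return nil
}
```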