sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License
1.02k stars 176 forks source link

Session handles check events and stores event in etcd #37

Closed grepory closed 7 years ago

portertech commented 7 years ago

https://github.com/sensu/sensu-go/pull/55 implements this.

grepory commented 7 years ago

In working on this, I've come to a bit of an impasse.

I dislike the workflow that this implies which is:

Agent -> WebSocket -> MessageBus -> Eventd -> MessageBus -> Pipelined

The hop between Eventd and MessageBus seems unnecessary given how little Eventd is doing.

I think this is complicated by the fact that pipelined has routing logic in it.

I think I would continue along the course of the eventd approach if the routing logic came out of eventd, i.e.:

Pipelined assumes that it is supposed to handle all messages that come to it, i.e., we move from having an "events" topic in the message bus to a "check-results" topic or something, and pipelined subscribes to that.

Then we have a "metrics" topic that we use to route metrics to external TSDBs.

The routing logic for this stuff could be in eventd. Even keepalives could be routed by eventd--as I think my original backend spec included a keepalived.

portertech commented 7 years ago

In order to maintain the current behaviour and capabilities of the Sensu pipeline, pipelined must be able to handle incidents and metrics alike.

I believe adding the distinction between check incident handlers (i.e. "incident_handlers": []) and check metric handlers (i.e. "metric_handlers": []) to pipelined would allow it to better service users for the purpose of shipping metrics while staying true to the core principle of a "monitoring router". This allows pipelined to fully envelop all of the logic a Sensu user thinks about in regards to alerting and delivering metrics to a TSDB, it's just more capable than what Sensu has today. Keep disruption to the Sensu "mental model" to a minimum.

Alternatively, a "incidentd" and "metricsd" could be used to split up the pipelined logic, however, I see issues with this, primarily in regards to duplication, e.g. metrics may need to be filtered and mutated.

I see the additional backend MessageBus use as a great "hook" opportunity for Enterprise features, allowing us to inject Enterprise in at any point, building upon the Core. Of course we cannot abuse its use. Not serializing messages would make this feel better, no doubt.

It feels like nailing down the Sensu Event and its types will greatly help us reason about things.

Just some ideas... Cannot wait to discuss tomorrow!

portertech commented 7 years ago

I don't believe eventd needs to do any means of routing, only the enhancement and storage of event data. Having eventd emit the enhanced event data to the backend MessageBus allows one or more Sensu daemon to utilize it, including Enterprise only daemons.

grepory commented 7 years ago

What is/was confusing to me is that right now pipelined explicitly ignores any Events that have Metrics. That lead me to believe that you had something else in mind for those messages.

I will finish the eventd implementation, which is almost complete, then. I do need details on the flap detection algorithm, though, and what, if anything, special needs to happen for a check that's flapping. I don't really understand the concept in Sensu or Nagios.

grepory commented 7 years ago

Worth noting that this implementation will include a new topic between Sessions and Eventd.

portertech commented 7 years ago

Pipelined does not ignore events with metrics - https://github.com/sensu/sensu-go/blob/master/backend/pipelined/filter.go#L17 if incident || metrics {, but I do think pipelined needs something like "incident_handlers": [] and "metric_handlers": [] to make separation easier to configure.

grepory commented 7 years ago

I totally misread it. Sorry for the confusion.

portertech commented 7 years ago

No problem :+1: I think we need to make the call early on (now-ish) if we're going to provide a separate method of handling metrics, both internally and from the user's perspective.

grepory commented 7 years ago

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/flapping.html

and

https://github.com/sensu/sensu/blob/a27da6df7d1e9467a07b59e9e4f9d165638fb324/lib/sensu/server/process.rb#L380

grepory commented 7 years ago

We had talked about flap detection happening in eventd. Is this still the case? If so, we should talk about how we want to handle that.

grepory commented 7 years ago

OK. I know that flap detection is a requirement, but for now i'm going to say it's out of scope so that this can be finished, and we can decide what to do about flap detection in another ticket maybe.