jonhiggs / flamingzombies

A lightweight monitoring service
BSD 2-Clause "Simplified" License

It's too hard to chase the source of errors when using the `fz_error` task. #80

Open jonhiggs opened 3 months ago

jonhiggs commented 3 months ago

Work out how to make debugging easier when the tests are distributed across hosts.

jonhiggs commented 3 months ago

I've been thinking about this...

There are two competing priorities:

  1. Multiple daemons are needed on various hosts for redundancy. You don't want your one internet link to fail and prevent all notifications about the failure from being sent.
  2. To keep things simple, state isn't shared between daemons.

But that leaves the state scattered across the daemons, and it's annoyingly difficult to determine what, if anything, is down. You need to query each daemon for its state, or wait for a renotify gate to open.
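To illustrate the chore of querying each daemon and combining the answers, here is a minimal sketch. It assumes each daemon can hand back a snapshot shaped like `{task: state}` with the state strings `"ok"` and `"fail"`. That shape, those strings, and the host and task names are all hypothetical, not fz's actual format.

```python
from collections import defaultdict


def merge_states(snapshots):
    """Combine per-daemon snapshots into task -> {daemon -> state}.

    `snapshots` maps a daemon name to a {task: state} dict. This shape is
    an assumption for illustration, not fz's real state format.
    """
    merged = defaultdict(dict)
    for daemon, tasks in snapshots.items():
        for task, state in tasks.items():
            merged[task][daemon] = state
    return dict(merged)


def failing_tasks(merged, ok_state="ok"):
    """Return only the tasks that at least one daemon reports as not ok."""
    failing = {}
    for task, by_daemon in merged.items():
        bad = {d: s for d, s in by_daemon.items() if s != ok_state}
        if bad:
            failing[task] = bad
    return failing


# Hypothetical snapshots gathered from two daemons.
snapshots = {
    "host-a": {"ping_gateway": "ok", "disk_free": "fail"},
    "host-b": {"ping_gateway": "ok", "disk_free": "ok"},
}

merged = merge_states(snapshots)
down = failing_tasks(merged)
print(down)
```

Even in this toy form, something has to reach every daemon before the merge can run, which is the part that gets painful as hosts are added.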

So I've been thinking about a way to collect the state of many daemons in one place for presentation. It probably doesn't need to be a super-reliable path, but if it were, hanging notifications off that would have great delivery guarantees.

The state needs to go to a database which something else will present. Some options I've been considering are:

I'm thinking OpenTelemetry is the best option. It opens up a lot of possibilities. I think I'll experiment with fz -> OTLP -> OTel Collector -> ClickHouse -> Metabase.
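As a rough sketch of what the first hop in that pipeline might carry, the snippet below builds an OTLP/JSON metrics payload for a single task state using only the standard library. The metric name `fz.task.state`, the scope name, the `task` attribute, and the integer encoding of the state are all assumptions made for illustration. Nothing here reflects how fz would actually model its state.

```python
import json
import time


def task_state_to_otlp(daemon, task, state_value, now_ns=None):
    """Build an OTLP/JSON metrics payload holding one gauge data point.

    The metric name, scope name, and attribute keys are hypothetical.
    OTLP/JSON encodes 64-bit integers as strings, hence the str() calls.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "fz"}},
                {"key": "host.name", "value": {"stringValue": daemon}},
            ]},
            "scopeMetrics": [{
                "scope": {"name": "fz.example"},  # hypothetical scope name
                "metrics": [{
                    "name": "fz.task.state",  # hypothetical metric name
                    "gauge": {"dataPoints": [{
                        "attributes": [
                            {"key": "task", "value": {"stringValue": task}},
                        ],
                        "timeUnixNano": str(now_ns),
                        "asInt": str(state_value),
                    }]},
                }],
            }],
        }]
    }


# Hypothetical example: host-a reports disk_free in state 1.
payload = task_state_to_otlp(
    "host-a", "disk_free", 1, now_ns=1_700_000_000_000_000_000
)
body = json.dumps(payload)
print(body)
```

A body like this could be POSTed with `Content-Type: application/json` to an OTel Collector's OTLP/HTTP receiver at `/v1/metrics` (port 4318 by default). The Collector would then handle export to ClickHouse.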

Also, I hope it goes without saying that using OTel is completely optional. It would make larger deployments manageable, but it certainly has an associated complexity cost.