Open samuel-form3 opened 3 years ago
We now support marking up streams individually with the configuration the checks need:
$ nats s info SFO_MIRROR
...
Metadata:
io.nats.monitor.lag-critical: 100
io.nats.monitor.msgs-critical: 9000
io.nats.monitor.msgs-warn: 10000
io.nats.monitor.peer-expect: 3
io.nats.monitor.peer-lag-critical: 100
io.nats.monitor.peer-seen-critical: 5m
With these in place, the CLI now configures itself from the stream metadata:
$ nats server check stream --stream SFO_MIRROR --format text
SFO_MIRROR: OK
Status Detail
╭────────┬──────────────────────╮
│ Status │ Message │
├────────┼──────────────────────┤
│ OK │ 3 peers │
│ OK │ replicas are current │
│ OK │ replicas are active │
│ OK │ 11927 messages │
│ OK │ 0 sources │
│ OK │ 0 sources current │
│ OK │ 0 sources active │
│ OK │ Mirror SFO │
╰────────┴──────────────────────╯
Check Metrics
╭──────────────────┬────────┬──────┬────────────────────┬───────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Metric │ Value │ Unit │ Critical Threshold │ Warning Threshold │ Description │
├──────────────────┼────────┼──────┼────────────────────┼───────────────────┼──────────────────────────────────────────────────────────────────────┤
│ messages │ 11,927 │ │ 9,000 │ 10,000 │ Messages stored in the stream │
│ sources │ 0 │ │ 0 │ 0 │ Number of sources being consumed by this stream │
│ sources_lagged │ 0 │ │ 0 │ 0 │ Number of sources that are behind more than the configured threshold │
│ sources_inactive │ 0 │ │ 0 │ 0 │ Number of sources that are inactive │
│ lag │ 0 │ │ 100 │ 0 │ Number of operations this peer is behind its origin │
│ active │ 0.020 │ s │ 0 │ 0 │ Indicates if this peer is active and catching up if lagged │
╰──────────────────┴────────┴──────┴────────────────────┴───────────────────┴──────────────────────────────────────────────────────────────────────╯
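As a rough illustration of the metadata-driven configuration shown above, here is a minimal Python sketch that turns the `io.nats.monitor.*` keys into a threshold map. It is not the CLI's actual implementation: the function names are hypothetical, and the duration parsing is a simplification that only handles a single number plus `s`, `m` or `h`.

```python
import re
from datetime import timedelta

PREFIX = "io.nats.monitor."  # prefix seen in the stream metadata above

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_value(raw: str):
    """Turn '100' into an int and '5m' into a timedelta; leave anything else as text."""
    if raw.isdigit():
        return int(raw)
    match = re.fullmatch(r"(\d+)([smh])", raw)  # simplified duration syntax (assumption)
    if match:
        return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})
    return raw


def thresholds_from_metadata(metadata: dict) -> dict:
    """Collect monitor settings, keyed by the part of the key after the prefix."""
    return {
        key[len(PREFIX):]: parse_value(value)
        for key, value in metadata.items()
        if key.startswith(PREFIX)
    }


# Values copied from the SFO_MIRROR metadata above.
metadata = {
    "io.nats.monitor.lag-critical": "100",
    "io.nats.monitor.msgs-critical": "9000",
    "io.nats.monitor.msgs-warn": "10000",
    "io.nats.monitor.peer-expect": "3",
    "io.nats.monitor.peer-lag-critical": "100",
    "io.nats.monitor.peer-seen-critical": "5m",
}
thresholds = thresholds_from_metadata(metadata)
```

Keys without the prefix are ignored, so unrelated stream metadata does not affect the result.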
I could look at making the actual checks optional if that is really still needed, but it seems to me the purpose of these checks is to check.
Had a look, and it should in theory be possible to add a gathering-only behaviour. I did a lot of refactoring of the checks, and that does seem to make it easier now.
I'm also adding a Prometheus exporter where you can define these checks and they will run on every poll, if that's something of interest.
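To make the gathering-only idea concrete, the sketch below renders one gathered value as a gauge in the Prometheus text exposition format. The metric name `nats_stream_check_messages` and the `render_gauge` helper are hypothetical, not what the exporter actually emits, and label values are not escaped here.

```python
def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs; label values are assumed
    to contain no quotes, backslashes or newlines (no escaping is done).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


# The message count comes from the SFO_MIRROR check output above;
# the metric name is an assumption made for this sketch.
output = render_gauge(
    "nats_stream_check_messages",
    "Messages stored in the stream",
    [({"stream": "SFO_MIRROR"}, 11927)],
)
print(output, end="")
```

With raw values exposed like this, the thresholds can live in alerting rules rather than in the check invocation.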
Current behaviour
Currently, when we execute the nats server check command, the client expects a set of threshold flags as input so that it can report whether the server is healthy according to the provided thresholds.
Feature request
It would be useful to get the current metric values, instead of only knowing whether a threshold was exceeded.
This would enable us to create a Prometheus exporter component that exports Prometheus metrics for the whole nats server check set of commands. We would use these metrics in alerts and define the needed thresholds on the alerts themselves.
Desired behaviour
Example
Currently
Example output:
Proposed
In the previous example it would export a metric saying whether a given peer has lag according to the provided threshold flag --peer-lag-critical 100. In this example, it would just export the peer lag itself for each peer and stream pair.
This strategy could be applied to every other type of metric currently available in the tool.
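The difference between the two shapes can be sketched in Python as follows. Everything here is illustrative: the `PeerLag` type, both function names, and the stream and peer names are made up, and treating a lag equal to the threshold as critical is an assumption about how the comparison works.

```python
from dataclasses import dataclass


@dataclass
class PeerLag:
    stream: str
    peer: str
    lag: int  # operations this peer is behind


def threshold_samples(peers, critical):
    """Current shape: a 0/1 flag per (stream, peer), threshold fixed at check time."""
    # Assumption: a lag equal to the threshold counts as critical.
    return {(p.stream, p.peer): int(p.lag >= critical) for p in peers}


def raw_samples(peers):
    """Proposed shape: the lag itself, leaving thresholds to the alert rules."""
    return {(p.stream, p.peer): p.lag for p in peers}


# Made-up sample data, for illustration only.
peers = [PeerLag("ORDERS", "n1", 0), PeerLag("ORDERS", "n2", 250)]
flags = threshold_samples(peers, critical=100)
values = raw_samples(peers)
```

The flag form loses how far behind a peer is, while the raw form lets each alert choose its own threshold without rerunning the check with different flags.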
Example output:
Thanks :pray: