nats-io / natscli

The NATS Command Line Interface
Apache License 2.0

Metadata based monitoring for streams and consumers #1062

Closed ripienaar closed 1 month ago

ripienaar commented 2 months ago

Demonstrating a model that could be used to enable self-service monitoring for client assets

ripienaar commented 2 months ago

Given a stream with metadata:

Metadata:

        io.nats.monitor.lag-critical: 100
         io.nats.monitor.max-sources: 34
         io.nats.monitor.min-sources: 33
       io.nats.monitor.msgs-critical: 3000
           io.nats.monitor.msgs-warn: 4000
         io.nats.monitor.peer-expect: 1
   io.nats.monitor.peer-lag-critical: 100
  io.nats.monitor.peer-seen-critical: 5m
   io.nats.monitor.subjects-critical: 30
       io.nats.monitor.subjects-warn: 33

The nats server check command, with no threshold arguments set, will health check the stream according to these metadata values. Metadata keys correspond to CLI flags:

$ nats server check stream --stream LON --format text
LON: OK

Status Detail

╭────────┬────────────────────╮
│ Status │ Message            │
├────────┼────────────────────┤
│ OK     │ 1 current replicas │
│ OK     │ 34 sources         │
╰────────┴────────────────────╯

Check Metrics

╭──────────────────┬───────┬──────┬────────────────────┬───────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Metric           │ Value │ Unit │ Critical Threshold │ Warning Threshold │ Description                                                          │
├──────────────────┼───────┼──────┼────────────────────┼───────────────────┼──────────────────────────────────────────────────────────────────────┤
│ peers            │ 1     │      │ 1                  │ 1                 │ Configured RAFT peers                                                │
│ peer_offline     │ 0     │      │ 0                  │ 0                 │ Offline RAFT peers                                                   │
│ peer_not_current │ 0     │      │ 0                  │ 0                 │ RAFT peers that are not current                                      │
│ peer_inactive    │ 0     │      │ 0                  │ 0                 │ Inactive RAFT peers                                                  │
│ peer_lagged      │ 0     │      │ 0                  │ 0                 │ RAFT peers that are lagged more than configured threshold            │
│ messages         │ 4,580 │      │ 3,000              │ 4,000             │ Messages stored in the stream                                        │
│ subjects         │ 34    │      │ 30                 │ 33                │ Number of subjects stored in the stream                              │
│ sources          │ 34    │      │ 34                 │ 33                │ Number of sources being consumed by this stream                      │
│ sources_lagged   │ 0     │      │ 0                  │ 0                 │ Number of sources that are behind more than the configured threshold │
│ sources_inactive │ 0     │      │ 0                  │ 0                 │ Number of sources that are inactive                                  │
╰──────────────────┴───────┴──────┴────────────────────┴───────────────────┴──────────────────────────────────────────────────────────────────────╯

Note that the check metrics carry the thresholds given in the metadata.
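
In the table above, the messages and subjects metrics have a critical threshold that is lower than the warning threshold, and the check still reports OK with values above both. One plausible reading, which is an assumption and not something confirmed in this issue, is that these are "lower is worse" checks. A minimal Python sketch of that comparison follows; `evaluate_min` is a hypothetical helper, not the natscli implementation, and the boundary handling (`<=`) is also assumed.

```python
def evaluate_min(value, warn, critical):
    """Classify a "lower is worse" metric.

    Assumption: the metric is CRITICAL at or below `critical`,
    WARNING at or below `warn`, and OK otherwise.
    """
    if value <= critical:
        return "CRITICAL"
    if value <= warn:
        return "WARNING"
    return "OK"


# Values taken from the Check Metrics table in this issue.
print("messages:", evaluate_min(4580, warn=4000, critical=3000))
print("subjects:", evaluate_min(34, warn=33, critical=30))

# Hypothetical values, to show the other two outcomes.
print("messages:", evaluate_min(3500, warn=4000, critical=3000))
print("messages:", evaluate_min(2000, warn=4000, critical=3000))
```

Under this reading both table rows evaluate to OK, which matches the `LON: OK` result shown above.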

Do this in bulk, via sys req jsz for example, and full self-service monitoring can be enabled.
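
As a rough illustration of the bulk idea, the Python sketch below takes a list of stream descriptions and collects the `io.nats.monitor.*` settings for each one. The input shape (a list of dicts with `config.name` and `config.metadata`) is an assumption made for this example, as are the helper names `monitor_options` and `plan_checks` and the second stream `NYC`; the sketch does not talk to a NATS server and is not how natscli implements the check.

```python
PREFIX = "io.nats.monitor."


def monitor_options(metadata):
    """Return monitoring options with the io.nats.monitor. prefix stripped.

    Values are kept as the raw strings found in the metadata.
    """
    return {
        key[len(PREFIX):]: value
        for key, value in metadata.items()
        if key.startswith(PREFIX)
    }


def plan_checks(streams):
    """Map stream name to its monitoring options, skipping streams with none."""
    plan = {}
    for stream in streams:
        config = stream.get("config", {})
        options = monitor_options(config.get("metadata") or {})
        if options:
            plan[config["name"]] = options
    return plan


# Hypothetical input; only the LON thresholds come from this issue.
streams = [
    {"config": {"name": "LON", "metadata": {
        "io.nats.monitor.msgs-warn": "4000",
        "io.nats.monitor.msgs-critical": "3000",
        "owner": "team-a",  # unrelated key, ignored by the filter
    }}},
    {"config": {"name": "NYC", "metadata": None}},  # no monitoring metadata
]

for name, options in plan_checks(streams).items():
    print(name, options)
```

Only streams whose owners have set monitoring metadata end up in the plan, which is what makes the model self-service: asset owners opt in and choose thresholds by editing metadata on their own streams.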

/cc @bruth @wallyqs