Closed ripienaar closed 1 month ago
Given a stream with metadata:
Metadata:
io.nats.monitor.lag-critical: 100
io.nats.monitor.max-sources: 34
io.nats.monitor.min-sources: 33
io.nats.monitor.msgs-critical: 3000
io.nats.monitor.msgs-warn: 4000
io.nats.monitor.peer-expect: 1
io.nats.monitor.peer-lag-critical: 100
io.nats.monitor.peer-seen-critical: 5m
io.nats.monitor.subjects-critical: 30
io.nats.monitor.subjects-warn: 33
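All of the keys above share the io.nats.monitor. prefix. As a rough sketch of how such metadata could be turned into check options, the snippet below strips the prefix and emits one flag per key. It assumes the key suffix is exactly the flag name; `metadata_to_flags` and the `owner` key are hypothetical names used only for illustration, not part of the nats CLI.

```python
MONITOR_PREFIX = "io.nats.monitor."


def metadata_to_flags(metadata: dict) -> list:
    """Turn monitor metadata into CLI-style flags (hypothetical helper)."""
    flags = []
    for key in sorted(metadata):
        if key.startswith(MONITOR_PREFIX):
            # Assumption: the key suffix is the flag name, e.g.
            # io.nats.monitor.msgs-warn -> --msgs-warn
            suffix = key[len(MONITOR_PREFIX):]
            flags.append(f"--{suffix}={metadata[key]}")
    return flags


stream_metadata = {
    "io.nats.monitor.lag-critical": "100",
    "io.nats.monitor.msgs-warn": "4000",
    "owner": "team-a",  # hypothetical unrelated key, ignored by the helper
}
print(metadata_to_flags(stream_metadata))
```

Keys without the prefix are skipped, so application metadata can live next to the monitoring thresholds.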
The nats server check command, with no threshold arguments set, will health check the stream according to these metadata values. Metadata keys correspond to CLI flags:
$ nats server check stream --stream LON --format text
LON: OK
Status Detail
╭────────┬────────────────────╮
│ Status │ Message            │
├────────┼────────────────────┤
│ OK     │ 1 current replicas │
│ OK     │ 34 sources         │
╰────────┴────────────────────╯
Check Metrics
╭──────────────────┬───────┬──────┬────────────────────┬───────────────────┬──────────────────────────────────────────────────────────────────────╮
│ Metric           │ Value │ Unit │ Critical Threshold │ Warning Threshold │ Description                                                          │
├──────────────────┼───────┼──────┼────────────────────┼───────────────────┼──────────────────────────────────────────────────────────────────────┤
│ peers            │ 1     │      │ 1                  │ 1                 │ Configured RAFT peers                                                │
│ peer_offline     │ 0     │      │ 0                  │ 0                 │ Offline RAFT peers                                                   │
│ peer_not_current │ 0     │      │ 0                  │ 0                 │ RAFT peers that are not current                                      │
│ peer_inactive    │ 0     │      │ 0                  │ 0                 │ Inactive RAFT peers                                                  │
│ peer_lagged      │ 0     │      │ 0                  │ 0                 │ RAFT peers that are lagged more than configured threshold            │
│ messages         │ 4,580 │      │ 3,000              │ 4,000             │ Messages stored in the stream                                        │
│ subjects         │ 34    │      │ 30                 │ 33                │ Number of subjects stored in the stream                              │
│ sources          │ 34    │      │ 34                 │ 33                │ Number of sources being consumed by this stream                      │
│ sources_lagged   │ 0     │      │ 0                  │ 0                 │ Number of sources that are behind more than the configured threshold │
│ sources_inactive │ 0     │      │ 0                  │ 0                 │ Number of sources that are inactive                                  │
╰──────────────────┴───────┴──────┴────────────────────┴───────────────────┴──────────────────────────────────────────────────────────────────────╯
Note that the check metrics carry the thresholds given in the stream metadata.
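In the output above, messages (4,580) and subjects (34) are both OK even though they exceed their thresholds, because the critical value is lower than the warning value. One way to read this is that the direction of the check follows the ordering of the two thresholds. The sketch below illustrates that reading only; `check_threshold` is a hypothetical function, and the boundary comparisons (`>=` and `<=`) are assumptions, not the CLI's confirmed behaviour.

```python
def check_threshold(value: float, warn: float, critical: float) -> str:
    """Classify a value against warn/critical thresholds (illustrative only)."""
    if critical >= warn:
        # Critical above warn: higher values are worse.
        if value >= critical:
            return "CRITICAL"
        if value >= warn:
            return "WARNING"
        return "OK"
    # Critical below warn: lower values are worse.
    if value <= critical:
        return "CRITICAL"
    if value <= warn:
        return "WARNING"
    return "OK"


# Values taken from the sample output: msgs-warn 4000, msgs-critical 3000.
print(check_threshold(4580, warn=4000, critical=3000))
# Values taken from the sample output: subjects-warn 33, subjects-critical 30.
print(check_threshold(34, warn=33, critical=30))
```

Under this reading, a stream owner who wants an alert when the message count drops too low sets the critical threshold below the warning threshold, as in the metadata shown earlier.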
Do this in bulk, for example over the output of sys req jsz, and fully self-service monitoring can be enabled: asset owners set thresholds in metadata and the checks pick them up.
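A bulk pass could look roughly like the sketch below, which walks a JSZ-style document and collects the monitor thresholds for every stream that declares any. It assumes stream configuration is nested under account_details, stream_detail, config, metadata; that layout should be checked against real server output. The account and stream names in the sample and the `monitored_streams` helper are hypothetical.

```python
import json

MONITOR_PREFIX = "io.nats.monitor."


def monitored_streams(jsz: dict) -> dict:
    """Map (account, stream) to the monitor thresholds found in its metadata."""
    found = {}
    for account in jsz.get("account_details") or []:
        for stream in account.get("stream_detail") or []:
            metadata = (stream.get("config") or {}).get("metadata") or {}
            thresholds = {
                key[len(MONITOR_PREFIX):]: value
                for key, value in metadata.items()
                if key.startswith(MONITOR_PREFIX)
            }
            if thresholds:  # streams without monitor metadata are skipped
                found[(account.get("name"), stream.get("name"))] = thresholds
    return found


# Hypothetical, heavily trimmed sample document; real responses carry more fields.
sample = json.loads("""
{
  "account_details": [
    {
      "name": "APP",
      "stream_detail": [
        {"name": "LON", "config": {"metadata": {"io.nats.monitor.msgs-warn": "4000"}}},
        {"name": "ORDERS", "config": {}}
      ]
    }
  ]
}
""")
print(monitored_streams(sample))
```

Each entry in the result could then be fed to the same check that the single-stream command performs, so stream owners opt in to monitoring just by setting metadata.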
/cc @bruth @wallyqs
Demonstrating a model that could be used to enable self-service monitoring for client assets