cor opened this issue 5 months ago
- https://docs.rs/tokio/latest/tokio/
- https://docs.rs/reqwest/latest/reqwest/
- https://docs.rs/sqlx/latest/sqlx/
- https://docs.rs/tonic/latest/tonic/ (for gRPC)
- https://docs.rs/tracing/latest/tracing/
Note that we have autogenerated gRPC code in the monorepo for galois & uniond, which can be reused in this monitoring software.
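As a very rough sketch of how these crates could fit together (the endpoint, poll interval, and the use of `tracing_subscriber` for log output are assumptions, not anything decided), the monitor could be a single tokio binary that initializes tracing and polls health endpoints with reqwest:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Structured logs via `tracing`; `tracing_subscriber` is the usual companion crate.
    tracing_subscriber::fmt::init();

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(10))
        .build()?;

    // Placeholder target; real endpoints would come from config.
    let target = "https://example.invalid/health";

    loop {
        match client.get(target).send().await {
            Ok(resp) if resp.status().is_success() => {
                tracing::info!(%target, "health check ok");
            }
            Ok(resp) => tracing::warn!(%target, status = %resp.status(), "unhealthy response"),
            Err(err) => tracing::error!(%target, %err, "health check failed"),
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```

sqlx and the autogenerated tonic clients would slot into the same loop for database and gRPC checks.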
@cor why build a status website ourselves? Normally the flow is:
Services (1) --> logs and metrics (2) --> incident response (3) --> status page (4)
It's an insanely bad idea to have 2, 3 and 4 in-house and custom, because then you need further layers to monitor those. Every major SaaS has brandable status page support, with integrations to other SaaSes.
Fully agree, let's use a SaaS for the status page instead
All of the following alerts run on queries averaged over the last 5 minutes:

- Memory: alerts when `((avg:system.mem.total{*} - avg:system.mem.usable{*}) / avg:system.mem.total{*}) * 100` exceeds 80%. This corresponds to `(MemTotal - MemAvailable) / MemTotal * 100` from `/proc/meminfo` (a local sanity check is sketched below).
- Logs: alerts on log entries at `warn` or `error` level.
- Systemd: checks whether the systemd-managed process is running or not.

For the escalation policy, if we prefer not to upgrade our plan, we can select one on-call person who receives all notifications first. If that person does not respond, Betterstack can be configured to call or notify the entire team after a specified number of minutes. To reactivate monitoring after an alert, the Resolve button for that specific monitor must be clicked; otherwise, no further alerts will be received.
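For reference, the `/proc/meminfo` calculation the memory alert mirrors can be reproduced locally. A minimal Rust sketch, where only the formula and the 80% threshold come from the alert above and the rest is illustrative:

```rust
use std::fs;

/// (MemTotal - MemAvailable) / MemTotal * 100, i.e. the same quantity the
/// memory monitor alerts on when it exceeds 80%.
fn mem_used_percent() -> Option<f64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;

    // Parse "MemTotal:       16384000 kB"-style lines into their numeric value.
    let field = |name: &str| -> Option<f64> {
        meminfo
            .lines()
            .find(|line| line.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse()
            .ok()
    };

    let total = field("MemTotal:")?;
    let available = field("MemAvailable:")?;
    Some((total - available) / total * 100.0)
}

fn main() {
    match mem_used_percent() {
        Some(p) if p > 80.0 => println!("WARN: memory usage {p:.1}% exceeds 80%"),
        Some(p) => println!("memory usage {p:.1}%"),
        None => eprintln!("could not read /proc/meminfo"),
    }
}
```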
Also added all of them to our status page.
For the following services, we need to determine the appropriate trigger cases:
For web services, a ping server might be appropriate. However, for other services like Voyager or Galois, we need to define specific conditions to detect issues. I am not entirely sure about these cases and would appreciate further input to establish suitable monitoring criteria.
For Voyager, we will expose telemetry for every chain it relays between, reporting the current height it is fetching; this can then be cross-referenced with on-chain data to see whether Voyager is caught up (a sketch of that check follows below).
Galois exposes (or can expose) a health endpoint, which can be polled as expected.
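To make the Voyager condition a bit more concrete, here is a hedged sketch of the cross-reference check; the lag threshold and the assumption that both heights are fetched elsewhere (telemetry on one side, chain RPC on the other) are placeholders, since the telemetry format isn't pinned down yet:

```rust
/// How far behind the chain head Voyager may fall before we consider it stuck.
/// Placeholder value; it would likely differ per chain.
const MAX_HEIGHT_LAG: u64 = 100;

/// Compares the height Voyager reports for a chain against the chain's actual
/// head height, both assumed to be fetched elsewhere.
fn voyager_caught_up(voyager_height: u64, chain_head_height: u64) -> bool {
    chain_head_height.saturating_sub(voyager_height) <= MAX_HEIGHT_LAG
}

fn main() {
    // Hypothetical sample values.
    let (reported, head) = (1_204_300_u64, 1_204_350_u64);
    if voyager_caught_up(reported, head) {
        println!("voyager is caught up ({} blocks behind)", head - reported);
    } else {
        println!("ALERT: voyager lagging by {} blocks", head - reported);
    }
}
```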
Shelved (old list):
Monitoring Individual Services
We should be tracking the following services with the following conditions. Services are ranked in order of priority.
Low priority / skip for now:
For all of these services, we should have a Datadog agent running on them. We should also be monitoring all important vitals (CPU / RAM / disk / network). We should set up PagerDuty such that if any of these go down, we get a call.
We should aggregate all of this on our Datadog dashboard.
Sentinel
We should also create a service that, every half hour, sends a packet between all pairs of chains we have and checks that they arrive. Ideally these test transfers are nicely spaced out over the interval.
So if we have connected chains A, B, C, D, we need to send a test transfer for every pair. Do this such that each transfer A --> B occurs every half hour, and evenly space out the X --> Y pairs rather than doing all of them at the same time (see the scheduling sketch at the end of this section). This service should be written in Rust with a NixOS module and be deployed to a dedicated machine.
The results of this should also be included in the aforementioned Datadog dashboard.
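A minimal sketch of the even spacing described above, assuming a hypothetical `send_transfer` stub in place of the real transfer and delivery-check logic:

```rust
use std::time::Duration;

/// Placeholder for sending a test packet from `from` to `to` and verifying arrival.
async fn send_transfer(from: &str, to: &str) {
    println!("sending test packet {from} --> {to}");
}

#[tokio::main]
async fn main() {
    let chains = ["A", "B", "C", "D"];

    // Every ordered pair X --> Y of distinct chains.
    let mut pairs = Vec::new();
    for &x in &chains {
        for &y in &chains {
            if x != y {
                pairs.push((x, y));
            }
        }
    }

    // Each pair fires once per half hour; stagger the pairs evenly across that
    // window instead of sending them all at once.
    let window = Duration::from_secs(30 * 60);
    let spacing = window / pairs.len() as u32;

    loop {
        for &(from, to) in &pairs {
            send_transfer(from, to).await;
            tokio::time::sleep(spacing).await;
        }
    }
}
```

With 4 chains this gives 12 ordered pairs spaced 150 seconds apart, so each individual pair still recurs every half hour.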