cor opened this issue 5 months ago
- https://docs.rs/tokio/latest/tokio/
- https://docs.rs/reqwest/latest/reqwest/
- https://docs.rs/sqlx/latest/sqlx/
- https://docs.rs/tonic/latest/tonic/ (for gRPC)
- https://docs.rs/tracing/latest/tracing/
Note that we have autogenerated gRPC code in the monorepo for galois & uniond, which can be reused in this monitoring software.
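As a very rough sketch of how these crates could fit together (the endpoint, poll interval, and the use of `tracing_subscriber` for log output are assumptions, not anything decided), the monitor could be a single tokio binary that initializes tracing and polls health endpoints with reqwest:

```rust
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Structured logs via `tracing`; `tracing_subscriber` is the usual companion crate.
    tracing_subscriber::fmt::init();

    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(10))
        .build()?;

    // Placeholder target; real endpoints would come from config.
    let target = "https://example.invalid/health";

    loop {
        match client.get(target).send().await {
            Ok(resp) if resp.status().is_success() => {
                tracing::info!(%target, "health check ok");
            }
            Ok(resp) => tracing::warn!(%target, status = %resp.status(), "unhealthy response"),
            Err(err) => tracing::error!(%target, %err, "health check failed"),
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```

sqlx and the autogenerated tonic clients would slot into the same loop for database and gRPC checks.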
@cor why build a status website ourselves? Normally the flow is:
Services (1) --> logs and metrics (2) --> incident response (3) --> status page (4)
It's an insanely bad idea to have 2, 3 and 4 in-house and custom, because then you need further layers to monitor those. Every major SaaS has brandable status page support, with integrations to other SaaSes.
Fully agree, let's use a SaaS for the status page instead
All of the following alerts run on queries averaged over the last 5 minutes:

- Memory: alerts when `((avg:system.mem.total{*} - avg:system.mem.usable{*}) / avg:system.mem.total{*}) * 100` exceeds 80%. This corresponds to `(MemTotal - MemAvailable) / MemTotal * 100` from `/proc/meminfo` (a local sanity check is sketched below).
- Logs: alerts on log entries at `warn` or `error` level.
- Systemd: checks whether the systemd-managed process is running or not.

For the escalation policy, if we prefer not to upgrade our plan, we can select one on-call person who receives all notifications first. If that person does not respond, Betterstack can be configured to call or notify the entire team after a specified number of minutes. To reactivate monitoring after an alert, the Resolve button for that specific monitor must be clicked; otherwise, no further alerts will be received.
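For reference, the `/proc/meminfo` calculation the memory alert mirrors can be reproduced locally. A minimal Rust sketch, where only the formula and the 80% threshold come from the alert above and the rest is illustrative:

```rust
use std::fs;

/// (MemTotal - MemAvailable) / MemTotal * 100, i.e. the same quantity the
/// memory monitor alerts on when it exceeds 80%.
fn mem_used_percent() -> Option<f64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;

    // Parse "MemTotal:       16384000 kB"-style lines into their numeric value.
    let field = |name: &str| -> Option<f64> {
        meminfo
            .lines()
            .find(|line| line.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse()
            .ok()
    };

    let total = field("MemTotal:")?;
    let available = field("MemAvailable:")?;
    Some((total - available) / total * 100.0)
}

fn main() {
    match mem_used_percent() {
        Some(p) if p > 80.0 => println!("WARN: memory usage {p:.1}% exceeds 80%"),
        Some(p) => println!("memory usage {p:.1}%"),
        None => eprintln!("could not read /proc/meminfo"),
    }
}
```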
Also added all of them to our status page.
For the following services, we need to determine the appropriate trigger cases:
For web services, a ping server might be appropriate. However, for other services like Voyager or Galois, we need to define specific conditions to detect issues. I am not entirely sure about these cases and would appreciate further input to establish suitable monitoring criteria.
For Voyager, we will expose telemetry for every chain it relays between, reporting the current height it is fetching; this can then be cross-referenced with on-chain data to see whether Voyager is caught up (a sketch of that check follows below).
Galois exposes (or can expose) a health endpoint, which can be polled as expected.
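To make the Voyager condition a bit more concrete, here is a hedged sketch of the cross-reference check; the lag threshold and the assumption that both heights are fetched elsewhere (telemetry on one side, chain RPC on the other) are placeholders, since the telemetry format isn't pinned down yet:

```rust
/// How far behind the chain head Voyager may fall before we consider it stuck.
/// Placeholder value; it would likely differ per chain.
const MAX_HEIGHT_LAG: u64 = 100;

/// Compares the height Voyager reports for a chain against the chain's actual
/// head height, both assumed to be fetched elsewhere.
fn voyager_caught_up(voyager_height: u64, chain_head_height: u64) -> bool {
    chain_head_height.saturating_sub(voyager_height) <= MAX_HEIGHT_LAG
}

fn main() {
    // Hypothetical sample values.
    let (reported, head) = (1_204_300_u64, 1_204_350_u64);
    if voyager_caught_up(reported, head) {
        println!("voyager is caught up ({} blocks behind)", head - reported);
    } else {
        println!("ALERT: voyager lagging by {} blocks", head - reported);
    }
}
```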
Shelved (old list):
Monitoring Individual Services
We should be tracking the following services with the following conditions. Services are ranked in order of priority.
Low priority / skip for now:
For all of these services, we should have a Datadog agent running on them. We should also be monitoring all important vitals (CPU / RAM / disk / network). We should set up PagerDuty such that if any of these go down, we get a call.
We should aggregate all of this on our Datadog dashboard.
Sentinel
We should also create a service that, every half hour, sends a packet between all pairs of chains we have and checks that they arrive. Ideally these test transfers are nicely spaced out over the interval.
So if we have connected chains A, B, C, D, we need to send a test transfer for every pair. Do this such that each transfer A --> B occurs every half hour, and evenly space out the X --> Y pairs rather than doing all of them at the same time (see the scheduling sketch at the end of this section). This service should be written in Rust with a NixOS module and be deployed to a dedicated machine.
The results of this should also be included in the aforementioned Datadog dashboard.
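A minimal sketch of the even spacing described above, assuming a hypothetical `send_transfer` stub in place of the real transfer and delivery-check logic:

```rust
use std::time::Duration;

/// Placeholder for sending a test packet from `from` to `to` and verifying arrival.
async fn send_transfer(from: &str, to: &str) {
    println!("sending test packet {from} --> {to}");
}

#[tokio::main]
async fn main() {
    let chains = ["A", "B", "C", "D"];

    // Every ordered pair X --> Y of distinct chains.
    let mut pairs = Vec::new();
    for &x in &chains {
        for &y in &chains {
            if x != y {
                pairs.push((x, y));
            }
        }
    }

    // Each pair fires once per half hour; stagger the pairs evenly across that
    // window instead of sending them all at once.
    let window = Duration::from_secs(30 * 60);
    let spacing = window / pairs.len() as u32;

    loop {
        for &(from, to) in &pairs {
            send_transfer(from, to).await;
            tokio::time::sleep(spacing).await;
        }
    }
}
```

With 4 chains this gives 12 ordered pairs spaced 150 seconds apart, so each individual pair still recurs every half hour.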