unionlabs / union

The trust-minimized, zero-knowledge bridging protocol, designed for censorship resistance, extremely high security, and usage in decentralized finance.
https://union.build
Apache License 2.0

[Tracking] Monitoring #1926

Open cor opened 5 months ago

cor commented 5 months ago

shelved:

Old list


Monitoring Individual Services

We should be tracking the following services with the following conditions. Services are ranked in order of priority.

Low priority / skip for now:

For all of these services, we should have a Datadog agent running on them. We should also be testing all important vitals (CPU/RAM/Disk/Network). We should set up PagerDuty such that if any of these go down, we get a call.

We should aggregate all of this on our Datadog dashboard.

Sentinel

We should also create a service that, every half hour, sends a packet between every pair of chains we have connected and checks whether it arrives. Ideally these transfers are nicely interspaced.

So if we have connected chains A, B, C, D

We need to send

A --> B
B --> A
A --> C
C --> A
B --> C
C --> B
...etc

Do this such that each X --> Y transfer occurs every half hour, and evenly space the pairs out across that window (rather than sending all of them at the same time).
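As a rough illustration only (not from the issue), here is a minimal tokio-based sketch of that spacing, assuming a hypothetical `ChainId` type and `send_transfer` function: each ordered pair gets its own 30-minute interval, offset so the transfers are spread evenly across the window.

```rust
use std::time::Duration;

use tokio::time::{interval_at, Instant};

/// Hypothetical chain identifier; the real service would carry RPC
/// endpoints, keys, etc. here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ChainId(&'static str);

/// Placeholder for the actual send-packet-and-confirm-arrival logic.
async fn send_transfer(from: ChainId, to: ChainId) {
    tracing::info!(?from, ?to, "sending test transfer");
}

#[tokio::main]
async fn main() {
    let chains = [ChainId("A"), ChainId("B"), ChainId("C"), ChainId("D")];

    // Every ordered pair (X, Y) with X != Y.
    let mut pairs = Vec::new();
    for &from in &chains {
        for &to in &chains {
            if from != to {
                pairs.push((from, to));
            }
        }
    }

    // Spread the pairs evenly across the 30-minute window instead of
    // firing them all at once; each pair still repeats every 30 minutes.
    let window = Duration::from_secs(30 * 60);
    let offset = window / pairs.len() as u32;

    let mut tasks = Vec::new();
    for (i, (from, to)) in pairs.into_iter().enumerate() {
        let start = Instant::now() + offset * i as u32;
        tasks.push(tokio::spawn(async move {
            let mut tick = interval_at(start, window);
            loop {
                tick.tick().await;
                send_transfer(from, to).await;
            }
        }));
    }

    for task in tasks {
        let _ = task.await;
    }
}
```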

This service should be written in Rust with a NixOS Module and be deployed to a dedicated machine.

The results of this should also be included in the aforementioned Datadog dashboard.

benluelo commented 5 months ago

Recommended crates:

https://docs.rs/tokio/latest/tokio/
https://docs.rs/reqwest/latest/reqwest/
https://docs.rs/sqlx/latest/sqlx/
https://docs.rs/tonic/latest/tonic/ (for gRPC)
https://docs.rs/tracing/latest/tracing/

Note that we have autogenerated gRPC code in the monorepo for galois & uniond, which can be reused in this monitoring software.
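For a sense of how those crates could fit together, here is a hedged sketch of a periodic HTTP health probe using tokio, reqwest, and tracing (plus tracing-subscriber for log output, an addition); the service name and URL are placeholders, not real endpoints.

```rust
use std::time::Duration;

/// Poll a service's health endpoint and log the result.
/// The URL is a placeholder; real endpoints would come from config.
async fn probe(client: &reqwest::Client, name: &str, url: &str) {
    match client.get(url).timeout(Duration::from_secs(10)).send().await {
        Ok(resp) if resp.status().is_success() => {
            tracing::info!(service = name, "healthy");
        }
        Ok(resp) => {
            tracing::warn!(service = name, status = %resp.status(), "unhealthy response");
        }
        Err(err) => {
            tracing::error!(service = name, error = %err, "probe failed");
        }
    }
}

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init();

    let client = reqwest::Client::new();
    let mut tick = tokio::time::interval(Duration::from_secs(60));

    loop {
        tick.tick().await;
        // Hypothetical endpoint; galois' health endpoint is mentioned
        // later in this thread.
        probe(&client, "galois", "http://galois.internal:9999/health").await;
    }
}
```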

KaiserKarel commented 5 months ago

@cor why build a status website ourselves? Normally the flow is:

Services (1)
  |
  v
logs and metrics (2)
  |
  v
incident response (3)
  |
  v
status page (4)

It's an insanely bad idea to build 2, 3, and 4 in-house and custom, because then you need further layers to monitor those. Every major SaaS has brandable status page support, with integrations to other SaaSes.

cor commented 5 months ago

Fully agree, let's use a SaaS for the status page instead

Caglankaan commented 5 months ago

Monitoring Individual Services

All of the following alerts are configured to trigger on queries averaged over the last 5 minutes:

For the escalation policy, if we prefer not to upgrade our plan, we can select one on-call person who will receive all notifications first. If this person does not respond, Betterstack can be set to call or notify the entire team after a specified number of minutes. To reactivate monitoring after an alert, the Resolve button for that specific monitor must be clicked; otherwise, no further alerts will be received. All of these monitors have also been added to our status page.

Trigger Cases for Other Services

For the following services, we need to determine the appropriate trigger cases:

For web services, a ping server might be appropriate. However, for other services like Voyager or Galois, we need to define specific conditions to detect issues. I am not entirely sure about these cases and would appreciate further input to establish suitable monitoring criteria.

benluelo commented 5 months ago

For voyager, we will expose telemetry for all of the chains being relayed between, reporting the current height it's fetching; this can then be cross-referenced with chain data to see whether voyager is caught up.
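To make the cross-referencing concrete, here is a hedged sketch; the telemetry URL, the lag threshold, and the use of anyhow/serde_json are assumptions, and the chain query assumes a CometBFT-style `/status` RPC.

```rust
/// How far behind the chain head voyager may lag before we consider it stuck
/// (assumed threshold, not specified in this thread).
const MAX_LAG: u64 = 20;

async fn voyager_reported_height(client: &reqwest::Client, chain: &str) -> anyhow::Result<u64> {
    // Placeholder telemetry endpoint; the real shape is not specified here.
    let height = client
        .get(format!("http://voyager.internal:8080/telemetry/{chain}/height"))
        .send()
        .await?
        .text()
        .await?
        .trim()
        .parse()?;
    Ok(height)
}

async fn chain_latest_height(client: &reqwest::Client, rpc_url: &str) -> anyhow::Result<u64> {
    // Assumes a CometBFT-style `/status` endpoint; other chain types would
    // need their own query.
    let status: serde_json::Value = client
        .get(format!("{rpc_url}/status"))
        .send()
        .await?
        .json()
        .await?;
    let height = status["result"]["sync_info"]["latest_block_height"]
        .as_str()
        .ok_or_else(|| anyhow::anyhow!("unexpected /status response"))?
        .parse()?;
    Ok(height)
}

async fn check_caught_up(client: &reqwest::Client, chain: &str, rpc_url: &str) -> anyhow::Result<bool> {
    let reported = voyager_reported_height(client, chain).await?;
    let latest = chain_latest_height(client, rpc_url).await?;
    let caught_up = latest.saturating_sub(reported) <= MAX_LAG;
    if !caught_up {
        tracing::warn!(chain, reported, latest, "voyager appears to be falling behind");
    }
    Ok(caught_up)
}
```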

Galois exposes (or can expose) a health endpoint, which can be used as expected.