lidofinance / lido-oracle

Pythonic Lido Oracle daemon
GNU General Public License v3.0

Prometheus Metrics #56

Closed joe-bowman closed 1 year ago

joe-bowman commented 3 years ago

The oracle should expose a prometheus metrics endpoint, to track vital values (e.g. eth balance, block heights, etc.)

ongrid commented 3 years ago

Hey @joe-bowman. We are considering implementing one of two possible options:

joe-bowman commented 3 years ago

Scrapable (pull model) is preferred; the push model comes with caveats (it's less trivial to assert something is ‘down’, because the push gateway will keep the last value). It can be worked around by pushing a counter with the current unix time as its value (last_push), so you can alert on that not increasing.

Pull is preferred from an ops point of view, but I appreciate the desire to minimise attack surface.
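
A minimal sketch of the last_push workaround, assuming the Python prometheus_client library and a Pushgateway; the metric name, job name and gateway address are illustrative, not part of the oracle:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Sketch of the "last_push" workaround: push the current unix time alongside
# the real metrics so an alert can fire when it stops increasing.
registry = CollectorRegistry()
last_push = Gauge(
    "oracle_last_push_unixtime",
    "Unix timestamp of the last successful push from the oracle",
    registry=registry,
)
last_push.set(time.time())

# Pushgateway address and job name are placeholder assumptions.
push_to_gateway("localhost:9091", job="lido_oracle", registry=registry)
```

An alert on `time() - oracle_last_push_unixtime` growing beyond the expected push interval then covers the "is it down" case.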

vshvsh commented 3 years ago

@joe-bowman we're now discussing a decision which is a blocker to implementing the metrics. Should the oracle run as a daemon (and export metrics over an endpoint) or as a cronable single-run executable (which can write metrics to a file served by a separate exporter)? Given that the oracle is currently stateless and should fire once per day at a predictable time, I'd prefer the second option for the time being. What do you think?
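
For the single-run variant, a rough sketch assuming prometheus_client and node_exporter's textfile collector; the metric names and output path are made up for illustration:

```python
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

# Sketch of the cronable single-run variant: the oracle writes its metrics to
# a .prom file that a separate exporter (node_exporter textfile collector) serves.
registry = CollectorRegistry()
eth_balance = Gauge("oracle_account_eth_balance",
                    "ETH balance of the oracle account", registry=registry)
last_run = Gauge("oracle_last_run_unixtime",
                 "Unix timestamp of the last completed oracle run", registry=registry)

def main():
    # ... perform the single oracle run, then record its results ...
    eth_balance.set(1.23)           # placeholder value
    last_run.set_to_current_time()  # lets alerts fire if runs stop happening
    write_to_textfile("/var/lib/node_exporter/lido_oracle.prom", registry)

if __name__ == "__main__":
    main()
```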

ongrid commented 3 years ago

The daemon runs in a Docker container and won't rely on any task scheduler. The container and the process inside are always up (the main oracle coroutine sleeps and wakes up once a day or every several hours). This doesn't prevent another coroutine from serving external scrape requests or exporting values to the Prometheus GW using built-in async mechanisms. So both options are possible. Exporting (pushing) just needs less effort in terms of security and more effort on the maintainer's side to configure the Prometheus GW.

I'd prefer to give our operators option 1 (scrapable), but with the strong requirement to run the container inside the perimeter, plus in-process whitelisting and kernel-level firewall rules. At a later stage I'd extend option 1 with some proven authorization mechanism.
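
A rough sketch of how option 1 could look inside the daemon, assuming prometheus_client; the metric name, port and sleep interval are illustrative, and endpoint whitelisting/firewalling is left to the deployment:

```python
import asyncio

from prometheus_client import Gauge, start_http_server

# Sketch of option 1 (scrapable): a /metrics endpoint served in a background
# thread while the main oracle coroutine sleeps between runs.
frame_reports = Gauge("oracle_reports_in_current_frame",
                      "Number of oracle reports observed in the current frame")

async def oracle_loop():
    while True:
        # ... fetch ETH1/beacon data, update metrics, maybe submit a report ...
        frame_reports.set(0)       # placeholder value
        await asyncio.sleep(3600)  # wake up for a dry run every hour

if __name__ == "__main__":
    start_http_server(8000)  # metrics scrapable at http://<host>:8000/metrics
    asyncio.run(oracle_loop())
```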

joe-bowman commented 3 years ago

Push works for me; I appreciate the desire to minimise attack surface.

Presumably it fires more frequently than once a day though (i.e. each time it polls for new blocks)? We don't want to only find out it is down when it has failed to update; being alerted early to the fact that it is not syncing is very useful.

vshvsh commented 3 years ago

The "real" run is once/day on mainnet, but we can have an arbitrary number of dry runs (e.g. 1/hour).

ongrid commented 3 years ago

Monitoring approach for NOPs

The Lido performance and availability metrics can be divided into Overall (about the system as a whole) and Individual (about the components of a single Node Operator). The metrics are also divided by source: some can be fetched from the node or from a contract getter directly, and some can be collected as pre-processed numbers from the oracle daemon. Despite the redundancy, it's recommended to fetch and monitor all of these metrics.

Overall metrics of Lido

Collected from ETH1 via web3

Frames and Reports in it (to be plotted as Histogram):

Ether in the system:

Lido Validators:

stETH supply and balances:

cstETH:

Mined txes to Lido. To be rendered as a histogram. Gives visibility into network performance and funding ingress.

Mempool stats (visibility of network weather). To be rendered as a histogram.

May require some middleware (not in the scope of the current task).

Collected from NOP's oracle daemon instance

Such an exporter IS in the scope of the current task (depicted as a green rectangle in the diagram).

Individual NOP-related metrics

Moving parts that a particular NOP is responsible for. Monitored by the NOP.

Infrastructure metrics from ETH1 and Beacon nodes (Geth and Lighthouse)

Allows a Node Operator to detect issues such as node unavailability, DoS on the node endpoint, and node performance degradation.

Existing solutions can be reused, like the Geth Ethereum Server Dashboard. May require some additional middleware, NOT in the scope of the current task.

Beacon validation metrics for each validator

Existing solutions can be reused, like [prysm-grafana-dashboard](https://github.com/GuillaumeMiralles/prysm-grafana-dashboard).

May require some additional middleware (to fetch keys from the Node Operators registry contract by NOP ID), NOT in the scope of the current task.

Collected from ETH1 and Beacon states (via beacon client and web3)

Frames and the Reports in them. Redundant: already mentioned under Overall metrics of Lido above.

May require tiny middleware, NOT in the scope of the current task.

Collected from NOP's oracle daemon instance

Such an exporter IS in the scope of the current task (depicted as a green rectangle in the diagram). It intersects with the oracle-exported Overall metrics mentioned above.

[Screenshot 2020-12-24 at 11 21 01: the diagram referenced above]

ongrid commented 3 years ago

@vshvsh could you please verify and confirm the overall vision and scope of the current issue?

vshvsh commented 3 years ago

At first glance there's too much; I'll get back to you soon.

vshvsh commented 3 years ago

Ah, I see, most of it is not in scope.

The typical problematic outcomes of an oracle are:

  1. It didn't deliver a data point when it should have
  2. It delivered a wrong/dangerous data point

So we need metrics that help people to prevent both those outcomes.

The list above solves most of it but I'd like to see two more very important metrics:

Metric of "time since last possible oracle report and an actual oracle report transaction from me" - so if you're operating normally, it'll never grow beyond a few minutes but if your oracle skips report for whatever reason op gets an alert when it reaches 10m or something.

And a metric of "number of distinct reported values in the last oracle reports": if it's ever >1, it's an alert.
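
A possible shape for those two metrics, assuming prometheus_client; the metric names and the helper's arguments are illustrative, not the oracle's actual API:

```python
import time

from prometheus_client import Gauge

# Sketch of the two alerting metrics proposed above.
seconds_since_reportable = Gauge(
    "oracle_seconds_since_reportable_frame_unreported",
    "Seconds since a report became possible without a matching report "
    "transaction from this oracle",
)
distinct_reported_values = Gauge(
    "oracle_distinct_reported_values",
    "Number of distinct values reported by oracle members in the last frame",
)

def update_alert_metrics(last_reportable_ts, my_last_report_ts, last_frame_reports):
    # Grows only while a reportable frame is open and we haven't reported yet;
    # alert if it exceeds ~10 minutes.
    if my_last_report_ts >= last_reportable_ts:
        seconds_since_reportable.set(0)
    else:
        seconds_since_reportable.set(time.time() - last_reportable_ts)
    # Any disagreement between oracle members (>1 distinct value) should alert.
    distinct_reported_values.set(len(set(last_frame_reports)))
```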

ongrid commented 3 years ago

Good point.

This should be observed via ETH1-Prometheus middleware from the main source of truth, i.e. contract states and their events, similarly to this piece of code: ethexporter. It's better to do it through several ETH1 endpoints and then plot them in a single view in Grafana: most of the time the numbers will visually merge into one line, and any deviation (a node freeze or peer loss, for example) will spread the lines apart.
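
A tiny sketch of such middleware, assuming web3.py and prometheus_client; the RPC endpoint, contract address, ABI fragment and metric name are placeholder assumptions rather than the real Lido contract interface:

```python
import time

from prometheus_client import Gauge, start_http_server
from web3 import Web3

# Sketch of an ETH1-Prometheus middleware in the spirit of ethexporter:
# poll a contract getter over web3 and expose the value as a gauge.
w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))

ORACLE_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
ORACLE_ABI = [{
    "name": "getExpectedEpochId", "type": "function", "inputs": [],
    "outputs": [{"name": "", "type": "uint256"}], "stateMutability": "view",
}]
oracle = w3.eth.contract(address=ORACLE_ADDRESS, abi=ORACLE_ABI)

expected_epoch = Gauge("lido_oracle_expected_epoch_id",
                       "Expected epoch id read from the oracle contract")

if __name__ == "__main__":
    start_http_server(9100)
    while True:
        expected_epoch.set(oracle.functions.getExpectedEpochId().call())
        time.sleep(60)
```

Running several such instances against different ETH1 endpoints and overlaying them in one Grafana panel gives the "lines spreading apart" signal described above.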

I've added it to the spec.

See Frames and Reports ... in both Overall... and Individual... sections above.

All of this seems to be OUT of the oracle daemon exporter's scope.

ongrid commented 3 years ago

Discussed with @vsmelov: the ETH1-Prometheus middleware and the Beacon-Prometheus middleware are to be implemented as an independent exporter in a separate repository, similarly to https://github.com/certusone/chainlink_exporter.

The Oracle-Daemon-Prometheus exporter will stay in the oracle daemon.