lidofinance / lido-oracle

Pythonic Lido Oracle daemon
GNU General Public License v3.0

Prometheus Metrics #56

Closed joe-bowman closed 1 year ago

joe-bowman commented 3 years ago

The oracle should expose a prometheus metrics endpoint, to track vital values (e.g. eth balance, block heights, etc.)

ongrid commented 3 years ago

Hey @joe-bowman. We are considering implementing one of two possible options:

joe-bowman commented 3 years ago

Scrapable (pull model) is preferred; the push model comes with caveats (it's less trivial to assert something is ‘down’, because the push gateway will keep the last value). It can be worked around by pushing a counter with the current unix time as its value (last_push), so you can alert on that not increasing.

Pull is preferred from an ops point of view, but I appreciate the desire to minimise attack surface.
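
A minimal sketch of the last_push workaround, assuming the Python prometheus_client library and a Pushgateway; the metric name, job name and gateway address are illustrative, not part of the oracle:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Sketch of the "last_push" workaround: push the current unix time alongside
# the real metrics so an alert can fire when it stops increasing.
registry = CollectorRegistry()
last_push = Gauge(
    "oracle_last_push_unixtime",
    "Unix timestamp of the last successful push from the oracle",
    registry=registry,
)
last_push.set(time.time())

# Pushgateway address and job name are placeholder assumptions.
push_to_gateway("localhost:9091", job="lido_oracle", registry=registry)
```

An alert on `time() - oracle_last_push_unixtime` growing beyond the expected push interval then covers the "is it down" case.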

vshvsh commented 3 years ago

@joe-bowman we're now discussing a decision which is a blocker to implementing the metrics. Should the oracle run as a daemon (and export metrics over an endpoint) or as a cronable single-run executable (which can write metrics to a file served by a separate exporter)? Given that the oracle is currently stateless and should fire once per day at a predictable time, I'd prefer the second option for the time being. What do you think?
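
For the single-run variant, a rough sketch assuming prometheus_client and node_exporter's textfile collector; the metric names and output path are made up for illustration:

```python
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

# Sketch of the cronable single-run variant: the oracle writes its metrics to
# a .prom file that a separate exporter (node_exporter textfile collector) serves.
registry = CollectorRegistry()
eth_balance = Gauge("oracle_account_eth_balance",
                    "ETH balance of the oracle account", registry=registry)
last_run = Gauge("oracle_last_run_unixtime",
                 "Unix timestamp of the last completed oracle run", registry=registry)

def main():
    # ... perform the single oracle run, then record its results ...
    eth_balance.set(1.23)           # placeholder value
    last_run.set_to_current_time()  # lets alerts fire if runs stop happening
    write_to_textfile("/var/lib/node_exporter/lido_oracle.prom", registry)

if __name__ == "__main__":
    main()
```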

ongrid commented 3 years ago

The daemon runs in a Docker container and won't rely on any task scheduler. The container and the process inside are always up (the main oracle coroutine sleeps and wakes up once a day or every several hours). This doesn't prevent another coroutine from serving external scrape requests or exporting values to the Prometheus GW using built-in async mechanisms. So both options are possible. Exporting (pushing) just needs less effort in terms of security and more effort on the maintainer's side to configure the Prometheus GW.

I'd prefer to give our operators option 1 (scrapable), but with the strong requirement to run the container inside the perimeter, plus in-process whitelisting and kernel-level firewall rules. At a later stage I'd extend option 1 with some proven authorization mechanism.
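
A rough sketch of how option 1 could look inside the daemon, assuming prometheus_client; the metric name, port and sleep interval are illustrative, and endpoint whitelisting/firewalling is left to the deployment:

```python
import asyncio

from prometheus_client import Gauge, start_http_server

# Sketch of option 1 (scrapable): a /metrics endpoint served in a background
# thread while the main oracle coroutine sleeps between runs.
frame_reports = Gauge("oracle_reports_in_current_frame",
                      "Number of oracle reports observed in the current frame")

async def oracle_loop():
    while True:
        # ... fetch ETH1/beacon data, update metrics, maybe submit a report ...
        frame_reports.set(0)       # placeholder value
        await asyncio.sleep(3600)  # wake up for a dry run every hour

if __name__ == "__main__":
    start_http_server(8000)  # metrics scrapable at http://<host>:8000/metrics
    asyncio.run(oracle_loop())
```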

joe-bowman commented 3 years ago

Push works for me; I appreciate the desire to minimise attack surface.

Presumably it fires more frequently than once a day though (i.e. each time it polls for new blocks)? We don't want to only find out it is down when it has failed to update; being alerted early to the fact that it is not syncing is very useful.

vshvsh commented 3 years ago

The "real" run is once/day on mainnet, but we can have an arbitrary number of dry runs (e.g. 1/hour).

ongrid commented 3 years ago

Monitoring approach for NOPs

The Lido performance and availability metrics can be divided into Overall (about the system as a whole) and Individual (about the components of a single Node Operator). The metrics are also divided by source: some can be fetched from the node or from a contract getter directly, and some can be collected as pre-processed numbers from the oracle daemon. Despite the redundancy, it's recommended to fetch and monitor all of these metrics.

Overall metrics of Lido

Collected from ETH1 via web3

Frames and Reports in it (to be plotted as Histogram):

Ether in the system:

Lido Validators:

stETH supply and balances:

cstETH:

Mined txes to Lido. To be rendered as a histogram. Gives visibility into network performance and funding ingress.

Mempool stats (visibility of network weather). To be rendered as a histogram.

May require some middleware (not in the scope of the current task).

Collected from NOP's oracle daemon instance

Such an exporter IS in the scope of the current task (depicted as a green rectangle in the diagram).

Individual NOP-related metrics

Moving parts that a particular NOP is responsible for. Monitored by the NOP.

Infrastructure metrics from ETH1 and Beacon nodes (Geth and Lighthouse)

Allows a Node Operator to detect issues such as node unavailability, DoS on the node endpoint, and node performance degradation.

Existing solutions can be reused, like the Geth Ethereum Server Dashboard. May require some additional middleware, NOT in the scope of the current task.

Beacon validation metrics for each validator

Existing solutions can be reused, like [prysm-grafana-dashboard](https://github.com/GuillaumeMiralles/prysm-grafana-dashboard).

May require some additional middleware (to fetch keys from the Node Operators registry contract by NOP ID), NOT in the scope of the current task.

Collected from ETH1 and Beacon states (via beacon client and web3)

Frames and the Reports in them. Redundant: already mentioned under Overall metrics of Lido above.

May require tiny middleware, NOT in the scope of the current task.

Collected from NOP's oracle daemon instance

Such an exporter IS in the scope of the current task (depicted as a green rectangle in the diagram). It intersects with the oracle-exported Overall metrics mentioned above.

[Screenshot 2020-12-24 at 11 21 01: the diagram referenced above]

ongrid commented 3 years ago

@vshvsh could you please verify and confirm the overall vision and scope of the current issue?

vshvsh commented 3 years ago

At first glance there's too much; I'll get back to you soon.

vshvsh commented 3 years ago

Ah, I see, most of it is not in scope.

The typical problematic outcomes of an oracle are:

  1. It didn't deliver a data point when it should have
  2. It delivered a wrong/dangerous data point

So we need metrics that help people to prevent both those outcomes.

The list above solves most of it but I'd like to see two more very important metrics:

Metric of "time since last possible oracle report and an actual oracle report transaction from me" - so if you're operating normally, it'll never grow beyond a few minutes but if your oracle skips report for whatever reason op gets an alert when it reaches 10m or something.

And a metric of "number of distinct reported values in the last oracle reports": if it's ever >1, it's an alert.
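
A possible shape for those two metrics, assuming prometheus_client; the metric names and the helper's arguments are illustrative, not the oracle's actual API:

```python
import time

from prometheus_client import Gauge

# Sketch of the two alerting metrics proposed above.
seconds_since_reportable = Gauge(
    "oracle_seconds_since_reportable_frame_unreported",
    "Seconds since a report became possible without a matching report "
    "transaction from this oracle",
)
distinct_reported_values = Gauge(
    "oracle_distinct_reported_values",
    "Number of distinct values reported by oracle members in the last frame",
)

def update_alert_metrics(last_reportable_ts, my_last_report_ts, last_frame_reports):
    # Grows only while a reportable frame is open and we haven't reported yet;
    # alert if it exceeds ~10 minutes.
    if my_last_report_ts >= last_reportable_ts:
        seconds_since_reportable.set(0)
    else:
        seconds_since_reportable.set(time.time() - last_reportable_ts)
    # Any disagreement between oracle members (>1 distinct value) should alert.
    distinct_reported_values.set(len(set(last_frame_reports)))
```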

ongrid commented 3 years ago

Good point.

This should be observed via ETH1-Prometheus middleware from the main source of truth, i.e. contract states and their events, similarly to this piece of code: ethexporter. It's better to do it through several ETH1 endpoints and then plot them in a single view in Grafana: most of the time the numbers will visually merge into one line, and any deviation (a node freeze or peer loss, for example) will spread the lines apart.
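
A tiny sketch of such middleware, assuming web3.py and prometheus_client; the RPC endpoint, contract address, ABI fragment and metric name are placeholder assumptions rather than the real Lido contract interface:

```python
import time

from prometheus_client import Gauge, start_http_server
from web3 import Web3

# Sketch of an ETH1-Prometheus middleware in the spirit of ethexporter:
# poll a contract getter over web3 and expose the value as a gauge.
w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))

ORACLE_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
ORACLE_ABI = [{
    "name": "getExpectedEpochId", "type": "function", "inputs": [],
    "outputs": [{"name": "", "type": "uint256"}], "stateMutability": "view",
}]
oracle = w3.eth.contract(address=ORACLE_ADDRESS, abi=ORACLE_ABI)

expected_epoch = Gauge("lido_oracle_expected_epoch_id",
                       "Expected epoch id read from the oracle contract")

if __name__ == "__main__":
    start_http_server(9100)
    while True:
        expected_epoch.set(oracle.functions.getExpectedEpochId().call())
        time.sleep(60)
```

Running several such instances against different ETH1 endpoints and overlaying them in one Grafana panel gives the "lines spreading apart" signal described above.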

I've added it to the spec.

See Frames and Reports ... in both Overall... and Individual... sections above.

All of this seems to be OUT of the oracle daemon exporter's scope.

ongrid commented 3 years ago

Discussed with @vsmelov: the ETH1-Prometheus middleware and the Beacon-Prometheus middleware are to be implemented as an independent exporter in a separate repository, similarly to https://github.com/certusone/chainlink_exporter.

The Oracle-Daemon-Prometheus exporter will stay in the oracle daemon.