hyperledger / indy-node-monitor

Define initial metrics to export to Prometheus as MVP #5

Closed lohanspies closed 2 years ago

lohanspies commented 4 years ago

Prometheus MVP Metrics

Should generate a timeout when trying to pull validator_info from inaccessible nodes.

kiview commented 4 years ago

Since Prometheus will only gather numeric metrics, there are some things to consider when modeling the metrics.

The Sovrin Network name being monitored

Should be a label attached to all metrics.

Node alias name

Should also be a label, one per node (the nodes being the top-level objects in the dict we are currently fetching).

Detect when a node is inaccessible and produce standard output for that situation.

This would happen outside of the exporter, either in Prometheus through Alertmanager, or in Grafana.

The number of transactions per Indy ledger, especially the domain ledger.

Should work as Gauge transactions_total with a label per ledger.

The average read and write times for the node.

Here I wonder how the values are measured. Ideally, we could just record the total requests in a Gauge and let Prometheus derive the other metrics. Otherwise, histograms for throughput might be fine; we just have to be careful to avoid statistically wrong double aggregations.
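
To illustrate the double-aggregation concern, here is a small arithmetic sketch. The per-node numbers are made up for the example and do not come from a real network.

```python
def mean_latency(total_seconds, request_count):
    """Mean request duration, guarding against division by zero."""
    return total_seconds / request_count if request_count else 0.0


# Hypothetical samples: (sum of request durations in seconds, request count).
node_a = (10.0, 100)  # mean 0.1 s over many requests
node_b = (9.0, 10)    # mean 0.9 s over few requests

# Averaging the two per-node averages ignores how many requests each node served.
naive = (mean_latency(*node_a) + mean_latency(*node_b)) / 2

# Aggregating the raw totals first weights every request equally.
correct = mean_latency(node_a[0] + node_b[0], node_a[1] + node_b[1])

print(f"average of averages: {naive:.3f} s")
print(f"average from totals: {correct:.3f} s")
```

The two results differ noticeably (0.500 s versus roughly 0.173 s), which is why exporting raw totals and leaving the aggregation to Prometheus tends to be the safer choice.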

The uptime of the node (time since last restart).

Clearly a gauge with a label per node.

The time since last freshness check (should be less than 5 minutes).

Diff the last update time against the current time and record the result as a Gauge?
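
A minimal sketch of that diff, assuming the timestamp arrives as an ISO-like string with a UTC offset (for example `2020-07-06 23:57:33+00:00`). The function name `seconds_since` and the constant `FRESHNESS_LIMIT_S` are hypothetical.

```python
from datetime import datetime, timezone

FRESHNESS_LIMIT_S = 5 * 60  # the "less than 5 minutes" expectation, in seconds


def seconds_since(last_updated, now=None):
    """Return the age in seconds of an offset-aware timestamp string."""
    last = datetime.fromisoformat(last_updated)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last).total_seconds()


# Fixed "now" so the example is reproducible.
now = datetime(2020, 7, 6, 23, 58, 33, tzinfo=timezone.utc)
age = seconds_since("2020-07-06 23:57:33+00:00", now)
print(age, age < FRESHNESS_LIMIT_S)
```

The resulting number of seconds can be exported directly as a gauge value, and the five-minute threshold can then live in an alerting rule rather than in the exporter.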

Node IP address information

This could be a label, same as the node name.

Total nodes in pool information

Gauge with pool name as label.
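
As a rough illustration of the label modelling above, the sketch below renders single samples in the Prometheus text exposition format using only the standard library. The metric name follows the suggestion in this comment; the label names (`network`, `node`, `ledger`) and the values are assumptions for the example, and a real exporter would more likely use a Prometheus client library.

```python
def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value):
    """Render one sample line such as: name{label="value"} 1"""
    label_str = ",".join(f'{key}="{_escape(val)}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


# Hypothetical network, node, and ledger names.
line = format_sample(
    "transactions_total",
    {"network": "sandbox", "node": "Node1", "ledger": "domain"},
    42,
)
print("# TYPE transactions_total gauge")
print(line)
```

Network and node alias appear as labels on every sample, while the numeric value carries the measurement itself.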

kiview commented 4 years ago

One question regarding freshness status:

When I have a test network with 4 nodes, I get 3 freshness values, as you have posted above:

          "Freshness_status": {
            "1": {
              "Last_updated_time": "2020-07-06 23:55:07+00:00",
              "Has_write_consensus": true
            },
            "0": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            },
            "2": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            }
          }

What do these numeric keys (0, 1, 2) represent, and how should we interpret them?

WadeBarnes commented 2 years ago

These metrics should be available on the auto-provisioned dashboards supplied with the monitoring stack. If anything else is needed or anything is missing, a separate issue can be opened.