Define initial metrics to export to Prometheus as MVP

lohanspies commented 4 years ago

Prometheus MVP Metrics

The Sovrin Network name being monitored. Should be able to get this from the pool being connected to.
Node alias name
Detect when a node is inaccessible and produce standard output for that situation.

Should generate a timeout when trying to pull validator_info from inaccessible nodes.

Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.

That indicates that the internal port to the node is not accessible, even though the public port is accessible.

"Reachable_nodes": [
    [
      "Node1",
      0
    ],
    [
      "Node3",
      null
    ],
    [
      "Node4",
      null
    ]
  ],
  "Unreachable_nodes": [
    [
      "Node2",
      null
    ]
  ],
  "Reachable_nodes_count": 3,
  "Unreachable_nodes_count": 1,

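As a rough sketch of how the reachability fields above could be reduced to numbers and names for monitoring, assuming the validator_info output has already been parsed into a Python dict. The helper name `reachability_summary` and the sample dict are illustrative only; the key names are the ones shown in the snippet above.

```python
def reachability_summary(info):
    """Summarise reachability from a dict holding the fields shown above.

    Hypothetical helper: each entry in the two lists is assumed to be a
    [node_name, value] pair, as in the sample output.
    """
    reachable = [entry[0] for entry in info.get("Reachable_nodes", [])]
    unreachable = [entry[0] for entry in info.get("Unreachable_nodes", [])]
    return {
        "reachable_count": len(reachable),
        "unreachable_count": len(unreachable),
        "unreachable_names": unreachable,
    }


# Sample data copied from the snippet above (JSON null becomes None).
sample = {
    "Reachable_nodes": [["Node1", 0], ["Node3", None], ["Node4", None]],
    "Unreachable_nodes": [["Node2", None]],
}
summary = reachability_summary(sample)
```

A non-zero `unreachable_count` on a node that still answers validator_info would be the "accessible but unreachable" case described above.
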
The number of transactions per Indy ledger, especially the domain ledger.

"transaction-count": {
          "ledger": 21,
          "pool": 4,
          "config": 0,
          "audit": 1042
        },

The average read and write times for the node.

"throughput": {
          "0": 0.0017547843
        },
        "master throughput": 0.0017547843,
        "total requests": 16,
        "avg backup throughput": null,
        "master throughput ratio": null,
        "average-per-second": {
          "read-transactions": 0.0338584473,
          "write-transactions": 0.0001539895
        },

The average throughput time for the node.

"throughput": {
          "0": 0.0017547843
        },
        "master throughput": 0.0017547843,
        "total requests": 16,
        "avg backup throughput": null,
        "master throughput ratio": null,
        "average-per-second": {
          "read-transactions": 0.0338584473,
          "write-transactions": 0.0001539895
        },

The uptime of the node (time since last restart).

"transaction-count": {
          "ledger": 21,
          "pool": 4,
          "config": 0,
          "audit": 1042
        },
        "uptime": 103903
      },

The time since last freshness check (should be less than 5 minutes).

      "Freshness_status": {
        "1": {
          "Last_updated_time": "2020-07-06 23:55:07+00:00",
          "Has_write_consensus": true
        },
        "0": {
          "Last_updated_time": "2020-07-06 23:57:33+00:00",
          "Has_write_consensus": true
        },
        "2": {
          "Last_updated_time": "2020-07-06 23:57:33+00:00",
          "Has_write_consensus": true
        }
      }

Node IP address information

"Node_info": {
      "Name": "Node4",
      "Mode": "participating",
      "Client_port": 9708,
      "Client_ip": "0.0.0.0",
      "Client_protocol": "tcp",
      "Node_port": 9707,
      "Node_ip": "0.0.0.0",

Total nodes in pool information

"Pool_info": {
      "Read_only": false,
      "Total_nodes_count": 4,

kiview commented 4 years ago

Since Prometheus will only gather numeric metrics, there are some things to consider when modeling the metrics.

The Sovrin Network name being monitored

Should be a label attached to all metrics.

Node alias name

Should also be a label, for each node we get (with the nodes being the top-level objects in the dict we are currently fetching).

Detect when a node is inaccessible and produce standard output for that situation.

This would happen outside of the exporter, either in Prometheus through Alertmanager, or in Grafana.

The number of transactions per Indy ledger, especially the domain ledger.

Should work as Gauge transactions_total with a label per ledger.
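
To make the "label per ledger" idea concrete, here is a minimal sketch that renders the transaction-count block as gauge samples in the Prometheus text exposition format, using only the standard library. The label names (`network`, `node`, `ledger`), the function name, and the network value `sandbox` are assumptions for illustration; a real exporter would more likely use a client library than build the text by hand.

```python
def render_transaction_gauges(network, node, transaction_count):
    """Render one gauge sample per ledger in Prometheus text exposition format.

    Hypothetical helper: label values are inserted as-is, so this sketch
    assumes they contain no quotes, backslashes, or newlines.
    """
    lines = [
        "# HELP transactions_total Number of transactions per Indy ledger.",
        "# TYPE transactions_total gauge",
    ]
    for ledger, count in sorted(transaction_count.items()):
        lines.append(
            'transactions_total{network="%s",node="%s",ledger="%s"} %d'
            % (network, node, ledger, count)
        )
    return "\n".join(lines) + "\n"


# Values copied from the transaction-count sample earlier in this issue.
sample = {"ledger": 21, "pool": 4, "config": 0, "audit": 1042}
text = render_transaction_gauges("sandbox", "Node4", sample)
```

Keeping the ledger as a label rather than part of the metric name means one query can sum or compare ledgers across nodes.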

The average read and write times for the node.

Here I wonder how the values are measured. Ideally, we could just record the total requests in a Gauge and let Prometheus infer the other metrics. Otherwise, having histograms for throughput might be fine; we just have to be careful to avoid statistically wrong double aggregations.
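
As an illustration of what "let Prometheus infer the other metrics" would mean, the sketch below computes an average per-second rate from two samples of a running total. This is a simplified stand-in written for this thread, not Prometheus's actual `rate()` implementation, which also extrapolates over the query window; the function name and the sample numbers are assumptions.

```python
def per_second_rate(earlier, later):
    """Average per-second increase between two (timestamp, value) samples.

    Simplified sketch: if the value went down, the node is assumed to have
    restarted and the total is treated as having started again from zero.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        v0 = 0  # assumed reset of the running total
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples: total requests 16 at t=0s and 46 at t=60s.
rate = per_second_rate((0, 16), (60, 46))
```

Exporting only the raw total keeps the exporter simple and leaves the choice of averaging window to whoever writes the query.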

The uptime of the node (time since last restart).

Clearly a gauge with a label per node.

The time since last freshness check (should be less than 5 minutes).

Diff against the current time and record as a Gauge?
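
A minimal sketch of that diff, assuming the Freshness_status block has been parsed into a dict and that the timestamps keep the format shown in the sample output. The function name and the fixed `now` value are illustrative; the 300-second threshold comes from the "less than 5 minutes" expectation in the original list.

```python
from datetime import datetime, timezone


def freshness_age_seconds(freshness_status, now=None):
    """Map each Freshness_status key to seconds since its Last_updated_time.

    Hypothetical helper: timestamps are assumed to be ISO 8601 with a UTC
    offset, e.g. "2020-07-06 23:57:33+00:00".
    """
    now = now or datetime.now(timezone.utc)
    ages = {}
    for key, entry in freshness_status.items():
        updated = datetime.fromisoformat(entry["Last_updated_time"])
        ages[key] = (now - updated).total_seconds()
    return ages


# Two entries copied from the sample output; "now" is fixed for repeatability.
sample = {
    "0": {"Last_updated_time": "2020-07-06 23:57:33+00:00", "Has_write_consensus": True},
    "1": {"Last_updated_time": "2020-07-06 23:55:07+00:00", "Has_write_consensus": True},
}
fixed_now = datetime(2020, 7, 7, 0, 0, 0, tzinfo=timezone.utc)
ages = freshness_age_seconds(sample, now=fixed_now)
stale = [key for key, age in ages.items() if age >= 300]
```

Each age could then be exported as a gauge, with the Freshness_status key as a label, and the 5-minute check left to an alerting rule.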

Node IP address information

This could be a label, same as the node name.

Total nodes in pool information

Gauge with pool name as label.

kiview commented 4 years ago

One question regarding freshness status:

When I have a test network with 4 nodes, I get 3 freshness values, as you have posted above:

          "Freshness_status": {
            "1": {
              "Last_updated_time": "2020-07-06 23:55:07+00:00",
              "Has_write_consensus": true
            },
            "0": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            },
            "2": {
              "Last_updated_time": "2020-07-06 23:57:33+00:00",
              "Has_write_consensus": true
            }
          }

What do these numbers used as keys (0, 1, 2) represent, and how should we interpret them?

WadeBarnes commented 2 years ago

These metrics should be available on the auto-provisioned dashboards supplied with the monitoring stack. If anything else is needed or anything is missing, a separate issue can be opened.

hyperledger / indy-node-monitor

Define initial metrics to export to Prometheus as MVP #5