Closed by lohanspies 2 years ago
Since Prometheus will only gather numeric metrics, there are some things to consider when modeling the metrics.
The Sovrin Network name being monitored
Should be a label attached to all metrics.
Node alias name
Should also be a label, one per node we get (the nodes being the top-level objects in the dict we are currently fetching).
Detect when a node is inaccessible and produce standard output for that situation.
This would happen outside of the exporter, either in Prometheus through Alertmanager, or in Grafana.
The number of transactions per Indy ledger, especially the domain ledger.
Should work as a Gauge transactions_total with a label per ledger.
The average read and write times for the node.
Here I wonder how the values are measured. Ideally, we could just record the total requests in a Gauge and let Prometheus infer the other metrics. Otherwise, having histograms for throughput might be fine; we just have to be careful to avoid statistically wrong double aggregations (for example, averaging values that are already averages).
The uptime of the node (time since last restart).
Clearly a gauge with a label per node.
The time since last freshness check (should be less than 5 minutes).
Diff against time of the and record as Gauge?
Node IP address information
This could be a label, same as the node name.
Total nodes in pool information
Gauge with pool name as label.
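The label modelling discussed above can be sketched without any exporter library by rendering samples directly in the Prometheus text exposition format. This is only a minimal illustration, not the exporter's actual code: the helper names, the label names (network, node, ledger) and the sample values are assumptions made up for the example; only the metric name transactions_total comes from the discussion.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples) -> str:
    """Render one gauge family; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical values: network, node and ledger names are illustrative only.
samples = [
    ({"network": "sandbox", "node": "Node1", "ledger": "domain"}, 42),
    ({"network": "sandbox", "node": "Node1", "ledger": "pool"}, 4),
]
output = render_gauge(
    "transactions_total", "Transactions per ledger (illustrative).", samples
)
print(output)
```

Keeping the network and node names as labels on every sample means Prometheus or Grafana can filter and aggregate per network, per node or per ledger without the exporter having to pre-aggregate anything.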
One question regarding freshness status:
When I have a test network with 4 nodes, I get 3 freshness values, as you have posted above:
"Freshness_status": {
    "1": {
        "Last_updated_time": "2020-07-06 23:55:07+00:00",
        "Has_write_consensus": true
    },
    "0": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
    },
    "2": {
        "Last_updated_time": "2020-07-06 23:57:33+00:00",
        "Has_write_consensus": true
    }
}
What do these numeric keys (0, 1, 2) represent, and how should we interpret them?
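Whatever the keys turn out to mean, the "time since last freshness check" value could be derived per key by diffing each Last_updated_time against a reference time. The following is a rough sketch under assumptions: the function name and the fixed reference time are hypothetical, and the keys are treated as opaque identifiers. The 5-minute limit comes from the requirement list.

```python
from datetime import datetime, timezone

FRESHNESS_LIMIT_SECONDS = 5 * 60  # "should be less than 5 minutes"


def freshness_ages(freshness_status: dict, now: datetime) -> dict:
    """Return seconds elapsed since Last_updated_time for each key."""
    ages = {}
    for key, entry in freshness_status.items():
        # The timestamps carry a UTC offset, so the result is timezone-aware.
        updated = datetime.fromisoformat(entry["Last_updated_time"])
        ages[key] = (now - updated).total_seconds()
    return ages


freshness_status = {
    "1": {"Last_updated_time": "2020-07-06 23:55:07+00:00", "Has_write_consensus": True},
    "0": {"Last_updated_time": "2020-07-06 23:57:33+00:00", "Has_write_consensus": True},
    "2": {"Last_updated_time": "2020-07-06 23:57:33+00:00", "Has_write_consensus": True},
}

# Fixed reference time so the example is reproducible (an assumption);
# a live exporter would use datetime.now(timezone.utc) instead.
now = datetime(2020, 7, 6, 23, 58, 33, tzinfo=timezone.utc)
ages = freshness_ages(freshness_status, now)
stale = [key for key, age in ages.items() if age > FRESHNESS_LIMIT_SECONDS]
print(ages, stale)
```

Each age could then be exported as a Gauge with the key as a label, leaving the 5-minute threshold to an alerting rule rather than hard-coding it in the exporter.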
These metrics should be available on the auto-provisioned dashboards supplied with the monitoring stack. If anything else is needed or anything is missing, a separate issue can be opened.
Prometheus MVP Metrics
The Sovrin Network name being monitored
Should be able to get this from the pool being connected to.
Node alias name
Detect when a node is inaccessible and produce standard output for that situation.
Should generate a timeout when trying to pull validator_info from inaccessible nodes.
Detect any nodes that are accessible but that are "unreachable" to some or all of the other Indy nodes.
The number of transactions per Indy ledger, especially the domain ledger.
The average read and write times for the node.
The average throughput time for the node.
The uptime of the node (time since last restart).
The time since last freshness check (should be less than 5 minutes).
Node IP address information
Total nodes in pool information
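As a rough illustration of the "inaccessible node" item, a timeout when pulling validator_info could be turned into a 0/1 value per node, so that alerting stays outside the exporter. The input shape here is an assumption: a hypothetical mapping from node alias to the fetched info, with None standing in for a request that timed out. The function name, node aliases and network name are likewise made up for the example.

```python
def node_up_samples(network: str, results: dict) -> list:
    """Build (labels, value) pairs: 1 if validator info was fetched, 0 on timeout.

    results maps a node alias to its validator info dict, or to None when the
    request timed out (hypothetical convention for this sketch).
    """
    samples = []
    for alias, info in sorted(results.items()):
        labels = {"network": network, "node": alias}
        samples.append((labels, 0 if info is None else 1))
    return samples


# Hypothetical fetch results: Node3 did not answer before the timeout.
results = {
    "Node1": {"Freshness_status": {}},
    "Node2": {"Freshness_status": {}},
    "Node3": None,
}
up = node_up_samples("sandbox", results)
total_nodes = len(results)
accessible_nodes = sum(value for _, value in up)
print(up, total_nodes, accessible_nodes)
```

With the per-node value exported as a Gauge, "node inaccessible" becomes a simple rule on a value of 0, and the total node count can be exposed alongside it with the pool name as a label.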