hashgraph / hedera-mirror-node

Hedera Mirror Node archives data from consensus nodes and serves it via an API
Apache License 2.0

Add labels to all hedera-mirror-monitor metrics #8874

Open shezaan-hashgraph opened 1 month ago

shezaan-hashgraph commented 1 month ago

Problem

The DevOps team would like to request that the following labels (and their corresponding values) be added to all hedera-mirror-monitor metrics, as we consider them standardized labels that we can use in Prometheus for grouping purposes.

  1. network - Name of the network, e.g. testnet, mainnet, previewnet, other
  2. node_id - The node's ID. Node IDs are not the same as account IDs, e.g. node_id: 0 corresponds to account_id: 0.0.3 (or account_id: 3)
  3. proxy_ip - The IP address of the proxy in front of the node

Solution

Add the following labels to all exported hedera-mirror-monitor metrics:

  1. network - Name of the network, e.g. testnet, mainnet, previewnet, other
  2. node_id - The node's ID. Node IDs are not the same as account IDs, e.g. node_id: 0 corresponds to account_id: 0.0.3 (or account_id: 3)
  3. proxy_ip - The IP address of the proxy in front of the node
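
As an illustration, a sample with the requested labels would look like the following in the Prometheus exposition format (the metric name is taken from the query later in this thread; the label values are hypothetical):

```
hedera_mirror_monitor_publish_handle_seconds_max{network="testnet",node_id="0",proxy_ip="10.0.0.5"} 0.42
```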

Alternatives

No response

steven-sheehy commented 1 month ago

Network can be added to all metrics in multiple ways. You can add it yourself to all metrics by adding it to the extraLabels in the Prometheus remote write configuration. We can add it manually to each metric, but this wouldn't add it to non-hedera_ metrics. We can set management.metrics.tags.network=${hedera.mirror.monitor.network} via config, but it's a bit redundant for us since we already have namespace, which is named after the network. You can try setting that property yourselves.
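
For reference, a sketch of how that property could be expressed in a Spring Boot application.yml (assuming the monitor resolves hedera.mirror.monitor.network to the network name; management.metrics.tags.* is Spring Boot's standard mechanism for common metric tags):

```yaml
management:
  metrics:
    tags:
      # Applies a network label to every metric the application exports
      network: ${hedera.mirror.monitor.network}
```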

For the others, it's doable but would take some rework of the internals. Currently we're not passing the node info to the metrics layer. We're just extracting the list of node account IDs from the SDK Transaction in the metrics layer. It's something we can explore.

shezaan-hashgraph commented 1 month ago
  1. Adding the extraLabels in the remote write configuration is not an option, as we run a single Prometheus operator for all non-prod networks (testnet, previewnet, engnet, etc.). If we added a label at that level, it would be applied to all metrics, even for serviceMonitors that don't belong to the network for which the extraLabels were added. I've had success relabeling the namespace label to network via the PromQL query, but that doesn't seem like a transparent way of doing this, particularly when there is an incident and one of our engineers can't ascertain where the network label came from without making sense of the query itself. Just my 2 cents here, but runtime re-labeling is possible using the label_replace function of PromQL:
```promql
max by(node, network) (
  label_replace(
    hedera_mirror_monitor_publish_handle_seconds_max{application="hedera-mirror-monitor", status="SUCCESS"},
    "network",
    "$1",
    "namespace",
    "(.*)"
  )
)
```

However, for sufficiently complex queries these re-label expressions can become cumbersome. I would also imagine that when Grafana SaaS eventually becomes prohibitively expensive and we decide to run our own Prometheus and Grafana servers, such label transformations could add up to significant overhead/load.
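
One way to avoid repeating the label_replace in every dashboard query (a sketch of a standard Prometheus feature, not something the mirror node ships; the rule name is hypothetical) is a recording rule that evaluates the relabeling once and stores the result as a new series:

```yaml
groups:
  - name: hedera-mirror-monitor
    rules:
      # Copies the namespace label into a network label at rule-evaluation time,
      # so dashboards can query the recorded series without inline label_replace
      - record: hedera_mirror_monitor:publish_handle_seconds_max:network
        expr: |
          label_replace(
            hedera_mirror_monitor_publish_handle_seconds_max{application="hedera-mirror-monitor"},
            "network", "$1", "namespace", "(.*)"
          )
```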

  2. W.r.t. adding the label management.metrics.tags.network=${hedera.mirror.monitor.network} via config, which config would that be? I'm guessing it's not the values.yaml file, since I don't see such a Helm config exposed.

  3. Without the node_ip we are limited to queries that group only by node_id (which uses the account_id) and network. So at best we can determine that there is an issue with either the proxy or the node, but we would be unable to determine which of the two is problematic without some debugging. Having the node_ip would help us figure out whether the node is the problem by enabling us to drill down to the node itself.

  4. Lastly, we also need the proxy_ip for each proxy in front of each node. Production, for example, has 2 proxies in front of each node; if one of them has a problem, we would need to know which one.