longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Metrics Nodes Prometheus #5535

Open frit0-rb opened 1 year ago

frit0-rb commented 1 year ago

Describe the bug

When querying the metrics URL, in this case http://longhorn-backend.longhorn-system:9500/metrics, the response contains metrics from only one of the nodes. However, when querying the backing pods listed in the service endpoints, each of them returns information for a different single node.

Steps to reproduce

To see metrics from the service:

  1. Query the metrics through a pod in the cluster: kubectl run log-metrics --image alpine --restart=Never --command -- wget -O - http://longhorn-backend.longhorn-system:9500/metrics
  2. Check the output: kubectl logs log-metrics

To see the metrics of each pod:

  1. Get the pod IPs: kubectl get endpoints longhorn-backend -n longhorn-system -o json | jq '.subsets[] | .addresses[] | .ip'
  2. For each IP: kubectl run log-metrics --image alpine --restart=Never --command -- wget -O - http://$ip:9500/metrics
  3. Check the output: kubectl logs log-metrics
  4. Delete the log-metrics pod if necessary (it must be deleted before reusing the name for the next IP).

Expected behavior

The output should include information from all nodes. At the very least, there should be 4x the number of nodes in longhorn_node_status entries, with the `node` attribute visible for each node.

Actual behavior

If you filter metrics by longhorn_node_status, metrics for only one node are shown. longhorn_node_storage_capacity_bytes also shows only results for one node, as does longhorn_volume_state, etc.

Environment

• Longhorn version: v1.4.0
• Installation method: kubectl
• Kubernetes distro: Rancher-managed RKE2/k3s
  o Number of management nodes in the cluster: 1
  o Number of worker nodes in the cluster: 2
• Node config
  o OS type and version: Rocky Linux 9 (also seen on RHEL 8.7)
  o CPU per node: 4
  o Memory per node: 16Gi
  o Disk type: VMWare
  o Network bandwidth between the nodes: n/a
• Underlying Infrastructure: VMWare/ESXi (also seen on HyperV)
• Number of Longhorn volumes in the cluster: 12 (also seen with 26)

Additional context

tomwiggers commented 1 year ago

I am experiencing the same bug/issue in this environment:

I suspect the node whose metrics are collected is the one that Prometheus (a single pod) is currently running on, but I am not 100% sure.

Edit: The longhorn-backend service has 3 endpoints, one for each node. Looking at the logs of the manager pods that act as endpoints, I now see that the metrics returned depend on whichever pod the request from Prometheus gets sent to.

This makes me think there is some sort of communication issue, but I don't know why. The Prometheus endpoint to scrape is set to longhorn-backend.<namespace>:9500. I have confirmed the metrics are available by fetching them with wget from another pod in the same namespace Prometheus is running in.

derekbit commented 1 year ago

The metrics controller currently collects metrics for only the current node, as seen in the code linked here (https://github.com/longhorn/longhorn-manager/blob/master/metrics_collector/node_collector.go#L201). @PhanLe1010, do you know why it collects metrics for only the current node?

derekbit commented 1 year ago

@tomwiggers @frit0-rb Can you check the documentation at https://longhorn.io/docs/1.4.1/monitoring/integrating-with-rancher-monitoring/ rather than getting the metrics from http://longhorn-backend.longhorn-system:9500/metrics? Thank you.

tomwiggers commented 1 year ago

@derekbit I don't use Rancher to manage the cluster and don't use Rancher monitoring so I don't have the monitoring.coreos.com/v1 API.

It seems that the ServiceMonitor used in the example targets each manager separately. If that works, then we need to do that instead of using the service, which load-balances requests over the manager pods. Would this not require a change in the k8s manifests and Helm charts to not create a service for this (as we cannot use it for metrics anyway)?

rfpludwick commented 1 year ago

I'm in the same boat here as @tomwiggers - I'd like to see all metrics exposed without having to install Rancher, thanks.

mplab-casa commented 6 months ago

Hi! Is there any news? I need to collect all metrics without using Rancher. Thanks

ahmedhassanahmedwasfy commented 2 months ago

I'm facing the same issue: I receive metrics for only one node although I have three. All nodes look fine in the Longhorn dashboard, but the metrics show only one of them.

UPDATE: Fixed the issue by not scraping the service from Prometheus and instead adding these annotations to the Helm chart values, so that Prometheus scrapes the pods themselves:

annotations:
  prometheus.io/path: "/metrics"
  prometheus.io/port: "9500"
  prometheus.io/scrape: "true"
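
For anyone copying this: those prometheus.io/* annotations only take effect if the Prometheus scrape configuration actually discovers pods and relabels based on them. A minimal sketch of such a job, assuming a plain Prometheus config that uses kubernetes_sd_configs (the job name is illustrative; with kube-prometheus-stack a PodMonitor/ServiceMonitor is the equivalent):

scrape_configs:
  - job_name: longhorn-manager-pods            # illustrative name
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - longhorn-system
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # use the annotated metrics path (prometheus.io/path)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # rewrite the scrape address to the annotated port (prometheus.io/port, 9500 here)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
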
PhanLe1010 commented 2 months ago

Hello everyone, have you tried to create a ServiceMonitor like the one in https://longhorn.io/docs/1.6.0/monitoring/integrating-with-rancher-monitoring/#add-longhorn-metrics-to-the-rancher-monitoring-system ?

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager

Then set up the Prometheus instance to scrape that ServiceMonitor? This will make sure Prometheus collects the metrics from ALL Longhorn nodes.
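
For completeness, a rough sketch of a Prometheus custom resource that would pick up that ServiceMonitor (the name and the prometheus ServiceAccount below are assumptions; adjust RBAC and sizing to your cluster):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: longhorn-prometheus           # illustrative name
  namespace: longhorn-system          # same namespace as the ServiceMonitor, so the default namespace selector matches it
spec:
  replicas: 1
  serviceAccountName: prometheus      # assumes a ServiceAccount with the usual Prometheus Operator RBAC
  serviceMonitorSelector:
    matchLabels:
      name: longhorn-prometheus-servicemonitor
  resources:
    requests:
      memory: 400Mi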

Another approach, as @ahmedhassanahmedwasfy mentioned above, is telling Prometheus to scrape the longhorn-manager pods directly by adding these annotations:

annotations:
  prometheus.io/path: "/metrics"
  prometheus.io/port: "9500"
  prometheus.io/scrape: "true"

to the longhorn-manager DaemonSet by setting this value: https://github.com/longhorn/longhorn/blob/21a538d10198746515de9e0c0f87ccf660738393/chart/values.yaml#L487-L488
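
As a rough sketch only (the exact parent key for that annotations value depends on the chart version, so verify it against the values.yaml linked above; longhornManager below is my assumption):

# hypothetical values.yaml override -- confirm the key path against the linked values.yaml
longhornManager:                      # assumed parent key
  annotations:                        # pod annotations that Prometheus pod discovery can pick up
    prometheus.io/path: "/metrics"
    prometheus.io/port: "9500"
    prometheus.io/scrape: "true"

Applying the override (for example with helm upgrade -f) should roll the longhorn-manager pods with the new annotations.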