vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

No prometheus metrics for DataMover #6954

Closed ivomarino closed 9 months ago

ivomarino commented 1 year ago

We see data in Grafana for backups but not for the DataMover part:

(two Grafana dashboard screenshots attached)

The DataMover is working fine and copies data to S3 using kopia:

kubectl -n velero get datauploads
NAME                                                  STATUS       STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
foo-stage-velero-monitoring-20231013101819-b99wr      InProgress   29m       768883947    13258881662   foo-dc1           29m     worker-641b01f2e8637
foo-stage-velero-bar-dev-20231013103233-9tjhm         InProgress   15m       4022582406   5891101065    foo-dc1           15m     worker-641b01f2e8637

It seems this part of the dashboard expects metrics like round(sum(increase(podVolume_data_upload_success_total{node=~"$fsb_node"}[1h]))), which we don't have in Prometheus. These are the Helm values for Velero so far:

        metrics:
          enabled: true
          scrapeInterval: 30s
          scrapeTimeout: 10s

          # podAnnotations:
          #   prometheus.io/scrape: "true"
          #   prometheus.io/port: "8085"
          #   prometheus.io/path: "/metrics"

          extraEnvVars:
            - name: velero.io/csi-volumesnapshot-class
              value: "true" 

          serviceMonitor:
            autodetect: true
            enabled: true

          nodeAgentPodMonitor:
            autodetect: true
            enabled: true

          prometheusRule:
            autodetect: true
            enabled: true

and this is what we get in Prometheus so far: (screenshot of the metrics currently available in Prometheus attached)
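
For reference, one way to list which Velero-related metric families Prometheus has actually ingested (a sketch; the velero_ and podVolume_ name prefixes are assumptions based on the dashboard query above and Velero's server metrics):

count by (__name__) ({__name__=~"velero_.*|podVolume_.*"})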

Thanks for any hints

allenxu404 commented 1 year ago

Based on the Prometheus metrics you provided above, it appears that you were requesting the metrics from the Velero pod, but the pod volume related metrics are collected by the node agent pod. So you could try pointing the Prometheus scrape target at the node agent pod instead.
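
A quick way to check what the node agent actually exposes (a sketch; it assumes the node agent DaemonSet is named node-agent in the velero namespace and serves metrics on port 8085, which should be the defaults):

# forward the node agent metrics port locally, then look for data mover metric families
kubectl -n velero port-forward ds/node-agent 8085:8085 &
curl -s http://localhost:8085/metrics | grep -E 'podVolume|data_upload|data_download'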

Please note that there is currently an issue with the Prometheus server running in the node agent pod: https://github.com/vmware-tanzu/velero/issues/6792. It has been fixed recently, so please use the main image to verify.
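
If you deploy via the Helm chart, overriding the image would look roughly like this (a sketch against the chart's image values; the main tag and pull policy are assumptions, adjust for your setup):

image:
  repository: velero/velero
  tag: main
  pullPolicy: Always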

weshayutin commented 1 year ago

@mpryc FYI

fredgate commented 1 year ago

There is a bug with the metrics port of the node agent, but even when we manually query the node agent's metrics with curl, we get a lot of metrics, just not the ones relating to the data mover (https://github.com/vmware-tanzu/velero/blob/b85dc271efe28a31a0a2886f3e7cbc83e7219ecd/pkg/metrics/metrics.go#L69C1-L75)

Node agent response: node_agent_metrics.txt

ivomarino commented 1 year ago

thanks @fredgate, any chance this will be fixed soon?

allenxu404 commented 1 year ago

Yes, I will look into it and fix it in v1.13 if there is an issue.

allenxu404 commented 11 months ago

@fredgate @ivomarino After further investigation, I determined that this issue was still being caused by https://github.com/vmware-tanzu/velero/issues/6792. Specifically, if the node agent server does not properly expose the Prometheus ports, then the Prometheus server will be unable to scrape metrics from the node agent.

The PR to fix this issue has already been cherry-picked into the release-1.12.2 branch. Once the release-1.12.2 version is published, you should be able to verify that this port exposure problem has been resolved.
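
Once you are on a build that contains the fix, a simple way to confirm the data mover metrics are being scraped is to query the metric that the dashboard expression quoted at the top of this issue relies on, for example:

sum by (node) (podVolume_data_upload_success_total)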

As you can see below, the metrics are correctly collected by the Prometheus server in my local environment:

(screenshot of Prometheus showing the node agent metrics attached)

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.

github-actions[bot] commented 9 months ago

This issue was closed because it has been stalled for 14 days with no activity.

WRKT commented 4 months ago

I have a question: does the value nodeAgentPodMonitor need to be enabled to scrape those metrics?