Closed ivomarino closed 9 months ago
Based on the Prometheus metrics you provided above, it appears that you were requesting metrics from the Velero pod, but the pod volume metrics are collected by the node agent pod. Try pointing the Prometheus server at the node agent pod instead.
Please note that there was an issue with the Prometheus server running in the node agent pod: https://github.com/vmware-tanzu/velero/issues/6792. It was fixed recently, so please use the main image to verify.
@mpryc FYI
There is a bug with the node agent's metrics port, but even when we query the node agent's metrics manually with curl, the response contains many metrics and none of them relate to the data mover (https://github.com/vmware-tanzu/velero/blob/b85dc271efe28a31a0a2886f3e7cbc83e7219ecd/pkg/metrics/metrics.go#L69C1-L75)
Node agent response : node_agent_metrics.txt
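A quick way to double-check a saved response like this is to list the metric names it contains and filter by prefix. The sketch below is only an illustration, not part of Velero: it assumes the file holds the standard Prometheus text exposition format, the helper names `metric_names` and `names_with_prefix` are hypothetical, and the `podVolume_` prefix is taken from the dashboard query quoted in this issue.

```python
import re

# Matches the metric name at the start of a sample line in the Prometheus
# text exposition format, e.g. `name{label="x"} 1` or `name 1`.
_SAMPLE_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def metric_names(exposition_text):
    """Return the set of metric names present in a /metrics response body."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def names_with_prefix(exposition_text, prefix):
    """Return the sorted metric names that start with `prefix`."""
    return sorted(n for n in metric_names(exposition_text) if n.startswith(prefix))
```

For example, reading `node_agent_metrics.txt` and calling `names_with_prefix(text, "podVolume_")` returns an empty list if the node agent exposes no pod volume or data mover series at all.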
thanks @fredgate, any chance this will be fixed soon?
Yes, will look into it and fix it in v1.13 if there is any issue with it.
@fredgate @ivomarino After further investigation, I determined that this issue was still being caused by https://github.com/vmware-tanzu/velero/issues/6792. Specifically, if the node agent server does not properly expose the Prometheus ports, then the Prometheus server will be unable to scrape metrics from the node agent.
The PR to fix this issue has already been cherry-picked into the release-1.12.2 branch. Once the release-1.12.2 version is published, you should be able to verify that this port exposure problem has been resolved.
As you can see below, the metrics are correctly collected by Prometheus server in my local env:
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. If a Velero team member has requested log or more information, please provide the output of the shared commands.
This issue was closed because it has been stalled for 14 days with no activity.
I have a question: should the nodeAgentPodMonitor value be enabled to scrape those metrics?
We see data in Grafana for backups but not for the DataMover part:
The DataMover is working fine and copies data to S3 using kopia:
It seems this part of the dashboard expects metrics like
round(sum(increase(podVolume_data_upload_success_total{node=~"$fsb_node"}[1h])))
which we don't have in Prometheus. These are the Helm values so far for Velero:
and this is what we get in Prometheus so far:
Thanks for any hints
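For anyone hitting the same symptom, one way to confirm whether Prometheus holds the series the dashboard expects is to run an instant query against the Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch under stated assumptions: the base URL is whatever address your Prometheus server is reachable at, and the helper names `build_query_url`, `has_series` and `metric_exists` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def has_series(response_body):
    """Return True if an instant-query response contains at least one sample."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"]) > 0


def metric_exists(base_url, metric_name, timeout=10):
    """Query Prometheus for `metric_name` and report whether any series exist."""
    url = build_query_url(base_url, metric_name)
    with urlopen(url, timeout=timeout) as resp:
        return has_series(resp.read().decode("utf-8"))
```

Calling `metric_exists("http://localhost:9090", "podVolume_data_upload_success_total")` (the address is an assumption) returns `False` when the series has never been scraped, which matches the empty dashboard panels described above.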