Velero Metrics Mostly Zero and Prometheus Metrics Incorrectly Functioning

vladislav-curvetech commented 4 months ago

I am experiencing issues with Velero where most of the metrics are always zero, and basic Prometheus metrics are not functioning correctly. This issue significantly affects our ability to monitor the backup status and reliability.

A few problematic metrics:

velero_backup_failure_total
velero_backup_items_errors
velero_backup_partial_failure_total
velero_backup_warning_total
velero_backup_attempt_total

These metrics are crucial for us to monitor the health and status of our backup operations, but they consistently report zero values, which is not accurate.

Expected Behavior: The above metrics should provide accurate and non-zero values reflecting the actual state of Velero backups.

Environment:

Velero version: 1.13.0
Kubernetes version: 1.28
Cloud provider: AWS EKS

Additional Context: Any insights or solutions to this issue would be greatly appreciated as these metrics are critical for our backup monitoring and alerting.

Thank you for your assistance!

komljen commented 3 months ago

Did you find the issue?

kaovilai commented 3 months ago

Check for any pod restarts. IIUC, these metrics are counters that increment only as new backups/restores are processed. On startup, Velero does not list the backups that already exist in order to rebuild the attempt/failure totals.
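
To illustrate why a restart makes the raw values look like zero, here is a minimal sketch of a reset-aware increase over a series of scraped counter values. The scrape values are made up for illustration. The function only mirrors the general idea behind PromQL's increase() (any drop in a counter is treated as a reset) and leaves out the extrapolation Prometheus applies at the window edges.

```python
def reset_aware_increase(samples):
    """Sum the growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value >= prev:
                # Normal case: the counter grew (or stayed flat) between scrapes.
                total += value - prev
            else:
                # The counter dropped, so the process restarted from zero;
                # everything counted since the restart is new growth.
                total += value
        prev = value
    return total


# Hypothetical scrapes of velero_backup_failure_total; the pod restarts
# after the fourth scrape, so the raw value falls back to 0.
scrapes = [0, 1, 2, 2, 0, 0, 1]

print("last raw value:", scrapes[-1])
print("increase across the restart:", reset_aware_increase(scrapes))
```

Alerting on the increase over a time window rather than on the raw counter value is therefore less sensitive to pod restarts, although failures that happen between the last scrape and the restart are still lost.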

vladislav-curvetech commented 2 months ago

Thank you, Kaovilai, for your participation. Yes, the pod sometimes restarts for an unknown reason. Before the restart, I see just one warning:

level=warning msg="active indexes ....blabla.....12b-c1] deletion watermark 2024-08-10 20:30:46 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error.

Did I understand correctly that if the pod restarts, the metrics important to me are reset?

kaovilai commented 2 months ago

yes

kaovilai commented 2 months ago

One reason for this design: Velero syncs backups from object storage (possibly written by a different cluster) into the cluster.

Many of those will have a status of Completed.

If the metrics counted every completed backup present in the cluster, they would overcount what this cluster has actually completed.
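
A small sketch of that overcount, under stated assumptions: the backup names are invented, and processed_here is a hypothetical flag used only for illustration (it is not a field on Velero's Backup resource). It compares counting by listing what is in the cluster with counting only what this instance processed.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    name: str
    phase: str
    processed_here: bool  # hypothetical: True if this Velero instance ran the backup


# Two backups were synced in from object storage; two were run locally.
backups = [
    Backup("daily-1", "Completed", processed_here=False),
    Backup("daily-2", "Completed", processed_here=False),
    Backup("daily-3", "Completed", processed_here=True),
    Backup("daily-4", "Failed", processed_here=True),
]


def count_by_listing(items):
    """Count every Completed backup visible in the cluster, synced or not."""
    return sum(1 for b in items if b.phase == "Completed")


def count_by_processing(items):
    """Count only Completed backups that this instance actually processed."""
    return sum(1 for b in items if b.phase == "Completed" and b.processed_here)


print("by listing:", count_by_listing(backups))
print("by processing:", count_by_processing(backups))
```

Counting by listing reports the synced backups as if this cluster had completed them, which is the overcount described above.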

vmware-tanzu / velero, issue #7951