vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Metrics for backup repository maintenance job #7760

Open reasonerjt opened 5 months ago

reasonerjt commented 5 months ago

It would be a good enhancement to add metrics that track successful and failed backup repository maintenance jobs. This matters especially when a misconfiguration prevents the job from being created, because in that case the only trace is a log message in the Velero pod.

SwiftLion23 commented 3 months ago

To enhance the monitoring and reliability of our backup repository maintenance job for the streaming application, I propose adding metrics to track both successful and failed job executions. Currently, if there's a misconfiguration causing the job to fail, only a log message is generated within the Velero pod, making it difficult to promptly detect and address the issue. By implementing these metrics, we can achieve better visibility and faster response times to any issues that arise.

Specifically, I suggest the following:

Success Metrics: Track the number of successful job executions, indicating that the backup repository maintenance has been performed correctly.

Failure Metrics: Track the number of failed job executions to quickly identify and troubleshoot issues.

Failure Reasons: Capture and categorize common failure reasons, such as misconfigurations or connectivity issues, to provide detailed insights into the nature of the failures.

These metrics should be exposed to our monitoring system (e.g., Prometheus) to enable alerting and dashboarding. This will greatly improve our ability to ensure that backup repositories are maintained correctly for our streaming application and reduce the risk of data loss due to undetected failures. By having these metrics, we can ensure the continuous and reliable operation of our streaming services, maintaining high availability and performance for our users.
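
To make the three proposed metrics concrete, here is a minimal Go sketch of what the counters could look like. It uses only the standard library so it runs on its own; a real change would presumably register counters with the Prometheus Go client instead. The metric names (`velero_repo_maintenance_success_total`, `velero_repo_maintenance_failure_total`), the `repository` and `reason` labels, the reason values, and the `MaintenanceMetrics` type are all hypothetical and not taken from the Velero code base.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Hypothetical metric names; not existing Velero metrics.
const (
	successName = "velero_repo_maintenance_success_total"
	failureName = "velero_repo_maintenance_failure_total"
)

// Hypothetical failure-reason label values.
const (
	ReasonMisconfiguration = "misconfiguration"
	ReasonConnectivity     = "connectivity"
	ReasonUnknown          = "unknown"
)

type failureKey struct{ repo, reason string }

// MaintenanceMetrics counts maintenance job outcomes per repository.
type MaintenanceMetrics struct {
	mu      sync.Mutex
	success map[string]uint64
	failure map[failureKey]uint64
}

func NewMaintenanceMetrics() *MaintenanceMetrics {
	return &MaintenanceMetrics{
		success: map[string]uint64{},
		failure: map[failureKey]uint64{},
	}
}

// RecordSuccess increments the success counter for a repository.
func (m *MaintenanceMetrics) RecordSuccess(repo string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.success[repo]++
}

// RecordFailure increments the failure counter for a repository and reason.
// It should also be called when the job could not be created at all.
func (m *MaintenanceMetrics) RecordFailure(repo, reason string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failure[failureKey{repo, reason}]++
}

func (m *MaintenanceMetrics) Successes(repo string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.success[repo]
}

func (m *MaintenanceMetrics) Failures(repo, reason string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.failure[failureKey{repo, reason}]
}

// Expose renders the counters in a simplified form of the Prometheus
// text exposition format, with samples sorted for stable output.
func (m *MaintenanceMetrics) Expose() string {
	m.mu.Lock()
	defer m.mu.Unlock()

	var okLines, failLines []string
	for repo, n := range m.success {
		okLines = append(okLines, fmt.Sprintf("%s{repository=%q} %d", successName, repo, n))
	}
	for k, n := range m.failure {
		failLines = append(failLines, fmt.Sprintf("%s{repository=%q,reason=%q} %d", failureName, k.repo, k.reason, n))
	}
	sort.Strings(okLines)
	sort.Strings(failLines)

	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", successName)
	for _, l := range okLines {
		b.WriteString(l + "\n")
	}
	fmt.Fprintf(&b, "# TYPE %s counter\n", failureName)
	for _, l := range failLines {
		b.WriteString(l + "\n")
	}
	return b.String()
}

func main() {
	m := NewMaintenanceMetrics()
	m.RecordSuccess("repo-a")
	m.RecordFailure("repo-b", ReasonMisconfiguration)
	fmt.Print(m.Expose())
}
```

Keeping the failure reason as a label with a small fixed set of values, rather than the raw error string, would keep label cardinality bounded while still allowing an alert on, say, any increase in failures with `reason="misconfiguration"`.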