Same thing happens with the ark_backup_failure_total metric.
Hi, I really love Velero. It does a really great job at its principal purpose: backups and restores.
But exposing nothing at all is better than exposing incorrect metrics. Could you either:
Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.
Velero still has this issue. I think it should be reopened.
The issue is still valid for the latest version!
@eleanor-millman can you re-open?
Sure!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Bump since this is still an issue.
Really interested in a fix for this issue as well, for years now. I'm considering writing a new exporter, but that would be counterproductive, and I lack the Golang skills to properly improve Velero's existing one with confidence.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there anyone with a possible fix? This keeps spamming our monitoring.
The workaround is still to delete the Velero pods. Be sure to do it when no backups are running.
up
Still a problem :(
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Closing the stale issue.
@eleanor-millman this issue is still an open bug - can you re-open it and ask the stale-bot not to close it just because no one has solved it yet?
We have the same problem, please reopen
Just faced the same issue. Deleting the pod works, but it's a workaround; can it be reopened?
Currently, it is impossible to effectively monitor the status of backups. This is an important thing that should be fixed.
@Neurobion, since the original problem was reported quite a while ago, would you mind describing it again after testing with the latest Velero? We at CloudCasa will see if we can provide a fix if the problem still exists.
I use tag v1.11.0 + velero-plugin-for-microsoft-azure:v1.6.0 and deploy through Helm. Everything works as it should, but after redeploying or deleting a specific schedule, this metric still reports the old schedule, which is not correct for monitoring purposes. I only need backup information for existing schedules.
Jiris-MacBook-Pro:velero xx$ velero get backup
NAME                                  STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
man-backup                            Completed   0        0          2023-06-21 15:51:17 +0200 CEST   22d       default            <none>
velero-1687355039-XX-20230627103052   Completed   0        0          2023-06-27 12:30:52 +0200 CEST   28d       default            <none>
velero-1687355039-XX-20230626103051   Completed   0        0          2023-06-26 12:30:51 +0200 CEST   27d       default            <none>
velero-XX-pv-20230628103002           Completed   0        0          2023-06-28 12:30:02 +0200 CEST   29d       default            <none>
Jiris-MacBook-Pro:velero xx$ velero get schedules
NAME           STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-XX-pv   Enabled   2023-06-28 09:32:51 +0200 CEST   30 10 * * *   720h0m0s     22h ago       <none>     false
Querying velero_backup_last_successful_timestamp returns:
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-1687355039-XX", service="velero"} | 1687861964
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-XX-pv", service="velero"} | 1687948312
In my opinion, velero-1687355039-XX should not be there, because the schedule no longer exists. And with alerting set up as (time() - velero_backup_last_successful_timestamp) / 60 / 60 > 24, it doesn't work as it should.
PS: If that is not the goal of this metric, then another metric would be useful that only takes into account backups from the current list of schedules.
Agree with @Neurobion, I set up alerts recently for Velero backups, just to find out that they don't work as expected. We need a fix, as these metrics are not reliable for monitoring purposes.
I've been looking for a workaround to this issue for hours. I checked the code (be aware I can barely understand Golang) and AFAIK this problem can be attributed to the backup finalizer controller; these lines in particular:
backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
    backup.Status.Phase = velerov1api.BackupPhaseCompleted
    r.metrics.RegisterBackupSuccess(backupScheduleName)
    r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
    backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
    r.metrics.RegisterBackupPartialFailure(backupScheduleName)
    r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
}
backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)
It seems that the finalizer controller is the starting point when it comes to backup deletion. Backup CRs get marked with a finalizer and subsequently a deletionTimestamp in Kubernetes, and then Velero starts the backup deletion process. The backup deletion itself could still be unsuccessful or get stuck somehow (this happened to me just yesterday), but the controller will still mark what is actually a deletion attempt as a successful backup. It doesn't make sense.
It may very well be that they did this instead of implementing new metrics and updating the Kubernetes CRDs to support these new backup status phases. Nonetheless, because this is the current behavior, the metrics being updated by the finalizer controller are rendered essentially useless.
In my opinion, it would be great if the team could release a hotfix deleting the lines pointed out above, as most of us desperately need this fixed to be able to set up alerts based on these metric sets. If they are accepting pull requests, some of us would be willing to contribute to get this fixed as soon as possible.
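For illustration only, here is a minimal sketch of an alternative to deleting those lines outright: guard the metric registration so a backup carrying a deletionTimestamp is not counted as a success. Using the deletionTimestamp as the signal is my assumption, not the Velero team's actual approach, and this mirrors the excerpt above rather than being a complete patch:

backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
// Assumption: a Backup that already carries a deletionTimestamp is a deletion
// attempt, so its phase can still be updated without registering metrics.
beingDeleted := backup.GetDeletionTimestamp() != nil
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
    backup.Status.Phase = velerov1api.BackupPhaseCompleted
    if !beingDeleted {
        r.metrics.RegisterBackupSuccess(backupScheduleName)
        r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
    }
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
    backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
    if !beingDeleted {
        r.metrics.RegisterBackupPartialFailure(backupScheduleName)
        r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
    }
}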
Thanks @Neurobion. We (at CloudCasa) will try to fix the issue.
Currently, Velero does not clear the Prometheus counter after the schedule gets deleted. In the schedule's reconciler, we can add the following code to delete the counters, but that's not sufficient, because these counters are also updated from the backup reconcilers.
if c, ok := m.metrics[backupAttemptTotal].(*prometheus.CounterVec); ok {
    c.DeleteLabelValues(scheduleName)
}
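Extending that idea, a hedged sketch of a more complete cleanup helper on the metrics struct, run when a Schedule is deleted. The method name RemoveSchedule and the metric key constants other than backupAttemptTotal are assumptions about Velero's internals, not existing API:

// Hypothetical helper (a sketch, not existing Velero code): remove every
// per-schedule series from the counter and gauge vectors.
func (m *ServerMetrics) RemoveSchedule(scheduleName string) {
    keys := []string{
        backupAttemptTotal,
        backupSuccessTotal,            // assumed key names for the other
        backupPartialFailureTotal,     // per-schedule vectors
        backupFailureTotal,
        backupLastSuccessfulTimestamp,
    }
    for _, key := range keys {
        switch v := m.metrics[key].(type) {
        case *prometheus.CounterVec:
            v.DeleteLabelValues(scheduleName)
        case *prometheus.GaugeVec:
            v.DeleteLabelValues(scheduleName)
        }
    }
}

As the comment above notes, this alone would not be sufficient: the backup reconcilers would also need to avoid re-registering series for schedules that no longer exist.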
Hello @draghuram, any news on this issue?
@nilesh-akhade created a PR that we are internally reviewing before submitting to Velero. Will post here when it is ready.
What steps did you take and what happened:
Create a schedule, let it perform a few backups, then delete the schedule.
The /metrics endpoint keeps reporting the velero_backup_success_total{schedule=""} metric, even after all the backups have expired, until velero itself is restarted.
What did you expect to happen:
The metric should vanish when the schedule is deleted or after all its backups have expired, or it should stay around indefinitely (which I wouldn't prefer). In any case, the fact that it goes away when the Velero pod is restarted suggests that this is an artifact of the implementation rather than desired behaviour, and it does make it harder to implement certain alerting rules.
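For context on why a pod restart makes the stale series disappear: the series lives only in the Velero process's in-memory Prometheus registry. A minimal standalone sketch (plain prometheus client library, not Velero code; the schedule name is just an example) showing that a label series sticks around until it is explicitly deleted or the process restarts:

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    reg := prometheus.NewRegistry()

    // Same shape as Velero's per-schedule counter: one series per schedule label.
    backupSuccess := prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "velero_backup_success_total", Help: "demo"},
        []string{"schedule"},
    )
    reg.MustRegister(backupSuccess)

    // Simulate a few successful backups for a schedule.
    backupSuccess.WithLabelValues("example-schedule").Add(3)
    fmt.Println("series exposed:", countSeries(reg)) // 1

    // Deleting the Schedule or Backup objects in the cluster does not touch this
    // in-memory series; only an explicit delete (or a process restart, which
    // rebuilds the registry from scratch) removes it.
    backupSuccess.DeleteLabelValues("example-schedule")
    fmt.Println("series exposed:", countSeries(reg)) // 0
}

// countSeries gathers the registry and counts the exposed series.
func countSeries(g prometheus.Gatherer) int {
    mfs, _ := g.Gather()
    n := 0
    for _, mf := range mfs {
        n += len(mf.GetMetric())
    }
    return n
}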