vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

velero_backup_success_total metric keeps being reported for deleted schedules #1333

Closed. multi-io closed this issue 1 year ago.

multi-io commented 5 years ago

What steps did you take and what happened:

Create a schedule, let it perform a few backups, then delete the schedule.

The /metrics endpoint keeps reporting the velero_backup_success_total{schedule=""} metric, even after all the backups have expired, until velero itself is restarted.

What did you expect to happen:

The metric should vanish when the schedule is deleted or after all its backups have expired, or it should stay around indefinitely (which I wouldn't prefer). In any case, the fact that it goes away when the velero pod is restarted suggests that this is an artifact of the implementation rather than desired behaviour, and it does make it harder to implement certain alerting rules.
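
For context on why the series lingers: the Prometheus Go client keeps every labelled child of a CounterVec around once it has been created, until the process restarts or the child is deleted explicitly with DeleteLabelValues. A minimal, self-contained sketch of that behaviour (illustrative only, not Velero code; the registry, metric, and label values here are made up):

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
    dto "github.com/prometheus/client_model/go"
)

// countSeries counts the individual labelled series across all gathered
// metric families.
func countSeries(mfs []*dto.MetricFamily) int {
    n := 0
    for _, mf := range mfs {
        n += len(mf.GetMetric())
    }
    return n
}

func main() {
    reg := prometheus.NewRegistry()
    backupSuccess := prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "velero_backup_success_total"},
        []string{"schedule"},
    )
    reg.MustRegister(backupSuccess)

    // Record a backup for a schedule that is later deleted.
    backupSuccess.WithLabelValues("nightly").Inc()

    // The labelled series is still exported even though nothing increments
    // it anymore; only an explicit delete (or a process restart) removes it.
    mfs, _ := reg.Gather()
    fmt.Println(countSeries(mfs)) // 1

    backupSuccess.DeleteLabelValues("nightly")
    mfs, _ = reg.Gather()
    fmt.Println(countSeries(mfs)) // 0
}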

maberny commented 5 years ago

Same thing happens with the ark_backup_failure_total metric.

h4wkmoon commented 4 years ago

Hi, I really love Velero. It does a really great job at its principal purpose: backups and restores.

But exposing nothing at all is better than incorrect metrics. Could you either:

Thanks.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

Closing the stale issue.

h4wkmoon commented 3 years ago

Velero still has this issue. I think it should be reopened.

sbkg0002 commented 2 years ago

The issue is still valid for the latest version!

@eleanor-millman can you re-open?

eleanor-millman commented 2 years ago

Sure!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sbkg0002 commented 2 years ago

Bump since this is still an issue.

guillaumefenollar commented 2 years ago

Really interested in a fix for this issue as well, and have been for years now. I've considered writing a new exporter, but that would be counterproductive, and I lack the Golang skills to confidently improve Velero's existing one.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sbkg0002 commented 2 years ago

Is there anyone with a possible fix? This keeps spamming our monitoring.

h4wkmoon commented 2 years ago

The workaround is still to delete the Velero pods. Be sure to do it while no backups are running.

LuckySB commented 2 years ago

up

KlavsKlavsen commented 2 years ago

Still a problem :(

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 1 year ago

Closing the stale issue.

KlavsKlavsen commented 1 year ago

@eleanor-millman this issue is still an open bug. Can you re-open it and ask stale-bot not to close it just because no one has solved it yet?

cwrau commented 1 year ago

We have the same problem, please reopen

lorenzomorandini commented 1 year ago

Just faced the same issue. Deleting the pod works, but it's only a workaround. Can this be reopened?

Neurobion commented 1 year ago

Currently, it is impossible to effectively monitor the status of backups. This is an important issue that should be fixed.

draghuram commented 1 year ago

@Neurobion, since the original problem was reported quite a while ago, would you mind describing it again after testing with the latest Velero? We at CloudCasa will see if we can provide a fix if the problem still exists.

Neurobion commented 1 year ago

I use tag v1.11.0 + velero-plugin-for-microsoft-azure:v1.6.0 and deploy through Helm. Everything works as it should, but after redeploying or deleting a specific schedule, this metric still reports it, which is not correct for monitoring purposes. I only need backup information for existing schedules.

Jiris-MacBook-Pro:velero xx$ velero get backup
NAME                                                        STATUS      ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
man-backup                                                  Completed   0        0          2023-06-21 15:51:17 +0200 CEST   22d       default            <none>
velero-1687355039-XX-20230627103052   Completed   0        0          2023-06-27 12:30:52 +0200 CEST   28d       default            <none>
velero-1687355039-XX-20230626103051   Completed   0        0          2023-06-26 12:30:51 +0200 CEST   27d       default            <none>
velero-XX-pv-20230628103002              Completed   0        0          2023-06-28 12:30:02 +0200 CEST   29d       default            <none>

Jiris-MacBook-Pro:velero xx$ velero get schedules
NAME                              STATUS    CREATED                          SCHEDULE      BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
velero-XX-pv   Enabled   2023-06-28 09:32:51 +0200 CEST   30 10 * * *   720h0m0s     22h ago       <none>     false

velero_backup_last_successful_timestamp returns:

velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-1687355039-XX", service="velero"} | 1687861964
velero_backup_last_successful_timestamp{endpoint="http-monitoring", instance="10.142.118.80:8085", job="velero", namespace="velero", pod="velero-69d5544f8f-pbwcc", schedule="velero-XX-pv", service="velero"} | 1687948312

In my opinion, velero-1687355039-XX should not appear there, because that schedule no longer exists.

And with alerting set up as ((time() - velero_backup_last_successful_timestamp) / 60 / 60 > 24), it doesn't work as it should.

PS: If that is not the goal of this metric, then it would be useful to have another metric that only takes into account backups from the current list of schedules.

jmuleiro commented 1 year ago

Agree with @Neurobion. I recently set up alerts for Velero backups, only to find out that they don't work as expected. We need a fix, as these metrics are not reliable for monitoring purposes.

jmuleiro commented 1 year ago

I've been looking for a workaround to this issue for hours. I checked the code (be aware I can barely understand Golang), and AFAIK this problem can be attributed to the backup finalizer controller; these lines in particular:

backupScheduleName := backupRequest.GetLabels()[velerov1api.ScheduleNameLabel]
switch backup.Status.Phase {
case velerov1api.BackupPhaseFinalizing:
    backup.Status.Phase = velerov1api.BackupPhaseCompleted
    r.metrics.RegisterBackupSuccess(backupScheduleName)
    r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusSucc)
case velerov1api.BackupPhaseFinalizingPartiallyFailed:
    backup.Status.Phase = velerov1api.BackupPhasePartiallyFailed
    r.metrics.RegisterBackupPartialFailure(backupScheduleName)
    r.metrics.RegisterBackupLastStatus(backupScheduleName, metrics.BackupLastStatusFailure)
}
backup.Status.CompletionTimestamp = &metav1.Time{Time: r.clock.Now()}
recordBackupMetrics(log, backup, outBackupFile, r.metrics, true)

It seems that the finalizer controller is the starting point when it comes to backup deletion. Backup CRDs get marked with a finalizer and subsequently a deletionTimestamp in Kubernetes, and then Velero starts the backup deletion process. The backup deletion itself could still be unsuccessful or get stuck somehow - this happened to me just yesterday - but the controller will still mark what is actually a deletion attempt as a successful backup. It doesn't make sense.

It may very well be that they did this instead of implementing new metrics and updating the Kubernetes CRDs to support these new backup status phases. Nonetheless, because this is the current behavior, the metrics being updated by the finalizer controller are rendered essentially useless.

In my opinion, it would be great if the team could release a hotfix deleting the lines pointed out above, as most of us desperately need this fixed to be able to set up alerts based on these metrics. If they are accepting pull requests, some of us would be willing to contribute to get this fixed as soon as possible.

draghuram commented 1 year ago

Thanks @Neurobion. We (at CloudCasa) will try to fix the issue.

nilesh-akhade commented 1 year ago

Currently, Velero does not clear the Prometheus counters after a schedule gets deleted. In the schedule's reconciler, we can add the following code to delete the counters, but that's not sufficient on its own, because these counters are also updated from the backup reconcilers.

if c, ok := m.metrics[backupAttemptTotal].(*prometheus.CounterVec); ok {
    c.DeleteLabelValues(scheduleName)
}
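
For illustration, here is a self-contained sketch of how per-schedule series could be dropped when a Schedule is deleted, using plain client_golang. ServerMetrics, the metric name constants, and RemoveSchedule below are stand-ins for the thread's discussion, not Velero's actual types or API; and, as noted above, this alone is not sufficient, because the backup reconcilers can recreate the series afterwards:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Stand-in metric names; the identifiers here are illustrative.
const (
    backupAttemptTotal            = "velero_backup_attempt_total"
    backupSuccessTotal            = "velero_backup_success_total"
    backupLastSuccessfulTimestamp = "velero_backup_last_successful_timestamp"
)

// ServerMetrics is a simplified stand-in holding collectors keyed by name.
type ServerMetrics struct {
    metrics map[string]prometheus.Collector
}

// RemoveSchedule deletes every per-schedule series so /metrics stops
// exporting data for a schedule that no longer exists. It would need to be
// called from the schedule reconciler when the Schedule object is deleted.
func (m *ServerMetrics) RemoveSchedule(scheduleName string) {
    for _, name := range []string{backupAttemptTotal, backupSuccessTotal, backupLastSuccessfulTimestamp} {
        switch v := m.metrics[name].(type) {
        case *prometheus.CounterVec:
            v.DeleteLabelValues(scheduleName)
        case *prometheus.GaugeVec:
            v.DeleteLabelValues(scheduleName)
        }
    }
}
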
jmuleiro commented 1 year ago

Hello @draghuram, any news on this issue?

draghuram commented 1 year ago

@nilesh-akhade created a PR that we are internally reviewing before submitting to Velero. Will post here when it is ready.