vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.78k stars 1.41k forks

GCP plugin exits silently and leaves orphan snapshots behind #5641

Closed tuusberg closed 1 year ago

tuusberg commented 1 year ago

What steps did you take and what happened:

We are running Velero in multiple environments and recently noticed that the number of GCP Compute Engine snapshots is constantly growing in one of them. Velero does not delete some of the snapshots, and we believe it is related to the plugin being restarted constantly:

time="2022-11-28T19:26:10Z" level=debug msg="looking for plugin in registry" controller=backup-sync kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:113" name=velero.io/gcp
time="2022-11-28T19:26:10Z" level=debug msg="creating new restartable plugin process" command=/plugins/velero-plugin-for-gcp controller=backup-sync kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:128" name=velero.io/gcp
time="2022-11-28T19:26:10Z" level=debug msg="starting plugin" args="[/plugins/velero-plugin-for-gcp --log-level trace]" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-gcp
time="2022-11-28T19:26:10Z" level=debug msg="plugin started" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-gcp pid=289
time="2022-11-28T19:26:10Z" level=debug msg="waiting for RPC address" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-gcp
time="2022-11-28T19:26:10Z" level=debug msg="Setting log level to TRACE" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.1/pkg/plugin/framework/server.go:199" pluginName=velero-plugin-for-gcp
time="2022-11-28T19:26:10Z" level=debug msg="plugin address" address=/tmp/plugin2936884976 cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-gcp
time="2022-11-28T19:26:10Z" level=debug msg="using plugin" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" version=2
time="2022-11-28T19:26:10Z" level=debug msg="waiting for stdio data" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
time="2022-11-28T19:26:10Z" level=debug msg="received EOF, stopping recv loop" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
time="2022-11-28T19:26:10Z" level=debug msg="plugin process exited" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-gcp pid=289
time="2022-11-28T19:26:10Z" level=debug msg="plugin exited" cmd=/plugins/velero-plugin-for-gcp controller=backup-sync logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"

What did you expect to happen:

That error message should not be there (we don't see it happening in other envs) and all snapshots should be deleted when their backups expire.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, please refer to velero debug --help
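For concreteness, a typical invocation looks like the following (the backup name is a placeholder):

```shell
# Generate a support bundle for a specific backup (name is a placeholder)
velero debug --backup my-backup-name

# See all available options
velero debug --help
```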

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

It only happens in one environment, all other envs are fine.

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

tuusberg commented 1 year ago

I forgot to add that we tried adding logging to the plugin by rebuilding its image, but it does not show any logs (we tried both fmt.Print and the logrus logger).

Lyndon-Li commented 1 year ago

@tuusberg Would you provide Velero log bundle? We need more info to troubleshoot the problem. Thanks.

blackpiglet commented 1 year ago

@tuusberg Plugin logs are printed into the Velero server's log. Just use the velero debug command to collect all the information needed for debugging, then upload the generated tarball file.

If the plugin failed to clean up snapshots, that should happen during backup deletion. Please check whether there is any error information in the DeleteBackupRequest's status.
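For reference, one way to inspect that status is via the DeleteBackupRequest custom resources Velero creates during deletion (the request name below is a placeholder; adjust the namespace if Velero is installed elsewhere):

```shell
# List delete requests in the Velero namespace
kubectl -n velero get deletebackuprequests

# Show the phase and any errors recorded for a specific request
# (<request-name> is a placeholder)
kubectl -n velero get deletebackuprequest <request-name> \
  -o jsonpath='{.status.phase}{"\n"}{.status.errors}{"\n"}'
```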

tuusberg commented 1 year ago

hey @Lyndon-Li @blackpiglet. I am attaching a velero debug bundle: bundle-2022-12-05-15-03-42.tar.gz

tuusberg commented 1 year ago

We don't see any logs coming from the plugin because it constantly gets killed. The bundle also does not contain any plugin logs.

blackpiglet commented 1 year ago

Plugins do not run as daemon processes the way the Velero server does. They are binaries invoked by the Velero server. After a task finishes, the plugin exits, so logs about the plugin exiting are expected, and IMO that's normal. It doesn't mean the plugin process was killed by something like a crash.

I found some logs related to the GCP plugin in the debug bundle. Most of them are printed by the Go plugin framework github.com/hashicorp/go-plugin when the plugin exits. From the logs, I couldn't find anything that would make the GCP plugin fail to delete snapshots.

time="2022-12-05T18:03:53Z" level=debug msg="received EOF, stopping recv loop" backup=velero/velero-hourly-20221205180037 cmd=/velero err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
time="2022-12-05T18:03:53Z" level=debug msg="plugin process exited" backup=velero/velero-hourly-20221205180037 cmd=/velero logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/velero pid=54201
time="2022-12-05T18:03:53Z" level=debug msg="plugin exited" backup=velero/velero-hourly-20221205180037 cmd=/velero logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"
time="2022-12-05T18:03:53Z" level=debug msg="received EOF, stopping recv loop" backup=velero/velero-hourly-20221205180037 cmd=/plugins/velero-plugin-for-gcp err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
time="2022-12-05T18:03:53Z" level=debug msg="plugin process exited" backup=velero/velero-hourly-20221205180037 cmd=/plugins/velero-plugin-for-gcp logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-gcp pid=54210
time="2022-12-05T18:03:53Z" level=debug msg="plugin exited" backup=velero/velero-hourly-20221205180037 cmd=/plugins/velero-plugin-for-gcp logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"

tuusberg commented 1 year ago

@blackpiglet thank you for your response. We suspect that snapshots are left behind because of this error; the plugin gets restarted constantly only in this particular environment. Could it be the reason we are seeing hundreds of thousands of snapshots left untouched after Velero backups are deleted?

blackpiglet commented 1 year ago

No. I don't think these errors will make the plugin fail to delete the related snapshots. These logs are normal for plugin exit. There should be another reason if snapshots are not deleted during backup deletion, and if that is the case, some error should be printed when the backup is deleted.

Could you try to create a backup with snapshot, then delete it to verify whether related snapshots are left behind, and collect the debug bundle?
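A sketch of that verification (backup name, namespace, and GCP project are placeholders; the exact labels the plugin applies to snapshots may vary, so listing by creation time and inspecting manually also works):

```shell
# Create a one-off backup that takes volume snapshots (names are placeholders)
velero backup create orphan-test --include-namespaces my-app --snapshot-volumes

# Once it completes, delete it
velero backup delete orphan-test --confirm

# Collect the debug bundle covering the deletion
velero debug --backup orphan-test

# Then check in GCP whether snapshots created around the backup time remain
gcloud compute snapshots list --project my-project \
  --sort-by=~creationTimestamp --limit=20
```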

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.