vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.58k stars 1.39k forks source link

Backup log getting corrupted consistently #5115

Closed vineetsingh5 closed 1 year ago

vineetsingh5 commented 2 years ago

What steps did you take and what happened: [A clear and concise description of what the bug is, and what commands you ran.)

I am creating cluster backup including pv using velero(v1.9.0) + csi (velero-plugin-for-csi:v0.3.0). Backup is getting successful but backup log format if getting corrupted with error invalid compressed data--crc. I am using plugin (https://github.com/vmware-tanzu/velero-plugin-example/blob/main/internal/plugin/objectstoreplugin.go ) mention here in place of s3 or blob storage Same plugin was working with velero 1.8.x version but started failing post velero upgrade

I am able to extract other stored file except backup logs so this should not be plugin issue

I am also seeing this error message level=debug msg="received EOF, stopping recv loop" cmd=/velero err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio

What did you expect to happen:

Backup log file should be compress with proper format and not throw above error

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

backup.log

laugmanuel commented 2 years ago

I'm using Velero 1.9.0 as well but together with Velero-Plugin-for-AWS 1.5.0 with exactly the same problem: velero backup logs <name> and velero backup describe <name> --details hangs endlessly. When enabling Velero debug logging, I see the following:

velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="Getting DownloadRequest" controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/controller/download_request_controller.go:58"
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="looking for plugin in registry" controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:113" name=velero.io/aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="creating new restartable plugin process" command=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 kind=ObjectStore logSource="pkg/plugin/clientmgmt/manager.go:128" name=velero.io/aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="starting plugin" args="[/plugins/velero-plugin-for-aws --log-level debug]" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="plugin started" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=819
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="waiting for RPC address" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="Setting log level to DEBUG" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.7.1/pkg/plugin/framework/server.go:199" pluginName=velero-plugin-for-aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="using plugin" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" version=2
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="plugin address" address=/tmp/plugin3323864385 cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-aws
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="waiting for stdio data" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="received EOF, stopping recv loop" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="plugin process exited" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=819
velero-65fcb9b594-s7jh4 velero time="2022-07-12T06:33:24Z" level=debug msg="plugin exited" cmd=/plugins/velero-plugin-for-aws controller=download-request downloadRequest=velero/default-20220708000006-993f44ce-d7d2-4d15-8f0b-21e2e2f293b1 logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75"

With Velero 1.8.1 and AWS-Plugin 1.4.1 everything worked fine.

reasonerjt commented 2 years ago

@vineetsingh5 so you mean you are using a customized plugin to download the logs? Please check the go.mod history, b/c we bumped up some dependencies of the plugin framework, I feel that may have caused this problem.

@laugmanuel I used this command when I was testing velero v1.9 and I don't recall I saw any problems like this. Could you clarify if you installed velero using the official build? Could you please also check if this problem only happens to backup you created using velero v1.8.x or also in v1.9.0?

vineetsingh5 commented 2 years ago

@reasonerjt yes I am using example plugin here https://github.com/vmware-tanzu/velero-plugin-example/blob/main/internal/plugin/objectstoreplugin.go I checked other files like podvolumebackups.json.gz/resource-list.json.gz .. and only log file is getting corrupted

Also, I checked go.mod file and match all dependencies with dependencies available in lated velero go.mod file. Also I had regenerated go.sum file using command go clean -modcache, go mod tidy . But I am seeing same message.

I have attached both file for ref

tmp.zip

laugmanuel commented 2 years ago

@reasonerjt, the same command also fails for backups created with Velero v1.9.0. We are running the official images (velero/velero:v1.9.0 and velero/velero-plugin-for-aws:v1.5.0). I just had a look into the target S3 bucket and checked the latest logs file. After gunzip, I can read the file and it seems to be valid and contain all relevant information (log of every resource backed up). It also ends with Backed up a total of 2534 items and seems to be complete.

EDIT: We can ignore my report as I found the issue: the CRDs were not properly updated. Creating backups still worked fine, but getting the logs did not. After applying the CRDs again and checking the result, I can retrieve logs fine. Sorry for the confusion.

rajivml commented 2 years ago

any thoughts on how to get away with this, only the backup log is getting corrupted rest all logs were intact

reasonerjt commented 2 years ago

@vineetsingh5 It's difficult for me to debug b/c you are using a custom plugin on your end. The error message looks very much like you are using a different version of go-plugin library than what's used in velero v1.9 (we used to see it in our development as well)

If you have verified that this change in commit https://github.com/vmware-tanzu/velero-plugin-for-aws/commit/ec7bdd82687fb4b8fbdad6940c6626c3546d665b has been made to your own plugin, could you please try to reproduce the problem on the fresh install combination of velero v1.9 and velero-plugin-for-aws v1.5?

vineetsingh5 commented 2 years ago

@reasonerjt versions changes in this commit https://github.com/vmware-tanzu/velero-plugin-for-aws/commit/ec7bdd82687fb4b8fbdad6940c6626c3546d665b and plugins are same.

I will try other setting and will get back to you

vineetsingh5 commented 2 years ago

@reasonerjt

  1. When is used plugin velero-plugin-for-aws v1.5 and velero v1.9 with minio, I can see log file as well as I can read log using velero backup logs <backup-name>
  2. I also tested velero v1.9 with plugin available https://github.com/replicatedhq/local-volume-provider, I can read log details using velero backup logs <backup-name> but when I tried to check log file stored on NFS server, I got the same error gzip: test02-logs.gz: invalid compressed data--crc error. Not sure if velero used any specific gunzip version or some other format Also I am seeing this message An error occurred: gzip: invalid checksum at the end of log while running velero backup logs <backup-name> Also I can still see this message in velero pod log level=debug msg="received EOF, stopping recv loop" cmd=/velero err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/logrus_adapter.go:75" pluginName=stdio
neelaruban commented 2 years ago

@reasonerjt

I am seeing similar errors with the latest velero/velero-plugin-for-aws:v1.2.0 and velero v1.9.0 and also i am unable to view logs for the backups , i have been always receiving the following error

 velero backup logs wwazirx-uat-eks-bkp-test
An error occurred: file not found

am i , missing anything here as the documentation is a bit very vague on this , also upgrading the loglevels to debug did not help either . I dont think the errors are emitted or its been failed to be captured during a backup failure

reasonerjt commented 2 years ago

@neelaruban You should not use velero/velero-plugin-for-aws:v1.2.0 with velero v1.9.0 https://github.com/vmware-tanzu/velero-plugin-for-aws#compatibility

@vineetsingh5 This plugin uses a different version of go-plugin library https://github.com/replicatedhq/local-volume-provider/blob/main/go.mod#L8 Where velero uses: https://github.com/vmware-tanzu/velero/blob/main/go.mod#L24

chrislinan commented 2 years ago

Hi @reasonerjt I met the same issue, When I try to get logs of a backup, it's corrupted. I am using velero 1.9.0, and velero-plugin-for-aws 1.5.0.

I have no name!@velero-78dc599558-5mbtp:/> ./velero version
Client:
        Version: 1.9.0
        Git commit: 8246a842a38c6312a8b770833958a858c1ac32b1-
Server:
        Version:

when I get logs, the command hangs, and no result shows:

I have no name!@velero-78dc599558-5mbtp:/> ./velero backup logs velero-backup-obj-20220725054542
^C

But the describe command works fine:

I have no name!@velero-78dc599558-5mbtp:/> ./velero backup describe velero-backup-obj-20220725054542
Name:         velero-backup-obj-20220725054542
Namespace:    velero
Labels:       app.kubernetes.io/name=velero
              hc.sap.com/hc-provisioned=True
              helm.sh/chart=hc-disaster-recovery-1.3.3-155.7837e09
              velero.io/schedule-name=velero-backup-obj
              velero.io/storage-location=default
.......

This happens after we change velero version from 1.8 to 1.9. This also happens on restore :

I have no name!@velero-78dc599558-5mbtp:/> ./velero restore get
NAME                                              BACKUP                             STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
velero-backup-obj-20220728081216-20220728084411   velero-backup-obj-20220728081216   Completed   2022-07-28 08:44:11 +0000 UTC   2022-07-28 08:44:26 +0000 UTC   0        4          2022-07-28 08:44:11 +0000 UTC   <none>
I have no name!@velero-78dc599558-5mbtp:/> ./velero restore logs velero-backup-obj-20220728081216-20220728084411
<command hangs here>

the describe command also hangs for restore, although it works for backup. When I try to get plugins, it shows nothing:

I have no name!@velero-78dc599558-5mbtp:/> ./velero plugin get
I have no name!@velero-78dc599558-5mbtp:/> 

And when I check velero log, I found something strange:

time="2022-07-28T10:21:37Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-07-28T10:21:37Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:172"
I0728 10:22:05.756648       1 request.go:665] Waited for 1.044646696s due to client-side throttling, not priority and fairness, request: GET:https://api.i302530-haas.hc-dev.internal.canary.k8s.ondemand.com:443/apis/node.k8s.io/v1?timeout=32s
time="2022-07-28T10:22:12Z" level=info msg="Processing new ServerStatusRequest" controller=server-status-request logSource="pkg/controller/server_status_request_controller.go:89" phase= serverStatusRequest=velero/velero-cli-rsklz
time="2022-07-28T10:22:30Z" level=info msg="Processing new ServerStatusRequest" controller=server-status-request logSource="pkg/controller/server_status_request_controller.go:89" phase= serverStatusRequest=velero/velero-cli-nvn7z
time="2022-07-28T10:22:37Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-07-28T10:22:37Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-07-28T10:22:37Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:172"
time="2022-07-28T10:23:00Z" level=info msg="Processing new ServerStatusRequest" controller=server-status-request logSource="pkg/controller/server_status_request_controller.go:89" phase= serverStatusRequest=velero/velero-cli-2cbwd
time="2022-07-28T10:23:09Z" level=info msg="Processing new ServerStatusRequest" controller=server-status-request logSource="pkg/controller/server_status_request_controller.go:89" phase= serverStatusRequest=velero/velero-cli-2ht77
time="2022-07-28T10:23:12Z" level=info msg="Processing new ServerStatusRequest" controller=server-status-request logSource="pkg/controller/server_status_request_controller.go:89" phase= serverStatusRequest=velero/velero-cli-ht7pg
t

An error message(level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1)) goes after this log: BackupStorageLocations is valid, marking as available. Could you please help with this? Thanks a lot.

reasonerjt commented 2 years ago

@chrislinan I don't think you are seeing the same issue as @vineetsingh5 It more looks like a CRD issue like mentioned in https://github.com/vmware-tanzu/velero/issues/5115#issuecomment-1182970565

The error in the log also seemed to be caused by CRD in-consistent.

If you wanna upgrade without uninstalling velero, please make sure the CRDs are also updated: https://velero.io/docs/v1.9/upgrade-to-1.9/

If you continue seeing issue I would suggest you open a new issue, b/c the error you see is different from OP, we don't want to discuss different problems under one issue.

chrislinan commented 2 years ago

Hi, @reasonerjt Thanks for your help. After I upgrade CRD to the latest version, the issue is resolved.