Closed julienroussel00 closed 1 year ago
@julienroussel00, backup_controller.go:281, there may be some network problems with uploading backup files to the object storage. you can check the bsl status with the command velero backup-location get
And you can upload the backup logs for me to track the issue with command velero bacup logs velero-backup6h-20220120171346
Also if you can provide the velero pod logs is much better with command 'kubectl logs -n kube-system velero-xxx`
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Close for now. If any update, feel free to reopen
Hi, I believe we have the same problem:
.... A LOT OF info-msg..... AND AT THE END: time="2022-07-27T00:02:22Z" level=info msg="Skipping item because it's already been backed up." backup=velero/velero-full-backup-20220727000056 logSource="pkg/backup/item_backupper.go:117" name=policyreports.wgpolicyk8s.io namespace= resource=customresourcedefinitions.apiextensions.k8s.io time="2022-07-27T00:02:22Z" level=info msg="Backed up a total of 6860 items" backup=velero/velero-full-backup-20220727000056 logSource="pkg/backup/backup.go:405" progress= time="2022-07-27T00:02:22Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20220727000056 logSource="pkg/controller/backup_controller.go:720" time="2022-07-27T00:02:23Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220727000056 controller=backup logSource="pkg/controller/backup_controller.go:730" time="2022-07-27T00:02:23Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unknown desc = EOF, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin1425071790: connect: connection refused\"]" key=velero/velero-full-backup-20220727000056 logSource="pkg/controller/backup_controller.go:298"
Here our schedule where you can see when it happens:
velero-full-backup-20220727000056 Failed 0 0 2022-07-27 02:00:56 +0200 CEST 13d default <none> velero-full-backup-20220726000055 Completed 0 0 2022-07-26 02:00:55 +0200 CEST 12d default <none> velero-full-backup-20220725000054 Completed 0 0 2022-07-25 02:00:55 +0200 CEST 11d default <none> velero-full-backup-20220724000053 Failed 0 1 2022-07-24 02:00:53 +0200 CEST 10d default <none> velero-full-backup-20220723000052 PartiallyFailed 1 0 2022-07-23 02:00:53 +0200 CEST 9d default <none> velero-full-backup-20220722000051 Completed 0 0 2022-07-22 02:00:52 +0200 CEST 8d default <none> velero-full-backup-20220721000047 Completed 0 0 2022-07-21 02:00:47 +0200 CEST 7d default <none> velero-full-backup-20220720000019 Completed 0 0 2022-07-20 02:00:19 +0200 CEST 6d default <none> velero-full-backup-20220719000045 Failed 0 0 2022-07-19 02:00:45 +0200 CEST 5d default <none> velero-full-backup-20220718000040 Failed 0 0 2022-07-18 02:00:40 +0200 CEST 4d default <none> velero-full-backup-20220717000024 Failed 0 0 2022-07-17 02:00:24 +0200 CEST 3d default <none> velero-full-backup-20220716000015 Failed 0 0 2022-07-16 02:00:15 +0200 CEST 2d default <none> velero-full-backup-20220715000014 Completed 0 0 2022-07-15 02:00:14 +0200 CEST 1d default <none> velero-full-backup-20220714000000 Failed 0 0 2022-07-14 02:00:00 +0200 CEST 18h default <none>
Output of velero backup-location:
velero backup-location get, seems good (we can see all the time in the logs, that it is successfully checked: NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT default azure velero-backups Available 2022-07-27 07:32:51 +0200 CEST ReadWrite true
Please let me know if you need more information.
I just found these lines in the logs:
time="2022-07-25T21:03:18Z" level=error msg="Error getting a backup store" backup-storage-location=velero/default controller=backup-storage-location error="rpc error: code = Unknown desc = azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/XXXXXXXXXXXX/providers/Microsoft.Storage/storageAccounts/XXXXXXlistKeys?%24expand=kerb&api-version=2019-06-01: StatusCode=403 -- Original Error: adal: Refresh request failed. Status Code = '403'. Response body: failed to acquire a token using the MSI VM extension, error: MSI not available\n Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F" error.file="/go/src/velero-plugin-for-microsoft-azure/velero-plugin-for-microsoft-azure/object_store.go:217" error.function=main.getStorageAccountKey logSource="pkg/controller/backup_storage_location_controller.go:126"
@blackpiglet @julienroussel00 can you please reopen this issue?
And another error message that could be related:
time="2022-08-02T00:02:05Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220802000031 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-08-02T00:02:05Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = error reading from server: EOF" key=velero/velero-full-backup-20220802000031 logSource="pkg/controller/backup_controller.go:298"
The backup-files in the (azure) storage-account are also missing.
Here some files are missing: This is a successful backup:
And again some more log-output of the schedule which failed tonight:
time="2022-08-04T00:01:24Z" level=info msg="Backed up a total of 2338 items" backup=velero/velero-full-backup-20220804000025 logSource="pkg/backup/backup.go:405" progress=
time="2022-08-04T00:01:24Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20220804000025 logSource="pkg/controller/backup_controller.go:720"
time="2022-08-04T00:01:25Z" level=info msg="Plugin process exited - restarting." backup=velero/velero-full-backup-20220804000025 logSource="pkg/plugin/clientmgmt/restartable_process.go:155"
time="2022-08-04T00:01:25Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-04T00:01:25Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-08-04T00:01:26Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220804000025 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-08-04T00:01:26Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection closed]" key=velero/velero-full-backup-20220804000025 logSource="pkg/controller/backup_controller.go:298"
Hi again,
I tried to reproduce the problem with avctivated debug-logging. After the 8 try I was successful:
backuptest1 Completed 0 0 2022-08-08 15:30:23 +0200 CEST 29d default
I attached the debug-logs from the last 3 tries. Hope this helps velero_debug.log.zip
@matthiasritter Sorry for late response. Recently, Velero team is busy with v1.10 feature developing and v1.9.1 bug fixing. What is the version of Azure plugin you are using with Velero v1.9.0? Looks like Azure plugin restarting caused backup failure.
Hi @blackpiglet, I'm a colleague of @matthiasritter. We run Azure plugin v1.5.0 and velero 1.9.0.
I thought this issue was caused by azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request
.
According to your recent uploaded log, looks like this issue happens intermittently after Azure plugin running for a while.
This reminds me another issue #4863. In that issue, CSI plugin also failed after running for a while, and the reason is the plugin framework package used by Velero's version didn't consume the os.Stdout and os.Stderr. After a while, eventually the CSI plugin was blocked since it reached the pipe's max capacity with the accumulated warning messages, but CSI plugin will not return error, so that part is different, but I think that is because the implementation is not same in the two plugins.
Although this issue was fixed in Velero, but Velero plugins still import the v1.7.1 version of Velero, so the fix is not included yet. I think need to update the plugins go.mod file to use new version of Velero.
After taking another look, I found the HarshiCorp go-plugin version is already updated for this version of Azure plugin. Currently I couldn't find the reason why plugin process exited during backup metadata uploading. @ywk253100 Could you give some insight?
Hi @blackpiglet @ywk253100 ,
Is there an update for this? The error still occurs with us.
failureReason: '[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3466263320: connect: connection refused"]'
Please feel free to let me know if you need more info.
Hi @blackpiglet @ywk253100 we are still seeing this error:
time="2022-10-10T00:01:26Z" level=info msg="Backed up a total of 2645 items" backup=velero/velero-full-backup-20221010000049 logSource="pkg/backup/backup.go:405" progress=
time="2022-10-10T00:01:26Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20221010000049 logSource="pkg/controller/backup_controller.go:720"
time="2022-10-10T00:01:28Z" level=info msg="Plugin process exited - restarting." backup=velero/velero-full-backup-20221010000049 logSource="pkg/plugin/clientmgmt/restartable_process.go:155"
time="2022-10-10T00:01:28Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-10-10T00:01:28Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-10-10T00:01:28Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20221010000049 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-10-10T00:01:28Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin586144313: connect: connection refused\"]" key=velero/velero-full-backup-20221010000049 logSource="pkg/controller/backup_controller.go:298"
Is there any update?
Is the backup ended up with Failed status? Could you post the backup's Status section?
Hi @blackpiglet , yes, as described in the log-output "backup completed" and then "backup failed":
velero-full-backup-20221010000049 Failed 0 0 2022-10-10 02:00:49 +0200 CEST 13d default <none>
❯ velero describe backup velero-full-backup-20221010000049
Name: velero-full-backup-20221010000049
Namespace: velero
Labels: app.kubernetes.io/instance=velero
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=velero
helm.sh/chart=velero-2.30.1
velero.io/schedule-name=velero-full-backup
velero.io/storage-location=default
Annotations: helm.sh/hook=post-install,post-upgrade,post-rollback
helm.sh/hook-delete-policy=before-hook-creation
velero.io/source-cluster-k8s-gitversion=v1.23.12
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=23
Phase: Failed (run `velero backup logs velero-full-backup-20221010000049` for more information)
Errors: 0
Warnings: 0
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: true
TTL: 336h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2022-10-10 02:00:49 +0200 CEST
Completed: 2022-10-10 02:01:26 +0200 CEST
Expiration: 2022-10-24 02:00:49 +0200 CEST
Total items to be backed up: 2645
Items backed up: 2645
Velero-Native Snapshots: 2 of 2 snapshots completed successfully (specify --details for more information)
I have a very similar problem when trying to make a backup of an entire aks cluster (over 14K items). But when I apply --include-namespaces
filter (3-5K items) the backup succeeds.
Looks like it's similar to #4158
1.11.0 (velero server) / 1.7.0 (azure plugin) / velero-aks-backup.log
@dkadetov
time="2023-05-10T05:23:36Z" level=debug msg="plugin process exited" backup=velero/daily-cluster-20230510-3 cmd=/plugins/velero-plugin-for-microsoft-azure error="signal: killed" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-microsoft-azure pid=86
This line is the reason why the backup failed.
Can you enlarge the resource allocated for the Velero pod, e.g. memory and CPU?
Usually, the signal killed
means the kubelet kills the process due to OOM, but we still don't know what makes the plugin consumes resources.
@blackpiglet The pod was allocated 512MB / 1CPU, but even at its peak it did not consume half that.
It's better that we can find some hints about why the process is killed.
Is there any information in the Velero pod? For example,
kubectl -n veleror describe pod <velero-pod-id>
Or we can try to find some trace on the Velero pod's allocated node. There should be some hints in the dmesg
command's output.
@blackpiglet It looks like you were right. The pod lacked resources. Due to the short duration of the loads, the monitoring system did not catch the actual consumption of resources in time. Increasing the resource request has added stability.
Close for now. If anything is still unclear, feel free to reopen.
What steps did you take and what happened: Let the schedule try to do a backup or run this command velero backup create xxxxx
What did you expect to happen: No backups with the status Failed
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle, and attach to this issue, more options please refer tovelero debug --help
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.) kubectl logs deployment/velero -n velero | grep level=error | grep velero-backup6h-20220120171346 time="2022-01-20T17:15:44Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin562653323: connect: connection refused\"]" key=velero/velero-backup6h-20220120171346 logSource="pkg/controller/backup_controller.go:281"
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
velero version
): 1.7.1velero client config get features
): features:kubectl version
): 1.20.12This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.