scheduled backups are failing 50% of the time connect: connection refused

julienroussel00 commented 2 years ago

What steps did you take and what happened: Let the schedule try to do a backup or run this command velero backup create xxxxx

What did you expect to happen: No backups with the status Failed

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.) kubectl logs deployment/velero -n velero | grep level=error | grep velero-backup6h-20220120171346 time="2022-01-20T17:15:44Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin562653323: connect: connection refused\"]" key=velero/velero-backup6h-20220120171346 logSource="pkg/controller/backup_controller.go:281"

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

Velero version (use velero version): 1.7.1
Velero features (use velero client config get features): features:
Kubernetes version (use kubectl version): 1.20.12
Cloud provider or hardware configuration: Azure Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now"

qiuming-best commented 2 years ago

@julienroussel00, backup_controller.go:281, there may be some network problems with uploading backup files to the object storage. you can check the bsl status with the command velero backup-location get

And you can upload the backup logs for me to track the issue with command velero bacup logs velero-backup6h-20220120171346 Also if you can provide the velero pod logs is much better with command 'kubectl logs -n kube-system velero-xxx`

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

blackpiglet commented 2 years ago

Close for now. If any update, feel free to reopen

matthiasritter commented 2 years ago

Hi, I believe we have the same problem:

Velero version 1.9.0 (with CRDs in version 1.9.0).
Azure Velero plugin (version 1.5.0)
Kubernetes version 1.21.14 (Azure)

.... A LOT OF info-msg..... AND AT THE END: time="2022-07-27T00:02:22Z" level=info msg="Skipping item because it's already been backed up." backup=velero/velero-full-backup-20220727000056 logSource="pkg/backup/item_backupper.go:117" name=policyreports.wgpolicyk8s.io namespace= resource=customresourcedefinitions.apiextensions.k8s.io time="2022-07-27T00:02:22Z" level=info msg="Backed up a total of 6860 items" backup=velero/velero-full-backup-20220727000056 logSource="pkg/backup/backup.go:405" progress= time="2022-07-27T00:02:22Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20220727000056 logSource="pkg/controller/backup_controller.go:720" time="2022-07-27T00:02:23Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220727000056 controller=backup logSource="pkg/controller/backup_controller.go:730" time="2022-07-27T00:02:23Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unknown desc = EOF, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin1425071790: connect: connection refused\"]" key=velero/velero-full-backup-20220727000056 logSource="pkg/controller/backup_controller.go:298" Here our schedule where you can see when it happens: velero-full-backup-20220727000056 Failed 0 0 2022-07-27 02:00:56 +0200 CEST 13d default <none> velero-full-backup-20220726000055 Completed 0 0 2022-07-26 02:00:55 +0200 CEST 12d default <none> velero-full-backup-20220725000054 Completed 0 0 2022-07-25 02:00:55 +0200 CEST 11d default <none> velero-full-backup-20220724000053 Failed 0 1 2022-07-24 02:00:53 +0200 CEST 10d default <none> velero-full-backup-20220723000052 PartiallyFailed 1 0 2022-07-23 02:00:53 +0200 CEST 9d default <none> velero-full-backup-20220722000051 Completed 0 0 2022-07-22 02:00:52 +0200 CEST 8d default <none> velero-full-backup-20220721000047 Completed 0 0 2022-07-21 02:00:47 +0200 CEST 7d default <none> velero-full-backup-20220720000019 Completed 0 0 2022-07-20 02:00:19 +0200 CEST 6d default <none> velero-full-backup-20220719000045 Failed 0 0 2022-07-19 02:00:45 +0200 CEST 5d default <none> velero-full-backup-20220718000040 Failed 0 0 2022-07-18 02:00:40 +0200 CEST 4d default <none> velero-full-backup-20220717000024 Failed 0 0 2022-07-17 02:00:24 +0200 CEST 3d default <none> velero-full-backup-20220716000015 Failed 0 0 2022-07-16 02:00:15 +0200 CEST 2d default <none> velero-full-backup-20220715000014 Completed 0 0 2022-07-15 02:00:14 +0200 CEST 1d default <none> velero-full-backup-20220714000000 Failed 0 0 2022-07-14 02:00:00 +0200 CEST 18h default <none> Output of velero backup-location: velero backup-location get, seems good (we can see all the time in the logs, that it is successfully checked: NAME PROVIDER BUCKET/PREFIX PHASE LAST VALIDATED ACCESS MODE DEFAULT default azure velero-backups Available 2022-07-27 07:32:51 +0200 CEST ReadWrite true

Please let me know if you need more information.

matthiasritter commented 2 years ago

I just found these lines in the logs:

time="2022-07-25T21:03:18Z" level=error msg="Error getting a backup store" backup-storage-location=velero/default controller=backup-storage-location error="rpc error: code = Unknown desc = azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/XXXXXXXXXXXX/providers/Microsoft.Storage/storageAccounts/XXXXXXlistKeys?%24expand=kerb&api-version=2019-06-01: StatusCode=403 -- Original Error: adal: Refresh request failed. Status Code = '403'. Response body: failed to acquire a token using the MSI VM extension, error: MSI not available\n Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F" error.file="/go/src/velero-plugin-for-microsoft-azure/velero-plugin-for-microsoft-azure/object_store.go:217" error.function=main.getStorageAccountKey logSource="pkg/controller/backup_storage_location_controller.go:126"

@blackpiglet @julienroussel00 can you please reopen this issue?

matthiasritter commented 2 years ago

And another error message that could be related:

time="2022-08-02T00:02:05Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220802000031 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-08-02T00:02:05Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = error reading from server: EOF" key=velero/velero-full-backup-20220802000031 logSource="pkg/controller/backup_controller.go:298"

The backup-files in the (azure) storage-account are also missing.

Here some files are missing: missing This is a successful backup:

matthiasritter commented 2 years ago

And again some more log-output of the schedule which failed tonight:

time="2022-08-04T00:01:24Z" level=info msg="Backed up a total of 2338 items" backup=velero/velero-full-backup-20220804000025 logSource="pkg/backup/backup.go:405" progress=
time="2022-08-04T00:01:24Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20220804000025 logSource="pkg/controller/backup_controller.go:720"
time="2022-08-04T00:01:25Z" level=info msg="Plugin process exited - restarting." backup=velero/velero-full-backup-20220804000025 logSource="pkg/plugin/clientmgmt/restartable_process.go:155"
time="2022-08-04T00:01:25Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-04T00:01:25Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-08-04T00:01:26Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20220804000025 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-08-04T00:01:26Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection closed]" key=velero/velero-full-backup-20220804000025 logSource="pkg/controller/backup_controller.go:298"

matthiasritter commented 2 years ago

Hi again, I tried to reproduce the problem with avctivated debug-logging. After the 8 try I was successful: backuptest1 Completed 0 0 2022-08-08 15:30:23 +0200 CEST 29d default backuptest2 Completed 0 0 2022-08-08 15:31:47 +0200 CEST 29d default backuptest3 Completed 0 0 2022-08-08 15:34:11 +0200 CEST 29d default backuptest4 Completed 0 0 2022-08-08 15:36:36 +0200 CEST 29d default backuptest5 Completed 0 0 2022-08-08 15:41:12 +0200 CEST 29d default backuptest6 Completed 0 0 2022-08-08 15:42:30 +0200 CEST 29d default backuptest7 Completed 0 0 2022-08-08 15:43:39 +0200 CEST 29d default backuptest8 Failed 0 0 2022-08-08 15:44:44 +0200 CEST 29d default

I attached the debug-logs from the last 3 tries. Hope this helps velero_debug.log.zip

blackpiglet commented 2 years ago

@matthiasritter Sorry for late response. Recently, Velero team is busy with v1.10 feature developing and v1.9.1 bug fixing. What is the version of Azure plugin you are using with Velero v1.9.0? Looks like Azure plugin restarting caused backup failure.

TheKangaroo commented 2 years ago

Hi @blackpiglet, I'm a colleague of @matthiasritter. We run Azure plugin v1.5.0 and velero 1.9.0.

blackpiglet commented 2 years ago

I thought this issue was caused by azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request. According to your recent uploaded log, looks like this issue happens intermittently after Azure plugin running for a while. This reminds me another issue #4863. In that issue, CSI plugin also failed after running for a while, and the reason is the plugin framework package used by Velero's version didn't consume the os.Stdout and os.Stderr. After a while, eventually the CSI plugin was blocked since it reached the pipe's max capacity with the accumulated warning messages, but CSI plugin will not return error, so that part is different, but I think that is because the implementation is not same in the two plugins.

Although this issue was fixed in Velero, but Velero plugins still import the v1.7.1 version of Velero, so the fix is not included yet. I think need to update the plugins go.mod file to use new version of Velero.

blackpiglet commented 2 years ago

After taking another look, I found the HarshiCorp go-plugin version is already updated for this version of Azure plugin. Currently I couldn't find the reason why plugin process exited during backup metadata uploading. @ywk253100 Could you give some insight?

matthiasritter commented 2 years ago

Hi @blackpiglet @ywk253100 ,

Is there an update for this? The error still occurs with us.

failureReason: '[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin3466263320: connect: connection refused"]'

Please feel free to let me know if you need more info.

matthiasritter commented 2 years ago

Hi @blackpiglet @ywk253100 we are still seeing this error:

time="2022-10-10T00:01:26Z" level=info msg="Backed up a total of 2645 items" backup=velero/velero-full-backup-20221010000049 logSource="pkg/backup/backup.go:405" progress=
time="2022-10-10T00:01:26Z" level=info msg="Setting up backup store to persist the backup" backup=velero/velero-full-backup-20221010000049 logSource="pkg/controller/backup_controller.go:720"
time="2022-10-10T00:01:28Z" level=info msg="Plugin process exited - restarting." backup=velero/velero-full-backup-20221010000049 logSource="pkg/plugin/clientmgmt/restartable_process.go:155"
time="2022-10-10T00:01:28Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-10-10T00:01:28Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-10-10T00:01:28Z" level=info msg="Backup completed" backup=velero/velero-full-backup-20221010000049 controller=backup logSource="pkg/controller/backup_controller.go:730"
time="2022-10-10T00:01:28Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = error reading from server: EOF, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin586144313: connect: connection refused\"]" key=velero/velero-full-backup-20221010000049 logSource="pkg/controller/backup_controller.go:298"

Is there any update?

blackpiglet commented 2 years ago

Is the backup ended up with Failed status? Could you post the backup's Status section?

matthiasritter commented 2 years ago

Hi @blackpiglet , yes, as described in the log-output "backup completed" and then "backup failed":

velero-full-backup-20221010000049   Failed      0        0          2022-10-10 02:00:49 +0200 CEST   13d       default            <none>

❯ velero describe backup velero-full-backup-20221010000049
Name:         velero-full-backup-20221010000049
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-2.30.1
              velero.io/schedule-name=velero-full-backup
              velero.io/storage-location=default
Annotations:  helm.sh/hook=post-install,post-upgrade,post-rollback
              helm.sh/hook-delete-policy=before-hook-creation
              velero.io/source-cluster-k8s-gitversion=v1.23.12
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=23

Phase:  Failed (run `velero backup logs velero-full-backup-20221010000049` for more information)

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  true

TTL:  336h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2022-10-10 02:00:49 +0200 CEST
Completed:  2022-10-10 02:01:26 +0200 CEST

Expiration:  2022-10-24 02:00:49 +0200 CEST

Total items to be backed up:  2645
Items backed up:              2645

Velero-Native Snapshots:  2 of 2 snapshots completed successfully (specify --details for more information)

dkadetov commented 1 year ago

I have a very similar problem when trying to make a backup of an entire aks cluster (over 14K items). But when I apply --include-namespaces filter (3-5K items) the backup succeeds.

Looks like it's similar to #4158

1.11.0 (velero server) / 1.7.0 (azure plugin) / velero-aks-backup.log

blackpiglet commented 1 year ago

@dkadetov

time="2023-05-10T05:23:36Z" level=debug msg="plugin process exited" backup=velero/daily-cluster-20230510-3 cmd=/plugins/velero-plugin-for-microsoft-azure error="signal: killed" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-microsoft-azure pid=86

This line is the reason why the backup failed. Can you enlarge the resource allocated for the Velero pod, e.g. memory and CPU? Usually, the signal killed means the kubelet kills the process due to OOM, but we still don't know what makes the plugin consumes resources.

dkadetov commented 1 year ago

@blackpiglet The pod was allocated 512MB / 1CPU, but even at its peak it did not consume half that.

blackpiglet commented 1 year ago

It's better that we can find some hints about why the process is killed.

Is there any information in the Velero pod? For example,

kubectl -n veleror describe pod <velero-pod-id>

Or we can try to find some trace on the Velero pod's allocated node. There should be some hints in the dmesg command's output.

dkadetov commented 1 year ago

@blackpiglet It looks like you were right. The pod lacked resources. Due to the short duration of the loads, the monitoring system did not catch the actual consumption of resources in time. Increasing the resource request has added stability.

blackpiglet commented 1 year ago

Close for now. If anything is still unclear, feel free to reopen.

vmware-tanzu / velero

scheduled backups are failing 50% of the time connect: connection refused #4557