vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Fatal: unable to open config file results in PartiallyFailed Backup #8263

Open amrap030 opened 1 week ago

amrap030 commented 1 week ago

What steps did you take and what happened:

Unfortunately, my backups end up PartiallyFailed due to the following error:

Errors:
  Velero:   message: /pod volume backup failed: data path backup failed: error running restic backup command restic backup --repo=s3:https://***.net/<bucketname>/velero/restic/kube-system --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic . --tag=pod-uid=1927b692-dda3-4994-b047-335921d6dc2c --tag=volume=socket-dir --tag=backup=velero-daily-20241004171637 --tag=backup-uid=4afba32a-2995-48e1-bd80-cc811de09aeb --tag=ns=kube-system --tag=pod=openstack-cinder-csi-controllerplugin-7f8cf7f5cb-r8ppl --host=velero --json with error: exit status 1 stderr: Fatal: unable to open config file: Stat: Get "https://***.net/<bucketname>/?location=": dial tcp: lookup ***.net: i/o timeout
Is there a repository at the following location?
s3:https://***.net/<bucketname>/velero/restic/kube-system

However, when I look into the bucket with an S3 viewer, the repository /velero/restic/kube-system is there, and it contains the config file along with the snapshots, etc.

I already tried setting various proxy settings, because I run this on-premise and the S3 bucket is hosted on on-premise enterprise object storage, but without success. Since the backup files are uploaded to the S3 bucket just fine, I assume the proxy settings are not the problem. I also installed restic on my local machine and verified the repository via restic -r s3:https://***.net/<bucketname>/velero/restic/kube-system snapshots, which works just fine.
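
For reference, "setting various proxy settings" means putting proxy environment variables on the Velero pods; a rough sketch of the kind of thing I tried (the proxy host and NO_PROXY values below are placeholders, and the node-agent daemonset name applies to Velero 1.10+ installs):

    kubectl -n velero set env deployment/velero \
      HTTP_PROXY=http://proxy.internal:3128 \
      HTTPS_PROXY=http://proxy.internal:3128 \
      NO_PROXY=.cluster.local,.svc,10.0.0.0/8
    kubectl -n velero set env daemonset/node-agent \
      HTTP_PROXY=http://proxy.internal:3128 \
      HTTPS_PROXY=http://proxy.internal:3128 \
      NO_PROXY=.cluster.local,.svc,10.0.0.0/8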

Additionally, I am using the velero/velero-plugin-for-aws:v1.9.0 plugin, as the storage is S3-compatible.

Since everything runs in our on-premise environment, I would rather not attach the debug information bundle, as it might contain sensitive data.

What did you expect to happen:

I expect the backup to complete successfully instead of ending up PartiallyFailed.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

Lyndon-Li commented 5 days ago

Could you try the kopia path instead? The restic path is being deprecated, so we are not going to work on the restic path for troubleshooting or enhancements.
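
For a CLI-based install, switching the uploader roughly looks like this (a sketch only; the exact flags depend on how Velero was installed, e.g. Helm values vs. velero install, --uploader-type requires Velero 1.10+, and the bucket, secret file, and backup-location-config values below are placeholders):

    velero install --uploader-type=kopia --use-node-agent \
      --provider aws --plugins velero/velero-plugin-for-aws:v1.9.0 \
      --bucket <bucketname> --secret-file ./credentials-velero \
      --backup-location-config s3Url=https://***.net,s3ForcePathStyle=true,region=default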

amrap030 commented 3 days ago

@Lyndon-Li yes I will try that and post my results

amrap030 commented 2 days ago

@Lyndon-Li with kopia I am getting a similar error:

Errors:
  Velero:    message: /pod volume backup failed: error to initialize data path: error to boost backup repository connection default-kube-system-kopia: error to connect backup repo: error to connect to storage: error retrieving storage config from bucket "expcs3mbvd-uptime": Get "https://***.net/expcs3mbvd-uptime/velero/kopia/kube-system/.storageconfig": dial tcp: lookup ***.net: i/o timeout
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  *
  Excluded:  velero

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  168h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-10-09 10:10:59 +0200 CEST
Completed:  2024-10-09 10:16:19 +0200 CEST

Expiration:  2024-10-16 10:10:59 +0200 CEST

Total items to be backed up:  963
Items backed up:              963

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included>

  Pod Volume Backups - kopia:
    Completed:
      argocd/argocd-application-controller-0: argocd-home
      argocd/argocd-applicationset-controller-57f56b4dd5-q4j5f: gpg-keyring, tmp
      argocd/argocd-dex-server-65db84595d-8btc8: dexconfig, static-files
      argocd/argocd-server-6587765cbb-qxdk9: plugins-home, tmp
      kube-system/metrics-server-v0.7.1-685874c7b8-vx464: tmp-dir
      monitoring/kube-prometheus-stack-grafana-79d64d6566-v9kp5: sc-dashboard-volume, sc-datasources-volume, sc-plugins-volume, storage
      monitoring/prometheus-kube-prometheus-stack-prometheus-0: config-out, prometheus-kube-prometheus-stack-prometheus-db
      monitoring/uptime-kuma-67bdd4dd-bkwkv: storage
      networking/traefik-7bb494677b-nf74h: data, tmp
      trivy-system/trivy-operator-6758798dc6-mn4kv: cache-policies
    Failed:
      kube-system/openstack-cinder-csi-controllerplugin-7f8cf7f5cb-r8ppl: socket-dir
HooksAttempted:  0
HooksFailed:     0

I am still not sure why the error occurs, because the directory kube-system actually exists in the bucket.
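
Since both restic and kopia fail with the same "dial tcp: lookup ***.net: i/o timeout", this looks more like DNS resolution failing from inside the node-agent pod than like a missing repository. A rough way to check that (a sketch, assuming ephemeral containers are enabled in the cluster; the pod name is a placeholder):

    kubectl -n velero get pods                          # pick one of the node-agent pods
    kubectl -n velero debug -it <node-agent-pod> --image=busybox:1.36 -- nslookup ***.net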