vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.74k stars 1.41k forks

Velero fails to pass region on to restic leading to failing restic init on DellEMC ECS S3 object store #6246

Open mpsOxygen opened 1 year ago

mpsOxygen commented 1 year ago

What steps did you take and what happened:

Created a kind cluster and installed velero with the following command:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.6.1 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --backup-location-config region=emcreg,s3ForcePathStyle="true",s3Url=http://ecs.metaminds.com:9021,insecureSkipTLSVerify=true \
    --snapshot-location-config region=emcreg,s3ForcePathStyle="true",s3Url=http://ecs.metaminds.com:9021,insecureSkipTLSVerify=true

It installs fine, but when I run velero backup create, restic init fails with errors.

What did you expect to happen: The velero backup create command to work without errors.

The following information will help us better understand what's going on:


Phase: PartiallyFailed (run velero backup logs bibi for more information)

Errors:
  Velero:
    name: /envoy-9kbh4 error: /failed to wait BackupRepository: backup repository is not ready: error running command=restic init --repo=s3:http://ecs.metaminds.com:9021/velero/restic/projectcontour --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic --insecure-tls=true, stdout=, stderr=Fatal: create key in repository at s3:http://ecs.metaminds.com:9021/velero/restic/projectcontour failed: Stat: Access Denied.
    : exit status 1
    name: /node-agent-dgfr4 error: /failed to wait BackupRepository: backup repository is not ready: error running command=restic init --repo=s3:http://ecs.metaminds.com:9021/velero/restic/velero --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic --insecure-tls=true, stdout=, stderr=Fatal: create key in repository at s3:http://ecs.metaminds.com:9021/velero/restic/velero failed: Stat: Access Denied.
    : exit status 1
    name: /velero-7db7f89669-9h7kv error: /backup repository is not ready: error running command=restic init --repo=s3:http://ecs.metaminds.com:9021/velero/restic/velero --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic --insecure-tls=true, stdout=, stderr=Fatal: create key in repository at s3:http://ecs.metaminds.com:9021/velero/restic/velero failed: Stat: Access Denied.
    : exit status 1
  Cluster:
  Namespaces:

Namespaces:
  Included: *
  Excluded:

Resources:
  Included: *
  Excluded:
  Cluster-scoped: auto

Label selector:

Storage Location: default

Velero-Native Snapshot PVs: auto

TTL: 720h0m0s

CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 1h0m0s

Hooks:

Backup Format Version: 1.1.0

Started: 2023-05-09 13:10:19 +0300 EEST
Completed: 2023-05-09 13:10:27 +0300 EEST

Expiration: 2023-06-08 13:10:19 +0300 EEST

Velero-Native Snapshots:

Anything else you would like to add:

The problem is that Velero does not pass the region on to restic when it calls restic init. Because the DellEMC ECS does not have a region, you need to pass one to restic with the option "-o s3.region=emcreg". The value of s3.region can be anything and restic init succeeds. Without it the region is left blank, and the Authorization header contains white space where the region should be, which causes authorization to fail.

This is how the header looks after a restic init without -o s3.region= :

Authorization: AWS4-HMAC-SHA256 Credential=AKIA204A86AD42312497/20230320/ /s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=ec35cdcab04357e3c86ba480cf6235b841a82fe6dd150007957dd49347ded518\r\n

Notice the white space before /s3/aws4_request.

And this is how the header looks after a restic init with -o s3.region=emcreg:

Authorization: AWS4-HMAC-SHA256 Credential=AKIA204A86AD42312497/20230320/emcreg/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=4c035d6719d6f8405af4b58106ea5b2b3951cba0841913065150942642586fbe\r\n

Velero was installed with region=emcreg set, so it should pass it on to restic using -o s3.region=emcreg.
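To make the difference between the two headers concrete, here is a minimal sketch of how the credential scope inside the Credential= field is laid out in AWS Signature Version 4: access key, date, region, service, then the literal aws4_request. The helper name credential_scope and the access key AKIAEXAMPLE are made up for illustration; this is not code from restic or Velero. With an empty region the sketch leaves the slot empty, whereas the capture above shows white space in that slot; either way the scope no longer names a region.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of restic or Velero): joins the parts of a
# SigV4 credential scope in the order key/date/region/service/aws4_request.
credential_scope() {
  local access_key="$1" date="$2" region="$3" service="$4"
  printf '%s/%s/%s/%s/aws4_request' "$access_key" "$date" "$region" "$service"
}

# Region left empty, as when restic init runs without -o s3.region:
credential_scope "AKIAEXAMPLE" "20230320" "" "s3"; echo

# Region supplied, as with -o s3.region=emcreg:
credential_scope "AKIAEXAMPLE" "20230320" "emcreg" "s3"; echo
```

The first call prints a scope with nothing between the date and the service, the second one prints emcreg in that position.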


Lyndon-Li commented 1 year ago

@mpsOxygen Regarding s3Url: if it is specified, Velero expects it to include all the information needed to access the object store, for example s3-<region>.amazonaws.com, which means the region is embedded in the s3Url. In other words, if s3Url is not empty, the region specified separately in the BSL is not honored.

This is the current behavior of the code. Could you check whether you can modify the s3Url for your environment to include the region info? Also, the Restic path for file system backup will be phased out in upcoming releases of Velero, so I suggest trying the Kopia path, which will be the default in upcoming releases.
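The behavior described here can be sketched as a small shell function. This is only an illustration of the rule as stated in this comment, not Velero's actual code; the helper name restic_repo_prefix and the bucket name my-bucket are made up. The two output shapes match the --repo values that appear in the restic init errors in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the rule described above:
# a non-empty s3Url is used as-is and the separate region is ignored;
# otherwise the region is embedded into the AWS host name.
restic_repo_prefix() {
  local s3_url="$1" region="$2" bucket="$3"
  if [ -n "$s3_url" ]; then
    # s3Url set: region plays no part in the repository string.
    printf 's3:%s/%s/restic' "$s3_url" "$bucket"
  else
    # s3Url empty: region becomes part of the endpoint host.
    printf 's3:s3-%s.amazonaws.com/%s/restic' "$region" "$bucket"
  fi
}

# Custom endpoint: the region emcreg does not show up anywhere.
restic_repo_prefix "http://ecs.metaminds.com:9021" "emcreg" "velero"; echo

# No s3Url: the region is carried in the host name.
restic_repo_prefix "" "eu-west-1" "my-bucket"; echo
```

In the first case nothing in the resulting string tells restic which region to sign for, which is consistent with the blank region reported in the Authorization header.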

mpsOxygen commented 1 year ago

If I understand correctly you are saying I should use s3-emcreg.ecs.metaminds.com? I will test that out with the kind cluster.

We built a workaround for the problem on our F5 that makes the ECS respond to region requests with emcreg. It was a pretty simple iRule, but I feel the region option should be honored. I haven't tested Kopia, but we did test Kasten (which uses Kopia) and had the exact same problem, which we also solved with the F5 iRule.

Lyndon-Li commented 1 year ago

@mpsOxygen Could you confirm if setting the s3Url as s3-emcreg.ecs.metaminds.com works?

mpsOxygen commented 1 year ago

I tried it like this, and it says no such bucket exists (the bucket named velero does exist on the ECS):

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --backup-location-config region=emcreg,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

Lyndon-Li commented 1 year ago

@mpsOxygen From the error, it looks like this time it connected to the object store service, but the existing bucket doesn't match the given region.

Could you try the Kopia path by reinstalling Velero with the command below?

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --uploader-type kopia \
    --backup-location-config region=emcreg,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

Then run a file system backup; it will use the Kopia path.

Lyndon-Li commented 1 year ago

@mpsOxygen Please also add s3ForcePathStyle=true to the BSL config and try both the restic and the Kopia path.

For the restic path, the installation is as below:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --backup-location-config region=emcreg,s3ForcePathStyle=true,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

For the Kopia path, the installation is as below:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --uploader-type kopia \
    --backup-location-config region=emcreg,s3ForcePathStyle=true,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

mpsOxygen commented 1 year ago

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --backup-location-config region=emcreg,s3ForcePathStyle=true,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

This one fails with no such bucket:

time="2023-07-05T09:20:52Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = NoSuchBucket: The specified bucket does not exist\n\tstatus code: 404, request id: ac1e420b:188b3cb4945:98d0:1, host id: " error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:426" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:107"

I've checked the credentials and the bucket with CyberDuck and it's all there.

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --uploader-type kopia \
    --backup-location-config region=emcreg,s3ForcePathStyle=true,s3Url=http://s3-emcreg.ecs.metaminds.com:9021,insecureSkipTLSVerify=true

This one fails with no such bucket as well:

time="2023-07-05T09:24:22Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = NoSuchBucket: The specified bucket does not exist\n\tstatus code: 404, request id: ac1e420e:188b3cb4ecf:97cc:c8e, host id: " error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:426" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:107"

I've also captured the traffic for these commands with Wireshark, and they do seem to add the region correctly in the request, but I can't figure out why it says no such bucket.

LE: Dug a bit more with Wireshark and it looks like it's searching for a bucket named emcreg/velero instead of just velero.

Lyndon-Li commented 1 year ago

@mpsOxygen For the Kopia path, could you try the normal endpoint, http://ecs.metaminds.com:9021, as the s3Url? With the Kopia path the region can be set separately, so it doesn't need to be embedded in the s3Url. The installation command is as below:

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.7.0 \
    --bucket velero \
    --use-node-agent \
    --default-volumes-to-fs-backup \
    --secret-file ./velero-creds \
    --uploader-type kopia \
    --backup-location-config region=emcreg,s3ForcePathStyle=true,s3Url=http://ecs.metaminds.com:9021,insecureSkipTLSVerify=true

mpsOxygen commented 1 year ago

It still fails without the region hack on the F5:

kubectl logs deployment/velero -n velero | grep error

Defaulted container "velero" out of: velero, velero-velero-plugin-for-aws (init)
time="2023-08-23T08:00:56Z" level=error msg="Error listing backups in backup store" backupLocation=velero/default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get \"http://ecs.metaminds.com:9021/velero?delimiter=%2F&list-type=2&prefix=backups%2F\": EOF" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:426" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:107"
time="2023-08-23T08:00:57Z" level=error msg="fail to validate backup store" backup-storage-location=velero/default controller=backup-storage-location error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get \"http://ecs.metaminds.com:9021/velero?delimiter=%2F&list-type=2&prefix=\": EOF" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/persistence/object_store.go:198" error.function="github.com/vmware-tanzu/velero/pkg/persistence.(*objectBackupStore).IsValid" logSource="pkg/controller/backup_storage_location_controller.go:155"
time="2023-08-23T08:00:57Z" level=error msg="Current BackupStorageLocations available/unavailable/unknown: 0/0/1, BackupStorageLocation \"default\" is unavailable: rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get \"http://ecs.metaminds.com:9021/velero?delimiter=%2F&list-type=2&prefix=\": EOF)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:192"
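Since these log lines are long, a small filter can help when reading them. The following is a minimal sketch, assuming the key="value" log format shown above; error_messages is a made-up helper name, and the sample line in the demo is shortened from the log above. It keeps only level=error entries and prints their msg field, tolerating escaped quotes inside the message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads Velero log lines on stdin and prints the msg
# field of every level=error entry. Escaped quotes (\") inside msg are kept.
error_messages() {
  grep 'level=error' | sed -nE 's/.*msg="(([^"\\]|\\.)*)".*/\1/p'
}

# Demo with one shortened sample line in the format shown above.
printf '%s\n' \
  'time="2023-08-23T08:00:57Z" level=error msg="fail to validate backup store" controller=backup-storage-location' \
  | error_messages
```

Against a live cluster it would be used as kubectl logs deployment/velero -n velero | error_messages.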

We have given up on the DellEMC ECS and are going to use MinIO instead. Thanks for all the help.

natkondrashova commented 1 year ago

I've run into a similar issue with cross-region AWS S3 (I'm trying to restore a backup from S3 in eu-west-1 to EKS in eu-central-1):

time="2023-08-30T20:13:43Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="backup repository is not ready: error running command=restic init --repo=s3:s3-eu-west-1.amazonaws.com/{{ BUCKET_NAME }}/restic/my-test-backups --password-file=/tmp/credentials/velero/velero-repo-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-eu-west-1.amazonaws.com/{{ BUCKET_NAME }}/restic/my-test-backups failed: client.BucketExists: 301 Moved Permanently\n\n: exit status 1" logSource="pkg/restore/restore.go:1699" restore=velero/test-20230830221335

With the following backupstoragelocations.velero.io:

➜ k -n velero get backupstoragelocations.velero.io default -o yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    meta.helm.sh/release-name: velero
    meta.helm.sh/release-namespace: velero
  creationTimestamp: "2023-08-30T18:33:01Z"
  generation: 250
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-5.0.2
  name: default
  namespace: velero
  resourceVersion: "14228053"
  uid: 1483aea5-0a73-4edf-a58e-791e8eba6083
spec:
  accessMode: ReadWrite
  config:
    region: eu-west-1
  default: true
  objectStorage:
    bucket: {{ BUCKET_NAME }}
  provider: aws
status:
  lastSyncedTime: "2023-08-30T20:54:53Z"
  lastValidationTime: "2023-08-30T20:55:13Z"
  phase: Available

But this command works:

➜ k -n velero exec -it node-agent-n6gtk -- restic init --repo=s3:s3-eu-west-1.amazonaws.com/{{ BUCKET_NAME }}/restic/my-test-backups -o s3.region=eu-west-1 --cache-dir=/scratch/.cache/restic
enter password for new repository:
enter password again:
created restic repository 292b3b2667 at s3:s3-eu-west-1.amazonaws.com/{{ BUCKET_NAME }}/restic/my-test-backups

So the issue definitely exists. Should I create a separate issue for it? Or can you recommend a workaround?

Lyndon-Li commented 1 year ago

@natkondrashova This doesn't look the same as the original issue, so please open a new one. Also, I'm not sure which version you are using; if it isn't the latest, please try the latest version first.