Restic restore failing when restoring backup to different EKS cluster in different AWS region

rizblie commented 2 years ago

What steps did you take and what happened:

Setup

I am running two Amazon EKS clusters, one in us-east-2 (primary) and one in us-west-1 (secondary)
Kubernetes version 1.23
I created an S3 bucket in us-east-2.

Steps

I installed velero 1.9.2 (using Helm) with Restic on both clusters, using the same bucket, with access mode set to ReadWrite in primary and ReadOnly in secondary
I installed wordpress on the primary cluster using helm
I created a backup of the wordpress namespace on the primary cluster
I deleted the wordpress namespace from the primary cluster
I successfully restored the wordpress namespace, resources, and volumes from the backup. So far so good.
I then tried to restore the same backup on the secondary cluster.
The Kubernetes resources were restored, but Restic failed to restore the volumes.
I can see that EBS volumes were in fact created in the secondary region, but Restic failed to restore the data.

Errors from restore describe as follows:

Errors:
Velero:  restic repository is not ready: error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header

: exit status 1
  restic repository is not ready: error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header

: exit status 1

I am confused by the fact that the restore action is executing a restic init. The repository already exists, so it just needs an integrity check?

See attached debug bundle.

My helm values file for velero on the secondary is as follows: (primary is similar but ReadWrite, and different role with same permissions)

image:
  repository: velero/velero
  tag: v1.9.2
  pullPolicy: IfNotPresent
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.5.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
configuration:
  provider: aws
  # features: EnableCSI
  defaultVolumesToRestic: true
  backupStorageLocation:
    name: primary
    bucket: 043124067543-velero-primary
    accessMode: ReadOnly
    default: true
    config:
      region: us-east-2
deployRestic: true
restic:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  # Pod priority class name to use for the Restic daemonset. Optional.
  priorityClassName: ""
  # Resource requests/limits to specify for the Restic daemonset deployment. Optional.
  # https://velero.io/docs/v1.6/customize-installation/#customize-resource-requests-and-limits
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1024Mi
serviceAccount:
  server:
    create: true
    name: veleros3
    annotations: 
      eks.amazonaws.com/role-arn: "arn:aws:iam::043124067543:role/ServiceAccount-Velero-Backup-Secondary"

What did you expect to happen:

I expected the Restic volume restore to work in the secondary region, just as it did in the primary region.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add: I am not sure if this is a bug, or if I am doing something wrong in my config. It works fine on the same cluster in the same region, so what is different about a different cluster/region that I may have required a different config parameter somewhere?

Environment:

Velero version (use velero version): 1.9.2
Velero features (use velero client config get features):
Kubernetes version (use kubectl version): v1.23.10-eks-15b7512
Kubernetes installer & version: EKS
Cloud provider or hardware configuration: AWS
OS (e.g. from /etc/os-release): Amazon Linux 2

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now" bundle-2022-10-05-20-15-10.tar.gz

rizblie commented 2 years ago

Just adding the output of velero restic repo get wordpress-primary-8gzr6 -o yaml on the secondary cluster. Again it shows the failed restic init on a repo that already exists. Why is it doing an init?

apiVersion: velero.io/v1
kind: ResticRepository
metadata:
  creationTimestamp: "2022-10-05T18:23:05Z"
  generateName: wordpress-primary-
  generation: 3
  labels:
    velero.io/storage-location: primary
    velero.io/volume-namespace: wordpress
  managedFields:
  - apiVersion: velero.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName: {}
        f:labels:
          .: {}
          f:velero.io/storage-location: {}
          f:velero.io/volume-namespace: {}
      f:spec:
        .: {}
        f:backupStorageLocation: {}
        f:maintenanceFrequency: {}
        f:resticIdentifier: {}
        f:volumeNamespace: {}
      f:status:
        .: {}
        f:message: {}
        f:phase: {}
    manager: velero-server
    operation: Update
    time: "2022-10-05T18:23:26Z"
  name: wordpress-primary-8gzr6
  namespace: velero
  resourceVersion: "39841"
  uid: ddca26ef-88e9-4055-bb99-b778038b8cb7
spec:
  backupStorageLocation: primary
  maintenanceFrequency: 168h0m0s
  resticIdentifier: s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress
  volumeNamespace: wordpress
status:
  message: |-
    error running command=restic init --repo=s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: create repository at s3:s3-us-east-2.amazonaws.com/043124067543-velero-primary/restic/wordpress failed: client.BucketExists: Head "https://043124067543-velero-primary.s3.dualstack.us-west-1.amazonaws.com/": 301 response missing Location header

    : exit status 1
  phase: NotReady

sseago commented 2 years ago

I'm not sure what's going on, but I notice in that error message that something is trying to access the bucket using a us-west-1 URL rather than us-east-2. It could be that some code in the restic/velero codebase is pulling region from the wrong location.

rizblie commented 2 years ago

Thanks @sseago. Yes no matter what I try, Restic ignores the region I have set and tries to connect to S3 bucket using the region that the cluster is running in.

I have tried using all of the the following with no success

backupStorageLocation.config.region: us-east-2
backupStorageLocation.config.s3Url: https://043124067543-velero-primary.s3.dualstack.us-east-2.amazonaws.com/
volumeSnapshotLocation.config.region: us-east-2

So the question boils down to: what is the correct way to tell Restic to use a bucket in a different region to the one it is running in?

rizblie commented 2 years ago

Lookoing at the restic docs, I think I need to figure out a way to get velero to add the option -o s3.region="us-east-2" when calling restic init. Is there any way to configure velero to add option parameters to restic commands?

blackpiglet commented 2 years ago

There is no easy way to add a new parameter in the Restic command. From the Restic document you post, I think adding an environment variable AWS_DEFAULT_REGION in Velero server deployment may make it works.

sseago commented 2 years ago

I'm not sure what's going on here. Restic shouldn't be using the region the cluster is running in -- it should be using the BSL region. If restic is using cluster region instead of BSL region, that sounds like a bug. We shouldn't need to pass this in separately to restic. Restic should use the value from the BSL somehow.

rizblie commented 2 years ago

@sseago agreed, but it is Velero that is invoking Restic, and the BSL is a Velero object. So Velero somehow needs to communicate that BSL region id through to the Restic CLI - which is currently not happening. Agreed it is a bug.

The Restic docs only seem to offer 2 ways to do this: an environment variable, or a command line option.

rizblie commented 2 years ago

There is no easy way to add a new parameter in the Restic command. From the Restic document you post, I think adding an environment variable AWS_DEFAULT_REGION in Velero server deployment may make it works.

I tried this by changing the Restic DaemonSet container spec to include:

      env:
      - name: AWS_DEFAULT_REGION
        value: us-east-2

Then I restarted the Restic pods, but unfortunately it did not work. Got the same error as reported previously.

sseago commented 2 years ago

One other thing to try. Looking at restic github issues, at least one user who had this error resolved it by updating the IAM policy to add "s3:GetBucketLocation". Since the failure happens when the initial request (to the default region) attempts to redirect to a different region, it's possible that this permission is missing. I'm not sure this will help (since it may be that in this case we're dealing with the opposite problem -- restic attempting to redirect to the wrong region), but it's worth trying. If you add this to your user bucket policy, does it help?

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"statement1",
         "Effect":"Allow",
         "Action":[
            "s3:ListAllMyBuckets", 
            "s3:GetBucketLocation"  
         ],
         "Resource":[
            "arn:aws:s3:::*"
         ]
       }
    ]
}

rizblie commented 2 years ago

Thanks @sseago , but I already had that permission in the IAM policy.

Lyndon-Li commented 2 years ago

Velero always set the region name in the AWS URL like this https://bucket-name.s3.region-code.amazonaws.com, where the region-code is replaced by the value specified in backupStorageLocation.config. For Restic, if AWS_DEFAULT_REGION is not set, it (actually the minio client) gets the region name from the URL; otherwise, it respects the value in AWS_DEFAULT_REGION all the time.

Therefore, generally, this behavior works in the case mentioned in the current issue. It means the current issue is not a generic problem.

Lyndon-Li commented 2 years ago

We may need to check where the region name us-west-1 is specified, because if this value is not set in any place, it should not go to the connection URL. It must not be set in BSL of Velero, because if we check the Restic command Velero runs, the region name is correct: --repo=s3:s3-us-east-2.amazonaws.com. Therefore, is there any possibility that AWS_DEFAULT_REGION is set once more and overwriten with us-west-1?

reasonerjt commented 1 year ago

@rizblie I tried to reproduce the problem using velero v1.10.0, installed via CLI and credential file but things seemed to work.

I setup 2 EKS clusters on us-east-2 and us-west-1, using the same command for installation so the velero instances on both cluster point to the same bucket:

./velero install \
  --provider aws \
  --plugins gcr.io/velero-gcp/velero-plugin-for-aws:v1.6.0 \
  --bucket jt-restic-ue2 \
  --secret-file xxxxxxxx/aws-credentials \
  --backup-location-config region=us-east-2 \
  --use-node-agent \
  --uploader-type restic \
  --wait

I tried to run a backup on the cluster on us-east-2 and restore it on the cluster on us-west-1, the restore was successful, and the in the spec of the backuprepository it points to us-east-2:

k get backuprepositories -n velero -oyaml
.....
  spec:
    backupStorageLocation: default
    maintenanceFrequency: 168h0m0s
    repositoryType: restic
    resticIdentifier: s3:s3-us-east-2.amazonaws.com/jt-restic-ue2/restic/nginx-example
    volumeNamespace: nginx-example
....

Could you try using velero v1.10 and credentials rather than AWS role?

I don't quite understand why restic tries to head the URL us-west-1 when the repo id in the command points to us-east-2 My guess is some setting on the EKS confused restic, which may be a bug in restic.

reasonerjt commented 1 year ago

Closing this issue as not reproducible.

vmware-tanzu / velero

Restic restore failing when restoring backup to different EKS cluster in different AWS region #5420