vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.71k stars 1.4k forks source link

Unable to restore PersistentVolumes due to x509 Errors #3678

Open boluisa opened 3 years ago

boluisa commented 3 years ago

What steps did you take and what happened:

We run into the following issue when we perform velero restore operations in AWS Isolated environment. We are also passing a CA cert to communicate with the Cloud service provider.

We are able to perform Velero Backup operations successfully but restore doesn't work.

The restores operation fail with the following approximate error:

error executing PVAction for persistent volumes
Caused by POST https://ec2.region.gov    x509 certificate signed by unknown authority

We're using the following install command, using the velero CLI, to create a YAML manifest which we edit slightly to add in additional proxy environment variables to velero.

velero install --dry-run \
               --output yaml \
               --provider aws \
               --bucket velero-backups \
               --backup-location-config region=us-gov-west-1 \
               --snapshot-location-config region=us-gov-west-1 \
               --image <velero v.1.5.3> \
               --plugins  <velero-plugin-for-aws:v1.2.0> \
               --no-secret \
               --prefix blue-k8s-cluster \
               --cacert /path/to/private/bundle.crt > velero.yaml

What did you expect to happen: We expected velero restore to complete successfully with workloads restored.

Environment:

Velero Client Version: 1.5.4 Kubernetes 1.18

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

salt-mountain commented 3 years ago

Apologies for the delay on additional information.

This is happening in an AWS environment in an isolated region. Retrieving logs from this environment is quite difficult but I'm happy to answer any questions to the best of my abilities, even if getting answers can take a little bit longer. The various service endpoints for this environment are slightly different than the standard Commercial or even GovCloud endpoints and they typically require a specific set of CA Certificates to be added to requests for them to be able to go through.

We initially did what Tunde pasted above -- we generated a velero.yaml with the above command and then manually kicked it off with a kubectl create -f velero.yaml. The velero CLI was able to successfully connect to AWS and we were able to see the backup bucket we had chosen. We were able to also kick off backup jobs that would say they completed successfully and we would see that there were files in the S3 Buckets.

We ran into issues actually trying to attempt to run a restore job. It would Partially Fail and report back as stated above:

error executing PVAction for persistent volumes
Caused by POST https://ec2.isolated-region.gov    x509 certificate signed by unknown authority

We had mistakenly believed that providing our custom CA certs with the --cacert flag would include those certificates throughout velero, not just on the BackupStorageLocation object. After some digging I found a potential solution where I could manually insert the certificates I wanted to be included by mounting them into the container and setting the path to the certs in the container to the AWS_CA_BUNDLE environment variable. So I tried that.

$ kubectl create namespace velero
$ kubectl create configmap cert-bundle --from-file=/home/maintuser/ca-bundle.crt -n velero
# The key of the configmap, unless you specify one, should be the name of the file you generate the configMap from, ca-bundle.crt in our case.
# But let's verify
$ kubectl describe configmap cert-bundle -n velero

I then edited my velero.yaml to mount the configmap as a volume, then to the container and then finally setting the environment variable. Here's what that ended up looking like.

spec:
    template:
      spec:
        containers:
        - args:
          - server
          - --features=
          command:
          - /velero
          env:
          - name: VELERO_SCRATCH_DIR
            value: /scratch
          - name: AWS_CA_BUNDLE
            value: /etc/ssl/certs/ca-bundle.crt
          - name: VELERO_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          - name: LD_LIBRARY_PATH
            value: /plugins
          image: localhost:5000/opensource/velero/velero:v1.5.3
          imagePullPolicy: Always
          name: velero
          ports:
          - containerPort: 8085
            name: metrics
          resources:
            limits:
              cpu: "1"
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 128Mi
          volumeMounts:
          - mountPath: /plugins
            name: plugins
          - mountPath: /scratch
            name: scratch
          - mountPath: /etc/ssl/certs/ca-bundle.crt
            subPath: ca-bundle.crt
            name: certs
        initContainers:
        - image: localhost:5000/opensource/velero/velero-plugin-for-aws:v1.2.0
          imagePullPolicy: Always
          name: velero-plugin-for-aws
          resources: {}
          volumeMounts:
          - mountPath: /target
            name: plugins
          - mountPath: /etc/ssl/certs/ca-bundle.crt
            subPath: ca-bundle.crt
            name: certs
        restartPolicy: Always
        serviceAccountName: velero
        volumes:
        - emptyDir: {}
          name: plugins
        - emptyDir: {}
          name: scratch
        - configMap:
            name: cert-bundle
          name: certs

After using this new manifest to kick velero off, re-taking backups and trying restores, it finally worked.

Please let me know if you have any additional questions that I might be able to provide answers to for this issue.

kaovilai commented 11 months ago

Something to consider documenting into aws plugin readme.