vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Unable to backup workload to s3 target with restic #4516

Open abarry-gn opened 2 years ago

abarry-gn commented 2 years ago

What steps did you take and what happened: I created my helm values.yaml file.

#--------------------------------------------------#
# Backup Storage Location Configuration Parameters #
#--------------------------------------------------#
configuration:
  provider: aws
  defaultVolumesToRestic: true
  backupStorageLocation:
    name: mybackupstoragelocationname
    bucket: mybucketname
    config:
      region: eu-west-3
      s3Url: "https://storagegrid-s3.example.com"
      s3ForcePathStyle: true
    default: true      
    caCert: "BASE64 ENCODED CERT VALUE IN ONE LINE"  

credentials:
  secretContents:
    cloud: |
      [default]
      aws_access_key_id=MY_S3_ACCESS_KEY
      aws_secret_access_key=MY_S3_SECRET_KEY
      output=json

#-------------------------------------------#
# Velero General Configuration Parameters #
#-------------------------------------------#
image:
  tag: v1.7.0

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.3.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

configMaps:
  restic-restore-action-config:
    labels:
      velero.io/plugin-config: ""
      velero.io/restic: RestoreItemAction
    data:
      image: velero/velero-restic-restore-helper:v1.7.0

kubectl:
  image:
    repository: bitnami/kubectl

snapshotsEnabled: false
deployRestic: true

Then I install velero with the following command:

helm install velero vmware-tanzu/velero --version 2.26.1 --namespace velero --create-namespace -f values.yaml
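
As a sanity check right after the install (a rough sketch; the CA file name is a placeholder), the caCert value and the backup storage location can be verified on each cluster before running a backup:

# caCert expects the base64-encoded PEM on a single line
base64 -w 0 < storagegrid-ca.crt

# confirm the chart registered the backup storage location
velero backup-location get
kubectl -n velero get backupstoragelocations -o yaml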

What did you expect to happen: I have 2 clusters, and velero is successfully installed on both. However, backups work on the first cluster but not on the second one.

I would like the manual and scheduled backups to work on the second cluster also.

The output of the following commands will help us better understand what's going on: I get this kind of error when I run the velero backup command:

time="2021-11-29T08:00:39Z" level=error msg="Error checking repository for stale locks" controller=restic-repo error="error running command=restic unlock --repo=s3:https://storagegrid-s3.example.com/3p-velero/restic/3p --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cacert=/tmp/cacert-bsl-uro-3p914046061 --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?\ns3:https://storagegrid-s3.example.com/3p-velero/restic/3p\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:276" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:144" name=3p-bsl-uro-3p-72h2g namespace=velero
time="2021-11-29T08:00:39Z" level=error msg="Error checking repository for stale locks" controller=restic-repo error="error running command=restic unlock --repo=s3:https://storagegrid-s3.example.com/3p-velero/restic/cattle-prometheus --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cacert=/tmp/cacert-bsl-uro-3p692689128 --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?\ns3:https://storagegrid-s3.example.com/3p-velero/restic/cattle-prometheus\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:276" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:144" name=cattle-prometheus-bsl-uro-3p-rqqsm namespace=velero
time="2021-11-29T08:00:39Z" level=error msg="Error checking repository for stale locks" controller=restic-repo error="error running command=restic unlock --repo=s3:https://storagegrid-s3.example.com/3p-velero/restic/velero --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cacert=/tmp/cacert-bsl-uro-3p737709863 --cache-dir=/scratch/.cache/restic, stdout=, stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?\ns3:https://storagegrid-s3.example.com/3p-velero/restic/velero\n: exit status 1" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/repository_manager.go:276" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*repositoryManager).exec" logSource="pkg/controller/restic_repository_controller.go:144" name=velero-bsl-uro-3p-zxtvt namespace=velero

Environment:

qiuming-best commented 2 years ago

@barryboubakar, here is a similar issue to yours. Maybe you could follow the steps there to clear the previous installation.

qiuming-best commented 2 years ago

@barryboubakar could you provide more logs from the velero and restic pods?

abarry-gn commented 2 years ago

Hi @qiuming-best, I tried the steps suggested in the comments, but it did not solve the issue. I still can't back up to the target.

I also uninstalled and re-installed velero. This fresh install has not changed the situation. Find attached the logs of the velero container when I created a new backup.

velero.log


qiuming-best commented 2 years ago

@barryboubakar could you please enable debug logging on the velero deployment and the restic daemonset, and provide debug logs from the velero pod and one restic pod (kubectl logs -n velero velero-xxx or restic-xxx)? Following your steps I couldn't reproduce the scenario, so I need more logs to track down the issue. Thank you.
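
For reference, one way to collect those is to raise the log level via the chart and then dump the pod logs; a rough sketch (pod names are placeholders, it assumes the chart's configuration.logLevel value, and the restic daemonset args may need adjusting separately):

helm upgrade velero vmware-tanzu/velero --namespace velero --reuse-values --set configuration.logLevel=debug

kubectl -n velero logs deploy/velero > container-velero.log
kubectl -n velero logs restic-xxxxx > container-restic1.log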

abarry-gn commented 2 years ago

Hi @qiuming-best, find attached the logs of the velero and restic pods.

container-restic1.log container-restic2.log container-restic3.log container-velero.log

Thanks

qiuming-best commented 2 years ago

@barryboubakar, using your logs I've reproduced the same error, "unable to open config file: Stat: The specified key does not exist", with the restic client (0.12.0). Here is the restic setup document I followed to install restic. Both of the scenarios below produce the same error:

  1. configuring the wrong repo address
  2. using the right repo address, but running restic unlock without first running restic init

So there must be something wrong with the restic repo (s3:https://storagegrid-s3.example.com/3p-velero/restic/xxx). Maybe your two clusters are using the same repo (same bucket) and the same config, or the restic repo is wrongly configured.
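
For anyone hitting the same message, it can be reproduced by pointing restic at a repository path that was never initialized; a rough sketch (credentials and the repo password are placeholders, and Velero normally runs restic init itself):

export AWS_ACCESS_KEY_ID=MY_S3_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=MY_S3_SECRET_KEY
export RESTIC_PASSWORD=MY_REPO_PASSWORD

# unlock against a path that holds no initialized repo fails with
# "unable to open config file ... Is there a repository at the following location?"
restic -r s3:https://storagegrid-s3.example.com/3p-velero/restic/some-namespace unlock

# the ResticRepository objects show whether Velero's own init step succeeded
kubectl -n velero get resticrepositories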

abarry-gn commented 2 years ago

@qiuming-best, normally I use the same S3 target but a different bucket each time. Those logs were from production. Let me try it again in the lab and come back to you.

abarry-gn commented 2 years ago

Hello @qiuming-best, sorry for the delayed response. I have tried again in the lab. Here is the output of the describe command.

velero backup describe instant-backup-test

Name:         instant-backup-test
Namespace:    velero
Labels:       velero.io/storage-location=bsl-opc-green-cluster
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.20.15
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=20

Phase:  PartiallyFailed (run `velero backup logs instant-backup-test` for more information)

Errors:    58
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  bsl-opc-green-cluster

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2022-03-14 22:26:33 +0100 CET
Completed:  2022-03-15 02:27:02 +0100 CET

Expiration:  2022-04-13 23:26:33 +0200 CEST

Total items to be backed up:  6078
Items backed up:              6078

Velero-Native Snapshots: <none included>

Restic Backups (specify --details for more information):
  Completed:  62
  Failed:     1
  New:        7

And find attached the logs of the velero container.

velero-586b788745-c9mrh_velero-1.log velero-586b788745-c9mrh_velero-2.log

Thank you for your help.

qiuming-best commented 2 years ago

@barryboubakar through the velero container logs you provided, there are some errors:

  1. error getting volume path on host: expected one matching path, got 0, which means velero could not find the directory on the host machine related to pvc-58b3bd67-10f3-4542-93cb-d9226a0546a5 (not in the pod), so check the volume related to that PVC.
  2. MissingRegion: could not find region configuration; maybe you can follow the readme of the aws plugin for the backup storage location configuration.
  3. timed out waiting for all PodVolumeBackups to complete, which is a timeout while backing up the volume of pod filesharing-64d86bd49c-hfpcr, so maybe you should find out what kind of files are in the volume used by filesharing-64d86bd49c-hfpcr.
  4. restic repository is not ready; there may be something abnormal with the restic repo. You can check the logs of the restic pods (a few commands to check these points are sketched below).
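
A few hedged commands that may help narrow down points 1, 2 and 4 (resource names are taken from the logs above or are placeholders, and the name=restic label selector is an assumption that may differ per chart version):

# 1. check the PersistentVolume backing the failing PVC and how it is provisioned
kubectl get pv pvc-58b3bd67-10f3-4542-93cb-d9226a0546a5 -o yaml

# 2. confirm the region is present in the backup storage location config
kubectl -n velero get backupstoragelocation bsl-opc-green-cluster -o yaml

# 4. check the state of the restic repositories and the restic pods
kubectl -n velero get resticrepositories
kubectl -n velero logs -l name=restic --tail=200
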
qiuming-best commented 2 years ago

@barryboubakar what is your backup command? Have you set the storage location, like this: velero backup create test --include-namespaces testns-0 --default-volumes-to-restic --storage-location default --wait

abarry-gn commented 2 years ago

@qiuming-best actually I am putting all this information in my yaml file, if you scroll up:

  defaultVolumesToRestic: true
  backupStorageLocation:
[ ...]
    default: true      

The command I used to run the backup is velero backup create mybackupname. I don't specify any namespaces because I want all namespaces to be backed up.

Now I can run backups, and I understand why some PVCs have failed.

In this example:

Total items to be backed up:  10773
Items backed up:              10773

Velero-Native Snapshots: <none included>

Restic Backups (specify --details for more information):
  Completed:  370
  Failed:     20

All 10773 objects were backed up. Of the PVC volume backups, 370 completed and 20 failed. After some analysis, I found out it is because some PVCs are attached to pods in "Completed" state, and some are just created and referenced in the pod's volumes but not mounted in volumeMounts --> error: error getting volume path on host: expected one matching path, got 0.

The only remaining question is how to exclude volume backups for pods or objects that are in "Completed" state (e.g. jobs)?
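
One partial workaround, sketched here as an assumption rather than the fix the thread settles on: Velero supports a per-pod annotation, backup.velero.io/backup-volumes-excludes, that excludes the named volumes from restic backup, so it can be set on the pod template of jobs whose volumes should be skipped (names below are placeholders):

# exclude a specific volume of a completed job's pod from restic backup
kubectl -n my-namespace annotate pod my-completed-job-pod backup.velero.io/backup-volumes-excludes=my-volume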

truvira commented 2 years ago

Hi Barry,

It looks like this has been added to the latest Velero release.

abarry-gn commented 2 years ago

Hi @truvira, I have upgraded the server-side velero to 1.8.1 (with helm chart 2.29.1) and the client side also to 1.8.1, and nothing has changed yet.

When I run velero backup create, it is still trying to back up volumes of pods that are not in "Running" state.

Do I need to add an additional option?

qiuming-best commented 2 years ago

@barryboubakar I've tested several pods, including one with a PVC whose pod is in CrashLoopBackOff status. The backup completed successfully and skipped the CrashLoopBackOff pod. Since you mentioned above that it is still trying to back up volumes of pods that are not in "Running" state, could you confirm it doesn't skip those pods? (With debug logging enabled, Skipping volume ... is printed by the restic pod on the nodes hosting the non-Running pods.)
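
To make that check concrete, something along these lines can be run against each restic pod (a sketch; the name=restic label selector is an assumption and may differ depending on how the daemonset was deployed):

# list the restic pods and grep each one for the skip message
for p in $(kubectl -n velero get pods -l name=restic -o name); do
  echo "== $p"
  kubectl -n velero logs "$p" | grep -i "skipping volume"
done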

abarry-gn commented 2 years ago

Hello @qiuming-best, I did a new backup and the result is the same. There is no "Skipping volume" in the restic logs. Find attached the restic logs.

How did you install your velero? Are you using the latest version?

BR. restic-ces-oks-worker-opc-blue-north1.txt restic-ces-oks-worker-opc-blue-north2.txt restic-ces-oks-worker-opc-blue-north3.txt restic-ces-oks-worker-opc-blue-south1.txt restic-ces-oks-worker-opc-blue-south2.txt restic-ces-oks-worker-opc-blue-south3.txt restic-ces-oks-worker-opc-blue-west1.txt restic-ces-oks-worker-opc-blue-west2.txt restic-ces-oks-worker-opc-blue-west3.txt restic-ces-oks-worker-opc-blue-west4.txt restic-ces-oks-worker-opc-blue-west5.txt restic-ces-oks-worker-opc-blue-west6.txt restic-ces-oks-worker-opc-blue-west7.txt

abarry-gn commented 2 years ago

Hi @qiuming-best, any update on my last question?