vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Allow `velero install` to specify tolerations for restic daemonset #2898

Open · jbmassicotte opened this issue 4 years ago

jbmassicotte commented 4 years ago

I've edited this earlier post of mine to reflect the more recent info I've gathered.

**What I did**

    velero install \
      --provider azure \
      --plugins velero/velero-plugin-for-microsoft-azure:v1.1.0 \
      --bucket $BLOB_CONTAINER \
      --secret-file $CREDENTIAL_FILE \
      --backup-location-config resourceGroup=$AZURE_BACKUP_RESOURCE_GROUP,storageAccount=$AZURE_STORAGE_ACCOUNT \
      --snapshot-location-config apiTimeout=$API_TIMEOUT,resourceGroup=$AZURE_BACKUP_RESOURCE_GROUP \
      --use-restic

- I am using restic because my app mounts an AzureFile volume (it also mounts 3 ManagedDisk volumes, but those are supported natively by Velero)
- I added the AzureFile volume name to the app pod's annotations, as required by restic (`backup.velero.io/backup-volumes: <volumename>`); see the sketch after this list
- I also added the `nouser_xattr` mount option to the AzureFile storageclass, again as required by restic
- Attempted to create a backup: `velero backup create backup1 --include-namespaces mynamespace`
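For reference, a minimal sketch of where those two settings live; the pod, volume, and storageclass names here are illustrative, not from my actual setup:

    # Pod: the annotation lists the volume names (from spec.volumes) restic should back up
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp
      annotations:
        backup.velero.io/backup-volumes: azurefile-vol
    spec:
      containers:
      - name: app
        image: myapp:latest
        volumeMounts:
        - name: azurefile-vol
          mountPath: /data
      volumes:
      - name: azurefile-vol
        persistentVolumeClaim:
          claimName: azurefile-pvc
    ---
    # StorageClass: nouser_xattr disables user extended attributes,
    # which restic otherwise trips over on Azure Files mounts
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: azurefile
    provisioner: kubernetes.io/azure-file
    mountOptions:
    - nouser_xattr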

**The problem**
- `velero backup describe backup1 --details` shows the backup stuck InProgress, with no error or warning. See the attached file.
- the last log line from `kubectl logs deployment/velero -n velero` says 'Initializing restic repository'

**What did you expect to happen:**
The backup to complete

**Anything else you would like to add:**

- I can see in the Azure Portal that Velero created a folder called `restic` under the Azure container, so I know the container location is valid
- I tried removing the AzureFile volume name from the pod annotation and restarted Velero, with the `--use-restic` flag still on, and the backup succeeded this time, which points to restic as the culprit.
**BUT**: I also tried removing the `--use-restic` flag (and checked that the restic daemonset was not started), added the pod annotation back, and get this: the backup failed with the same 'Initializing restic repository' condition. What's up with that!?
- I am starting to believe this is a bug, so please prove me wrong.

**Environment:**

    $ kubectl version --short
    Client Version: v1.15.10
    Server Version: v1.17.9

    $ velero client config get features
    features:

    $ velero version
    Client:
        Version: v1.4.2
        Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
    Server:
        Version: v1.4.2


[create-backup.txt](https://github.com/vmware-tanzu/velero/files/5210944/create-backup.txt)
jbmassicotte commented 4 years ago

We figured out our problem: our cluster is composed of 3 nodepools, the default pool plus, let's say, pools A and B. We have 2 applications, say X and Y, and use tolerations to force app X onto nodepool A and app Y onto nodepool B. Because the restic daemonset carries no tolerations, its pods run only on the default nodepool and fail to back up volumes from applications running on pools A and B.
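For context, a sketch of the kind of setup that triggers this; the nodepool label and taint key/value are assumptions for illustration (matching the toleration example later in this thread), not our actual config:

    # Illustrative: taint all pool-A nodes so that only pods tolerating
    # cpu=mydb:NoSchedule (app X, and after the fix, restic) run there
    kubectl taint nodes -l agentpool=poolA cpu=mydb:NoSchedule

A daemonset without a matching toleration simply never schedules pods on those nodes, so restic silently has no agent on the nodes where the app's volumes are mounted.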

To fix the problem (temporarily), I used `kubectl edit daemonset/restic -n velero` to add the needed toleration, which allowed restic to run on all cluster nodes. Subsequent backups worked.

Questions for the Velero experts: I need to make these changes permanent. How can I pass them to the `velero install` command? Is there a way to provide a daemonset-restic.yaml file to `velero install`, and if so, where can I find the default file that I would use to add the toleration config?

jbmassicotte commented 4 years ago

I ended up writing a script that captures the daemonset yaml config, adds the toleration via a sequence of sed updates, and invokes `kubectl replace` with the updated config. It does the trick, but I find it somewhat cheesy. Any solution deemed more elegant and reliable would be appreciated.
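For anyone else doing this, a sketch of the same workaround using `kubectl patch` instead of sed; the toleration values are illustrative and must match your own nodepool taints:

    # Strategic merge patch: note this replaces the daemonset's whole
    # tolerations list, so include any tolerations it already carries
    kubectl patch daemonset restic -n velero --patch '
    spec:
      template:
        spec:
          tolerations:
          - key: cpu
            operator: Equal
            value: mydb
            effect: NoSchedule
    '

Like `kubectl edit`, this still does not survive a reinstall of Velero, but it is scriptable and avoids fragile sed edits.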

JarnoRFB commented 3 years ago

@jbmassicotte In case you can use the velero helm chart instead, it is possible to specify tolerations for the restic daemonset there: https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml#L269.
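If I read the linked values.yaml correctly, that would look roughly like the following, applied with `helm upgrade`; the exact key may differ between chart versions, and the toleration values are illustrative:

    # values.yaml fragment for the velero helm chart
    restic:
      tolerations:
      - key: cpu
        operator: Equal
        value: mydb
        effect: NoSchedule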

This tripped me up a bit when trying to run a restic backup of a pod on a node where no restic daemon was running. I think it would be good behavior for the backup to raise an error, or at least log a warning, in this situation instead of hanging.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

JarnoRFB commented 3 years ago

Should this be un-staled, as it has already been marked as valuable?

arunvc commented 2 years ago

Backup gets stuck using restic without any clue, just an InProgress status, and `velero install` has no option for tolerations.

Thanks to @jbmassicotte, manual editing works:

    kubectl edit daemonset/restic -n velero

E.g., adding:

      tolerations:
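      # key/value/effect must match the taint on your app nodepools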
      - key: cpu
        operator: Equal
        value: mydb
        effect: NoSchedule
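
To confirm the change took effect, check that a restic pod now runs on every node; this assumes the daemonset pods carry the `name=restic` label Velero sets by default:

    kubectl get pods -n velero -l name=restic -o wide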