Closed: @archmangler closed this issue 4 years ago.
@archmangler based on the following log line:
time="2020-03-06T05:00:31Z" level=info msg="Skipping persistent volume snapshot because volume has already been backed up with restic." backup=velero/backup-schedule-frequent-snapshot-test-20200306050028 group=v1 logSource="pkg/backup/item_backupper.go:393" name=pvc-75f44b4c-5f64-11ea-9866-929bf458f004 namespace=default persistentVolume=pvc-75f44b4c-5f64-11ea-9866-929bf458f004 resource=pods
It doesn't sound like you're taking managed snapshots; you're using restic, in which case the VolumeSnapshotLocation is irrelevant. Please clarify.
Hi @skriss - You are correct, I had installed with --restic, and had the following annotation:
backup.velero.io/backup-volumes: <volumename>
However, even in this case, how can I tell restic to use a designated resource group as the location to store its snapshots?
Next, I have now reinstalled without the restic plugin, and my intention is to have "Velero Managed Disk Snapshots" without restic. Now I am not getting any snapshots, restic or otherwise:
/usr/local/bin/velero install --provider azure \
  --bucket "lolcorpaz1aksbkp" \
  --secret-file "velero-credentials" \
  --image "velero/velero:v1.1.0" \
  --backup-location-config resourceGroup="rsg-lolcorp-uat-az1-aksbkp",storageAccount="stalolcorpuataz1aksbkp" \
  --snapshot-location-config apiTimeout="1m",resourceGroup="rsg-lolcorp-uat-az1-aksbkp" \
  --velero-pod-cpu-limit "0" \
  --velero-pod-cpu-request "0" \
  --velero-pod-mem-limit "0" \
  --velero-pod-mem-request "0" \
  --wait
apiVersion: v1
kind: Pod
metadata:
  name: mdvelerotest3
  namespace: default
spec:
  containers:
  - args:
    - "10000"
    command:
    - sleep
    image: velero/velero:v1.1.0
    imagePullPolicy: IfNotPresent
    name: testmd2
    volumeMounts:
    - mountPath: "/mnt/"
      name: mdstorage3
  volumes:
  - name: mdstorage3
    persistentVolumeClaim:
      claimName: mdsnapshotest3
  imagePullSecrets:
  - name: docker-release.lolcorp
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mdsnapshotest3
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: "lolcorp-aks-managed-disk"
  resources:
    requests:
      storage: 100G
NOTE: I'm running pods with both Azure Files and Managed Disk PVs and need snapshots for both. My understanding is that I can only get Azure Files snapshots via restic, but can get managed disk snapshots without it.
However, this raises new questions:
a) If I install the restic plugin for Azure Files snapshots, how do I exclude managed disks from restic snapshotting but include them in Velero's normal snapshotting?
b) How do I tell restic to use the storage account I specify?
c) The managed disk snapshots are not happening without restic installed; how do I enable these (without restic)?
Ideally, I'd like a configuration that allows:
a) Azure Files snapshots (with restic, as I understand this is the only way to get them)
b) Managed disk snapshots (with any other Velero mechanism)
c) All snapshots going to a resource group I specify, not the MC_ AKS resource group.
My understanding is: I can only get snapshots for azure files with the restic plugin (hence --restic option at install time), but I can get snapshots for managed disk without installing restic
That is correct.
a) If I install the restic plugin for Azure Files snapshots, how do I exclude managed disks from restic snapshotting but include them in Velero's normal snapshotting?
You will get your desired behavior by default, assuming Velero is configured correctly. Specifically, as long as you don't add the restic annotation (backup.velero.io/backup-volumes) to the pod, the PV will be snapshotted by default (again, assuming things are configured correctly).
b) How do I tell restic to use the storage account I specify?
All restic data is stored in the same bucket/blob container as the rest of the main Velero backup data/metadata. That can be in whatever storage account you want. You specify it via the config.storageAccount field on the BackupStorageLocation -- see https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/blob/master/backupstoragelocation.md.
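As a sketch (reusing the bucket, storage account, and resource group names from the install command earlier in this thread; field layout per the Azure plugin's BackupStorageLocation schema), the resulting resource looks roughly like:

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    # "bucket" maps to an Azure blob container
    bucket: lolcorpaz1aksbkp
  config:
    # storage account holding the container -- restic data lands here too
    storageAccount: stalolcorpuataz1aksbkp
    resourceGroup: rsg-lolcorp-uat-az1-aksbkp
```

Restic repositories then live alongside the backup metadata in that same container, under a restic/ prefix.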
c) The managed disk snapshots are not happening without restic installed, how to enable these (without restic)?
Something is likely misconfigured. Can you provide (preferably in a gist):
- velero snapshot-location get -o yaml
- velero plugin get
- velero backup get <BACKUP-NAME> -o yaml for the backup where you expected a snapshot, but did not get it
- velero backup logs <BACKUP-NAME> for the same backup
- velero backup describe <BACKUP-NAME> --details for the same backup
That should help us start debugging. Thanks!
Hi @skriss - I've pasted the information you requested here:
https://gist.github.com/archmangler/31397f8f56728d1880ad9ad526010d84
@archmangler thanks for the info.
I see that your Azure managed disk PV is named pvc-0f862e2f-53bd-4500-b563-23548c935fd5.
In your logs, I see:
time="2020-03-09T02:51:41Z" level=info msg="Skipping persistent volume snapshot because volume has already been backed up with restic." backup=velero/test-backup group=v1 logSource="pkg/backup/item_backupper.go:393" name=pvc-0f862e2f-53bd-4500-b563-23548c935fd5 namespace=backuptests persistentVolume=pvc-0f862e2f-53bd-4500-b563-23548c935fd5 resource=pods
This tells me that in the pod that uses this PV, you still have an annotation indicating that this volume should be backed up with restic, i.e. backup.velero.io/backup-volumes: <VOLUME-NAME>. You need to remove the managed disk volume's name from this annotation in order for it to be snapshotted natively.
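For instance, a hedged sketch using the pod and namespace names that appear later in this thread (the trailing '-' in kubectl annotate removes the key):

```shell
# Show which volumes are currently marked for restic backup:
kubectl -n backuptests get pod mdvelerotest \
  -o jsonpath="{.metadata.annotations.backup\.velero\.io/backup-volumes}"

# If the managed-disk volume is the only entry, drop the annotation entirely:
kubectl -n backuptests annotate pod mdvelerotest backup.velero.io/backup-volumes-
```

If the annotation lists several volumes, re-apply it with only the ones restic should handle.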
Separately, I see that the Azure File volume is also annotated to be backed up with restic, but you don't have the restic daemonset installed, so it's never getting processed. You need to either remove the annotation (on the pod that uses the volume, backup.velero.io/backup-volumes: <VOLUME-NAME>) so Velero doesn't attempt to back up this volume with restic, OR install the restic daemonset (--use-restic flag to the velero install command).
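For completeness, enabling the daemonset at install time would just mean appending that flag to the install command quoted earlier in this thread, roughly:

```shell
velero install --provider azure \
  --bucket lolcorpaz1aksbkp \
  --secret-file velero-credentials \
  --backup-location-config resourceGroup=rsg-lolcorp-uat-az1-aksbkp,storageAccount=stalolcorpuataz1aksbkp \
  --snapshot-location-config apiTimeout=1m,resourceGroup=rsg-lolcorp-uat-az1-aksbkp \
  --use-restic \
  --wait
```

All flags other than --use-restic are taken verbatim from the command above.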
@archmangler were you able to resolve this?
Hi @skriss - I've been rebuilding my test setup to reproduce this issue on a clean cluster. I will paste the new results tomorrow.
Hi @skriss - I've pasted debug information here from a fresh cluster:
https://github.com/vmware-tanzu/velero/issues/2328
Notes:
17:42:18 backups/backup-schedule-daily7d168h-20200313191036/backup-schedule-daily7d168h-20200313191036-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-13T19:10:41+00:00
17:42:18 backups/backup-schedule-daily7d168h-20200318174118/backup-schedule-daily7d168h-20200318174118-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-18T17:41:24+00:00
17:42:18 backups/backup-schedule-daily7d168h-20200319010008/backup-schedule-daily7d168h-20200319010008-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-19T07:10:14+00:00
17:42:18 backups/backup-schedule-daily7d168h-20200319092353/backup-schedule-daily7d168h-20200319092353-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-19T09:23:58+00:00
17:42:18 backups/backup-schedule-sunday7d168h-20200318174121/backup-schedule-sunday7d168h-20200318174121-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-18T17:41:29+00:00
17:42:18 backups/backup-schedule-sunday7d168h-20200319092355/backup-schedule-sunday7d168h-20200319092355-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-19T09:24:04+00:00
17:42:18 backups/backup-schedule-wednesday7d168h-20200318174121/backup-schedule-wednesday7d168h-20200318174121-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-18T17:41:35+00:00
17:42:18 backups/backup-schedule-wednesday7d168h-20200319092355/backup-schedule-wednesday7d168h-20200319092355-volumesnapshots.json.gz BlockBlob 29 application/octet-stream 2020-03-19T09:24:09+00:00
/velero install --provider azure --bucket plnszcaz1aksbkp --secret-file velero-credentials --image docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0 --backup-location-config resourceGroup=rsg-lolcorp-dev-az1-aksbkp,storageAccount=stalolcorpdevaz1aksbkp --snapshot-location-config apiTimeout=1m,resourceGroup=rsg-lolcorp-dev-az1-aksbkp --velero-pod-cpu-limit 0 --velero-pod-cpu-request 0 --velero-pod-mem-limit 0 --velero-pod-mem-request 0 --wait
time="2020-03-19T10:37:21Z" level=info msg="Adding pvc azurefilesnapshotest2 to additionalItems" backup=velero/backuptests-lite cmd=/velero logSource="pkg/backup/pod_action.go:67" pluginName=velero
time="2020-03-19T10:37:21Z" level=info msg="Backing up item" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:162" name=azurefilesnapshotest2 namespace=backuptests resource=pods
time="2020-03-19T10:37:21Z" level=info msg="Executing custom action" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:310" name=azurefilesnapshotest2 namespace=backuptests resource=pods
time="2020-03-19T10:37:21Z" level=info msg="Executing takePVSnapshot" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:375" name=pvc-d1c1a9d2-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests resource=pods
time="2020-03-19T10:37:21Z" level=info msg="Skipping persistent volume snapshot because volume has already been backed up with restic." backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:393" name=pvc-d1c1a9d2-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests persistentVolume=pvc-d1c1a9d2-69cb-11ea-94a1-5ebe98412ed4 resource=pods
time="2020-03-19T11:37:21Z" level=info msg="Adding pvc mdsnapshotest to additionalItems" backup=velero/backuptests-lite cmd=/velero logSource="pkg/backup/pod_action.go:67" pluginName=velero
time="2020-03-19T11:37:22Z" level=info msg="Backing up item" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:162" name=mdsnapshotest namespace=backuptests resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Executing custom action" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:310" name=mdsnapshotest namespace=backuptests resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Executing takePVSnapshot" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:375" name=pvc-c7d25291-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Skipping persistent volume snapshot because volume has already been backed up with restic." backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:393" name=pvc-c7d25291-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests persistentVolume=pvc-c7d25291-69cb-11ea-94a1-5ebe98412ed4 resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Adding pvc mdsnapshotest2 to additionalItems" backup=velero/backuptests-lite cmd=/velero logSource="pkg/backup/pod_action.go:67" pluginName=velero
time="2020-03-19T11:37:22Z" level=info msg="Backing up item" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:162" name=mdsnapshotest2 namespace=backuptests resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Executing custom action" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:310" name=mdsnapshotest2 namespace=backuptests resource=pods
time="2020-03-19T11:37:22Z" level=info msg="Executing takePVSnapshot" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:375" name=pvc-ca06433c-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests resource=pods
time="2020-03-19T11:37:23Z" level=info msg="Got volume ID for persistent volume" backup=velero/backuptests-lite group=v1 logSource="pkg/backup/item_backupper.go:426" name=pvc-ca06433c-69cb-11ea-94a1-5ebe98412ed4 namespace=backuptests persistentVolume=pvc-ca06433c-69cb-11ea-94a1-5ebe98412ed4 resource=pods volumeSnapshotLocation=default
Two things:
time="2020-03-19T11:37:23Z" level=error msg="Error backing up item" backup=velero/backuptests-lite error="error getting volume info: rpc error: code = Unknown desc = compute.DisksClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code=\"ResourceNotFound\" Message=\"The Resource 'Microsoft.Compute/disks/kubernetes-dynamic-pvc-ca06433c-69cb-11ea-94a1-5ebe98412ed4' under resource group 'rsg-lolcorp-dev-az1-aksbkp' was not found.\"" group=v1 logSource="pkg/backup/resource_backupper.go:264" name=mdvelerotest2 namespace=backuptests resource=pods
This implies that in the secret, you set AZURE_RESOURCE_GROUP to rsg-lolcorp-dev-az1-aksbkp, not the AKS auto-generated resource group where your disks actually are. This is incorrect.
Please see the documentation for setting this up correctly: https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure#get-resource-group-for-persistent-volume-snapshots
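For reference, a sketch of the credentials file the Azure plugin reads (all values are placeholders, not taken from this thread; the key point is the last line):

```shell
AZURE_SUBSCRIPTION_ID=<subscription-id>
AZURE_TENANT_ID=<tenant-id>
AZURE_CLIENT_ID=<client-id>
AZURE_CLIENT_SECRET=<client-secret>
# Must be the resource group containing the cluster's disks: on AKS that is
# the auto-generated MC_* group, NOT the group you back up into.
AZURE_RESOURCE_GROUP=MC_<cluster-resource-group>_<cluster-name>_<location>
```

The MC_* name is the convention AKS uses for its auto-generated node resource group; check the actual name in the Azure portal or with the Azure CLI.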
Hi @skriss - I've corrected the resource group in the Velero credential file and removed the annotation in the 2 pods I deployed (before running the backups). Here is the debug output from a fresh installation:
https://gist.github.com/archmangler/50e9b50ff212427f540e62b5b263ab66
apiVersion: v1
kind: Pod
metadata:
  name: afsvelerotest
  namespace: backuptests
spec:
  containers:
  - args:
    command:
    image: docker-release.lolcorp.lolcorp.coma:8443/velero/velero:v1.1.0
    imagePullPolicy: IfNotPresent
    name: test
    volumeMounts:
    - mountPath: "/mnt/"
      name: afsstorage
  volumes:
  - name: afsstorage
    persistentVolumeClaim:
      claimName: azurefilesnapshotest2
  imagePullSecrets:
  - name: docker-release.lolcorp.lolcorp.coma
Can you paste the full output of kubectl -n backuptests get pods -o yaml?
Hi @skriss - as below
Argh, I see it. Let me fix that.
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      backup.velero.io/backup-volumes: afsstorage
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"afsvelerotest","namespace":"backuptests"},"spec":{"containers":[{"args":["100000"],"command":["sleep"],"image":"docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0","imagePullPolicy":"IfNotPresent","name":"test","volumeMounts":[{"mountPath":"/mnt/","name":"afsstorage"}]}],"imagePullSecrets":[{"name":"docker-release.lolcorp.lolcorp.com"}],"volumes":[{"name":"afsstorage","persistentVolumeClaim":{"claimName":"azurefilesnapshotest2"}}]}}
      kubernetes.io/psp: privileged
    creationTimestamp: "2020-03-19T16:40:10Z"
    name: afsvelerotest
    namespace: backuptests
    resourceVersion: "5360"
    selfLink: /api/v1/namespaces/backuptests/pods/afsvelerotest
    uid: 4c50ffdf-6a00-11ea-a0f5-4681ac8d14e2
  spec:
    containers:
    - args:
      - "100000"
      command:
      - sleep
      image: docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0
      imagePullPolicy: IfNotPresent
      name: test
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /mnt/
        name: afsstorage
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-szkz2
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: docker-release.lolcorp.lolcorp.com
    nodeName: aks-aksaz1np0-22735939-vmss000000
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: afsstorage
      persistentVolumeClaim:
        claimName: azurefilesnapshotest2
    - name: default-token-szkz2
      secret:
        defaultMode: 420
        secretName: default-token-szkz2
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:31Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:34Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:34Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:31Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://ec220bf7b8b367f12b84cfd3e0258fbebcea7154d6121af0c5d329c167ea4cc9
      image: docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0
      imageID: docker-pullable://docker-release.lolcorp.lolcorp.com:8443/velero/velero@sha256:e35ea9ebcaaa4c4d256a04698b2c337cf8f10d2cc359497468014e4a7e39ee19
      lastState: {}
      name: test
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: "2020-03-19T16:40:33Z"
    hostIP: 10.155.240.4
    phase: Running
    podIP: 10.155.240.54
    qosClass: BestEffort
    startTime: "2020-03-19T16:40:31Z"
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      backup.velero.io/backup-volumes: mdstorage
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"mdvelerotest","namespace":"backuptests"},"spec":{"containers":[{"args":["100000"],"command":["sleep"],"image":"docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0","imagePullPolicy":"IfNotPresent","name":"testmd","volumeMounts":[{"mountPath":"/mnt/","name":"mdstorage"}]}],"imagePullSecrets":[{"name":"docker-release.lolcorp.lolcorp.com"}],"volumes":[{"name":"mdstorage","persistentVolumeClaim":{"claimName":"mdsnapshotest"}}]}}
      kubernetes.io/psp: privileged
    creationTimestamp: "2020-03-19T16:39:52Z"
    name: mdvelerotest
    namespace: backuptests
    resourceVersion: "5437"
    selfLink: /api/v1/namespaces/backuptests/pods/mdvelerotest
    uid: 41e8471b-6a00-11ea-a0f5-4681ac8d14e2
  spec:
    containers:
    - args:
      - "100000"
      command:
      - sleep
      image: docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0
      imagePullPolicy: IfNotPresent
      name: testmd
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /mnt/
        name: mdstorage
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-szkz2
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: docker-release.lolcorp.lolcorp.com
    nodeName: aks-aksaz1np0-22735939-vmss000001
    priority: 0
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext: {}
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: mdstorage
      persistentVolumeClaim:
        claimName: mdsnapshotest
    - name: default-token-szkz2
      secret:
        defaultMode: 420
        secretName: default-token-szkz2
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:26Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:41:46Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:41:46Z"
      status: "True"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-03-19T16:40:26Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: docker://1ce5c86b5cf7b96e1d9f9d75a063c8ff4936aa87400c5ee4915d72369d4aa4b9
      image: docker-release.lolcorp.lolcorp.com:8443/velero/velero:v1.1.0
      imageID: docker-pullable://docker-release.lolcorp.lolcorp.com:8443/velero/velero@sha256:e35ea9ebcaaa4c4d256a04698b2c337cf8f10d2cc359497468014e4a7e39ee19
      lastState: {}
      name: testmd
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: "2020-03-19T16:41:46Z"
    hostIP: 10.155.240.55
    phase: Running
    podIP: 10.155.240.70
    qosClass: BestEffort
    startTime: "2020-03-19T16:40:26Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Hi @skriss - new output, this time with annotations carefully removed from both test pods:
https://gist.github.com/archmangler/e881dcfb31841e1b31f2a75186acbfec
🎉 looks like you got a snapshot for your managed disk - if you run velero backup describe backuptests-lite --details, you'll be able to see the snapshot identifier and confirm which resource group it ended up in.
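To cross-check from the Azure side, something along these lines should list the snapshots in that group (a sketch assuming the Azure CLI is installed and logged in; the resource group name is the one from the install command in this thread):

```shell
az snapshot list --resource-group rsg-lolcorp-dev-az1-aksbkp \
  --query "[].{name:name, created:timeCreated}" --output table
```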
Brilliant! And in the right resource group! Many thanks @skriss this resolves my issue.
awesome, glad you got it working :) I'll close this out.
Does this work when the BackupStorageLocation is in a different region, i.e. can you back up the disk and then restore it, in order to support DR?
Yes, file system backups can go to a BSL in a different region and be used to perform DR.
What steps did you take and what happened:
Install and configuration:
What did you expect to happen:
The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)
- kubectl logs deployment/velero -n velero
- velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
- velero backup logs <backupname>
- velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
- velero restore logs <restorename>
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
- Velero version (use velero version):
- Velero features (use velero client config get features):
- Kubernetes version (use kubectl version):
- OS (e.g. from /etc/os-release):