vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Restic backups not restored #4072

Open nyarly opened 2 years ago

nyarly commented 2 years ago

What steps did you take and what happened:

I created a cluster to restore a backup into and ran a restore. The Velero restore itself completed fine, but no Restic volumes were restored. I then tore down the restore cluster and replaced it, without re-running the restore.

Based on https://github.com/vmware-tanzu/velero/blob/main/site/content/docs/v1.5/restic.md#troubleshooting: neither instance of the restore cluster has ResticRepository objects, despite the source cluster having repositories and all three clusters being deployed in the same GCP project. The restore clusters do have PodVolumeBackup resources that match the source cluster.
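
For reference, these resources can be listed with something like the following (a sketch only; the namespace is a placeholder for wherever Velero is deployed, and the velero.io/backup-name label is assumed from Velero's defaults):

# Sketch: the restic-related custom resources Velero creates.
# "velero" is a placeholder namespace -- substitute wherever Velero is deployed.
VELERO_NS=velero

# Repositories restic has initialized for backed-up namespaces.
kubectl -n "$VELERO_NS" get resticrepositories

# Per-volume restic backups recorded for a given Velero backup
# (label assumed from Velero's defaults; backup name taken from the logs
# quoted later in this thread).
kubectl -n "$VELERO_NS" get podvolumebackups -l velero.io/backup-name=backup-20210826092545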

What did you expect to happen:

Volumes backed up with Restic to be restored.

The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

Velero is deployed into a different namespace in these clusters.

This instance of the restore cluster hasn't had a restore; I'll update after re-attempting.

Environment:


reasonerjt commented 2 years ago

Checking the log of the backup, there are many errors like this:

time="2021-08-26T16:30:34Z" level=error msg="Error backing up item" backup=management/backup-20210826092545 error="pod volume backup failed: error running restic backup, stderr=
{\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdm\",\"Err\":2},\"during\":\"archival\",\"item\":\"/host_pods/147ad440-f30c-4102-840f-375bf5410f24/volumes/kubernetes.io~csi/pvc-771113e7-6153-4fad-83df-dd4adae5bc89/mount/nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdm\"}
\n
{\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdt\",\"Err\":2},\"during\":\"archival\",\"item\":\"/host_pods/147ad440-f30c-4102-840f-375bf5410f24/volumes/kubernetes.io~csi/pvc-771113e7-6153-4fad-83df-dd4adae5bc89/mount/nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdt\"}\
......

So the problem is that the restic backup failed.

Could you check the log of the restic daemon set in the same time frame?
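
For reference, the restic DaemonSet logs can be pulled with something like this (a sketch; the name=restic pod label assumes a default Velero install, and the namespace comes from the backup name in the logs above):

# Sketch: restic DaemonSet logs around the time of the failed backup.
# The name=restic label assumes a default Velero install; the namespace
# comes from the backup name (management/backup-20210826092545) quoted above.
VELERO_NS=management

# Start slightly before the first lstat error (16:30:34Z above); add
# --max-log-requests if the cluster has more than a handful of nodes.
kubectl -n "$VELERO_NS" logs -l name=restic --timestamps --tail=-1 \
  --since-time="2021-08-26T16:00:00Z"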

This looks very similar to #4065. I'm currently not quite sure how restic works, but I notice these errors occur during "archival". I'll double check whether that means restic may scan the files first and then try to archive them, and that during that timeframe the files are gone.

nyarly commented 2 years ago

In the describe output for the backup, there are similar failures listed for volumes associated with Elasticsearch. If there's a scan-then-backup situation, it may be that ES is shuffling files around under Restic. In the final analysis, if we can't restore ES, that wouldn't be the end of the world.

There are successful backups of volumes for e.g. Cockroach and Vault though, and those we can't do without. But those restores aren't happening at all.

Do you think some failed volume backups are preventing all volumes from restoring?
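
One way to check is whether the per-volume restore objects were created at all (a sketch; the velero.io/restore-name label is assumed from Velero's defaults, and the namespace and restore name are the ones appearing elsewhere in this thread):

# Sketch: check whether restic-level restores exist for the Velero restore.
VELERO_NS=management
RESTORE_NAME=backup-20210826092545-20210827172438

# Per-volume restic restores; if nothing is listed, restic restores never started.
kubectl -n "$VELERO_NS" get podvolumerestores -l velero.io/restore-name="$RESTORE_NAME"

# Velero's own view of the restore, including per-volume details and warnings.
velero -n "$VELERO_NS" restore describe "$RESTORE_NAME" --details
velero -n "$VELERO_NS" restore logs "$RESTORE_NAME"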

nyarly commented 2 years ago

More generally, is there a way to get Velero to take a snapshot and have Restic back that up? "snapshot, backup, release" is a pretty common pattern, and I was surprised that's not what Velero/Restic seem to do, unless I misunderstand

reasonerjt commented 2 years ago

@nyarly Thanks for the reply. Let me clarify: there are two issues in your case.

  1. As for the failures in backup:

    If there's a scan-then-backup situation ...

I'm relatively new to this project and this is my guess; I'll discuss with the other maintainers to understand more details. But could you collect the logs of the restic pod from when lstat reports these errors? That may help me debug this issue further.

  2. As for why the restore failed, I noticed there are a few connection errors in the velero server's log:
    time="2021-08-27T21:40:43Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:85"
    W0827 22:18:53.227257       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PersistentVolumeClaim ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227421       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PersistentVolume ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227424       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PodVolumeBackup ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227536       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.Pod ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227667       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PodVolumeRestore ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    E0827 22:18:54.076681       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://10.128.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-t0-qa-restore-default-pool-b9a33f02-hbpl&resourceVersion=26768": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.092065       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.504478       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeBackup: failed to list *v1.PodVolumeBackup: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumebackups?resourceVersion=26852": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.832075       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeRestore: failed to list *v1.PodVolumeRestore: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumerestores?resourceVersion=26675": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.832134       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://10.128.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=26873": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:55.970490       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.160329       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://10.128.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-t0-qa-restore-default-pool-b9a33f02-hbpl&resourceVersion=26768": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.880096       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://10.128.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=26873": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.971595       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeBackup: failed to list *v1.PodVolumeBackup: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumebackups?resourceVersion=26852": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:57.304027       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeRestore: failed to list *v1.PodVolumeRestore: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumerestores?resourceVersion=26675": dial tcp 10.128.0.1:443: connect: connection refused
    I0827 22:19:30.525438       1 trace.go:205] Trace[776971353]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156 (27-Aug-2021 22:19:00.524) (total time: 30000ms):
    Trace[776971353]: [30.000509909s] [30.000509909s] END
    E0827 22:19:30.525490       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: i/o timeout
    I0827 22:19:30.892186       1 trace.go:205] Trace[638043721]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156 (27-Aug-2021 22:19:00.891) (total time: 30000ms):
    Trace[638043721]: [30.00053558s] [30.00053558s] END
    E08

    Is that when you tried to restore?

nyarly commented 2 years ago

Restic logs:

That does seem to be when the restore was attempted: backup-20210826092545-20210827172438

dsu-igeek commented 2 years ago

@nyarly

More generally, is there a way to get Velero to take a snapshot and have Restic back that up? "snapshot, backup, release" is a pretty common pattern, and I was surprised that's not what Velero/Restic seem to do, unless I misunderstand

We're looking into adding this in a future release. With the existing backups available via K8s, we usually don't get a snapshot that is quickly available as a filesystem, so we need to clone the snapshot into a volume and then back it up, which is going to take some time; we'll probably have that as an option, but not the primary option. Most of the snapshot mechanisms available to us allow direct access to the snapshot blocks and changed block tracking, so block-level backups are significantly faster, and we'll be adding more support for using those and pushing to alternate storage.
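
For readers unfamiliar with the pattern being described, here is a generic sketch of "snapshot, then clone the snapshot into a volume" using the standard CSI snapshot API. It is not Velero's implementation, and every name in it (namespace, PVC, snapshot class, storage class) is hypothetical:

# Sketch only: generic CSI snapshot-then-clone, not Velero's implementation.
# All names (app, data-pvc, data-snap, csi-snapclass, standard-csi) are made up.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
  namespace: app
spec:
  volumeSnapshotClassName: csi-snapclass    # assumed snapshot class
  source:
    persistentVolumeClaimName: data-pvc     # the live volume to snapshot
---
# A clone PVC backed by the snapshot; a file-level tool such as restic can
# back this clone up while the application keeps writing to data-pvc.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-clone
  namespace: app
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard-csi            # assumed storage class
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF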

crdemon09 commented 2 years ago

Any news on this one?