vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Restic backups not restored #4072

Open nyarly opened 2 years ago

nyarly commented 2 years ago

What steps did you take and what happened:

I created a cluster to restore a backup into and ran a restore. The Velero restore itself completed fine, but no Restic volumes were restored. I then tore down the restore cluster and replaced it, without re-running the restore.

Based on https://github.com/vmware-tanzu/velero/blob/main/site/content/docs/v1.5/restic.md#troubleshooting: neither instance of the restore cluster has ResticRepository objects, despite the source cluster having repositories and all three clusters being deployed in the same GCP project. The restore clusters do have PodVolumeBackup resources that match the source cluster.
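
For reference, these resources can be listed with something like the following (a sketch only; the namespace is a placeholder for wherever Velero is deployed, and the velero.io/backup-name label is assumed from Velero's defaults):

# Sketch: the restic-related custom resources Velero creates.
# "velero" is a placeholder namespace -- substitute wherever Velero is deployed.
VELERO_NS=velero

# Repositories restic has initialized for backed-up namespaces.
kubectl -n "$VELERO_NS" get resticrepositories

# Per-volume restic backups recorded for a given Velero backup
# (label assumed from Velero's defaults; backup name taken from the logs
# quoted later in this thread).
kubectl -n "$VELERO_NS" get podvolumebackups -l velero.io/backup-name=backup-20210826092545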

What did you expect to happen:

Volumes backed up with Restic to be restored.

The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

Velero is deployed into a different namespace in these clusters.

This instance of the restore cluster hasn't had a restore; I'll update after re-attempting.

Environment:


reasonerjt commented 2 years ago

Checking the log of the backup, there are many errors like this:

time="2021-08-26T16:30:34Z" level=error msg="Error backing up item" backup=management/backup-20210826092545 error="pod volume backup failed: error running restic backup, stderr=
{\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdm\",\"Err\":2},\"during\":\"archival\",\"item\":\"/host_pods/147ad440-f30c-4102-840f-375bf5410f24/volumes/kubernetes.io~csi/pvc-771113e7-6153-4fad-83df-dd4adae5bc89/mount/nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdm\"}
\n
{\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdt\",\"Err\":2},\"during\":\"archival\",\"item\":\"/host_pods/147ad440-f30c-4102-840f-375bf5410f24/volumes/kubernetes.io~csi/pvc-771113e7-6153-4fad-83df-dd4adae5bc89/mount/nodes/0/indices/Xb1jihNcRR-wML2VEubkMA/0/index/_284.fdt\"}\
......

So the problem is that the restic backup failed.

Could you check the log of the restic daemon set in the same time frame?
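
For reference, the restic DaemonSet logs can be pulled with something like this (a sketch; the name=restic pod label assumes a default Velero install, and the namespace comes from the backup name in the logs above):

# Sketch: restic DaemonSet logs around the time of the failed backup.
# The name=restic label assumes a default Velero install; the namespace
# comes from the backup name (management/backup-20210826092545) quoted above.
VELERO_NS=management

# Start slightly before the first lstat error (16:30:34Z above); add
# --max-log-requests if the cluster has more than a handful of nodes.
kubectl -n "$VELERO_NS" logs -l name=restic --timestamps --tail=-1 \
  --since-time="2021-08-26T16:00:00Z"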

This looks very similar to #4065. I'm currently not quite sure how restic works, but I notice these errors occur during "archival". I'll double check whether that means restic may scan the files first and then try to archive them, and that during that timeframe the files are gone.

nyarly commented 2 years ago

In the describe output for the backup, there are similar failures listed for volumes associated with Elasticsearch. If there's a scan-then-backup situation, it may be that ES is shuffling files around under Restic. In the final analysis, if we can't restore ES, that wouldn't be the end of the world.

There are successful backups of volumes for e.g. Cockroach and Vault though, and those we can't do without. But those restores aren't happening at all.

Do you think some failed volume backups are preventing all volumes from restoring?
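
One way to check is whether the per-volume restore objects were created at all (a sketch; the velero.io/restore-name label is assumed from Velero's defaults, and the namespace and restore name are the ones appearing elsewhere in this thread):

# Sketch: check whether restic-level restores exist for the Velero restore.
VELERO_NS=management
RESTORE_NAME=backup-20210826092545-20210827172438

# Per-volume restic restores; if nothing is listed, restic restores never started.
kubectl -n "$VELERO_NS" get podvolumerestores -l velero.io/restore-name="$RESTORE_NAME"

# Velero's own view of the restore, including per-volume details and warnings.
velero -n "$VELERO_NS" restore describe "$RESTORE_NAME" --details
velero -n "$VELERO_NS" restore logs "$RESTORE_NAME"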

nyarly commented 2 years ago

More generally, is there a way to get Velero to take a snapshot and have Restic back that up? "snapshot, backup, release" is a pretty common pattern, and I was surprised that's not what Velero/Restic seem to do, unless I misunderstand

reasonerjt commented 2 years ago

@nyarly Thanks for the reply. Let me clarify: there are two issues in your case.

  1. As for the failures in backup:

    If there's a scan-then-backup situation ...

I'm relatively new to this project and this is my guess; I'll discuss with the other maintainers to understand more details. But could you collect the logs of the restic pod from when lstat reports these errors? That may help me debug this issue further.

  2. As for why the restore failed, I noticed there are a few connection errors in the velero server's log:
    time="2021-08-27T21:40:43Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:85"
    W0827 22:18:53.227257       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PersistentVolumeClaim ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227421       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PersistentVolume ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227424       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PodVolumeBackup ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227536       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.Pod ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    W0827 22:18:53.227667       1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: watch of *v1.PodVolumeRestore ended with: very short watch: pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Unexpected watch close - watch lasted less than a second and no items received
    E0827 22:18:54.076681       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://10.128.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-t0-qa-restore-default-pool-b9a33f02-hbpl&resourceVersion=26768": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.092065       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.504478       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeBackup: failed to list *v1.PodVolumeBackup: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumebackups?resourceVersion=26852": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.832075       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeRestore: failed to list *v1.PodVolumeRestore: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumerestores?resourceVersion=26675": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:54.832134       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://10.128.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=26873": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:55.970490       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.160329       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://10.128.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dgke-t0-qa-restore-default-pool-b9a33f02-hbpl&resourceVersion=26768": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.880096       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Get "https://10.128.0.1:443/api/v1/persistentvolumeclaims?resourceVersion=26873": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:56.971595       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeBackup: failed to list *v1.PodVolumeBackup: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumebackups?resourceVersion=26852": dial tcp 10.128.0.1:443: connect: connection refused
    E0827 22:18:57.304027       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PodVolumeRestore: failed to list *v1.PodVolumeRestore: Get "https://10.128.0.1:443/apis/velero.io/v1/namespaces/management/podvolumerestores?resourceVersion=26675": dial tcp 10.128.0.1:443: connect: connection refused
    I0827 22:19:30.525438       1 trace.go:205] Trace[776971353]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156 (27-Aug-2021 22:19:00.524) (total time: 30000ms):
    Trace[776971353]: [30.000509909s] [30.000509909s] END
    E0827 22:19:30.525490       1 reflector.go:127] pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Get "https://10.128.0.1:443/api/v1/persistentvolumes?resourceVersion=26631": dial tcp 10.128.0.1:443: i/o timeout
    I0827 22:19:30.892186       1 trace.go:205] Trace[638043721]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.19.12/tools/cache/reflector.go:156 (27-Aug-2021 22:19:00.891) (total time: 30000ms):
    Trace[638043721]: [30.00053558s] [30.00053558s] END
    E08

    Is that when you tried to restore?

nyarly commented 2 years ago

Restic logs:

That does seem to be when the restore was attempted: backup-20210826092545-20210827172438

dsu-igeek commented 2 years ago

@nyarly

More generally, is there a way to get Velero to take a snapshot and have Restic back that up? "snapshot, backup, release" is a pretty common pattern, and I was surprised that's not what Velero/Restic seem to do, unless I misunderstand

We're looking into adding this in a future release. With the existing backups available via K8s, we usually don't get a snapshot that is quickly available as a filesystem, so we need to clone the snapshot into a volume and then back it up, which is going to take some time; we'll probably have that as an option, but not the primary option. Most of the snapshot mechanisms available to us allow direct access to the snapshot blocks and changed block tracking, so block-level backups are significantly faster, and we'll be adding more support for using those and pushing to alternate storage.
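
For readers unfamiliar with the pattern being described, here is a generic sketch of "snapshot, then clone the snapshot into a volume" using the standard CSI snapshot API. It is not Velero's implementation, and every name in it (namespace, PVC, snapshot class, storage class) is hypothetical:

# Sketch only: generic CSI snapshot-then-clone, not Velero's implementation.
# All names (app, data-pvc, data-snap, csi-snapclass, standard-csi) are made up.
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
  namespace: app
spec:
  volumeSnapshotClassName: csi-snapclass    # assumed snapshot class
  source:
    persistentVolumeClaimName: data-pvc     # the live volume to snapshot
---
# A clone PVC backed by the snapshot; a file-level tool such as restic can
# back this clone up while the application keeps writing to data-pvc.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-clone
  namespace: app
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard-csi            # assumed storage class
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: data-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF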

crdemon09 commented 2 years ago

Any news on this one?