vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0
8.67k stars · 1.4k forks

Go crash when restoring a schedule backup #1233

Closed PeterGrace closed 5 years ago

PeterGrace commented 5 years ago

What steps did you take and what happened: When attempting to restore a backup created in an Azure AKS Kubernetes cluster, using restic for disk backups, I received the panic log below on v0.10.0. The restore then becomes stuck "InProgress" and will neither resume nor clean itself up when the ark pod is recreated.

What did you expect to happen:

I expected the data to be restored successfully.

The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

time="2019-02-22T19:57:08Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Restoring ClusterRoleBinding: system:controller:disruption-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Restoring ClusterRoleBinding: system:controller:endpoint-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Restoring ClusterRoleBinding: system:controller:expand-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:08Z" level=info msg="Restoring ClusterRoleBinding: system:controller:generic-garbage-collector" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Restoring ClusterRoleBinding: system:controller:horizontal-pod-autoscaler" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Restoring ClusterRoleBinding: system:controller:job-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Restoring ClusterRoleBinding: system:controller:namespace-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Restoring ClusterRoleBinding: system:controller:node-controller" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:764" restore=heptio-ark/totality
time="2019-02-22T19:57:09Z" level=info msg="Restoring ClusterRoleBinding: system:controller:persistent-volume-binder" backup=cortex-ark-total-daily-20190222193143 logSource="pkg/restore/restore.go:796" restore=heptio-ark/totality
E0222 19:57:09.608771       1 runtime.go:66] Observed a panic: "send on closed channel" (send on closed channel)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/chan.go:185
/usr/local/go/src/runtime/chan.go:125
/go/src/github.com/heptio/ark/pkg/restic/repository_ensurer.go:69
/go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/controller.go:202
/go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203
/go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546
/go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: send on closed channel [recovered]
        panic: send on closed channel

goroutine 265 [running]:
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x19f7580, 0x1f22260)
        /usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/heptio/ark/pkg/restic.newRepositoryEnsurer.func1(0x1ca7c80, 0xc0007ee780, 0x1ca7c80, 0xc000ae7380)
        /go/src/github.com/heptio/ark/pkg/restic/repository_ensurer.go:69 +0x2dc
github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(0x0, 0xc0006d9fe0, 0x0, 0x1ca7c80, 0xc0007ee780, 0x1ca7c80, 0xc000ae7380)
        /go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/controller.go:202 +0x5d
github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc00040bf30, 0x3, 0x3)
        /go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:552 +0x18b
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0xc000795e18, 0x42b002, 0xc0000d02b0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:203 +0x9c
github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:548 +0x89
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00040bf68)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000795f68, 0xdf8475800, 0x0, 0x19a0901, 0xc0004a4c60)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc00040bf68, 0xdf8475800, 0xc0004a4c60)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc0006ddd80)
        /go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:546 +0x8d
github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache.(*processorListener).run-fm()
        /go/src/github.com/heptio/ark/vendor/k8s.io/client-go/tools/cache/shared_informer.go:390 +0x2a
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc0003a96b0, 0xc0003223c0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.(*Group).Start
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
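The stack trace points at `repository_ensurer.go:69` inside an informer update handler: a "ready" repository is sent on a channel that the waiting caller has apparently already closed, and a second matching update then panics. A minimal, self-contained Go sketch of that failure mode (illustrative only — this is not Ark's actual code, and the channel-lifecycle details are an inference from the trace):

```go
package main

import "fmt"

// demonstrate sketches the inferred failure mode: an informer-style
// update handler sends each "ready" repository onto a per-request
// channel. The waiting caller closes the channel after the first
// result, so a second matching update (e.g. a duplicate
// ResticRepository for the same namespace) panics with
// "send on closed channel".
func demonstrate() (first, panicMsg string) {
	results := make(chan string, 1)
	onUpdate := func(repo string) {
		results <- repo // panics if results is already closed
	}

	onUpdate("default-default-gl95w") // first repository becomes Ready
	first = <-results
	close(results) // caller got its answer and stops listening

	func() {
		defer func() {
			if r := recover(); r != nil {
				panicMsg = fmt.Sprint(r)
			}
		}()
		onUpdate("default-default-hl89m") // duplicate repository: panic
	}()
	return first, panicMsg
}

func main() {
	first, msg := demonstrate()
	fmt.Println(first)         // default-default-gl95w
	fmt.Println("panic:", msg) // panic: send on closed channel
}
```

In the real crash the panic escapes the handler, which is why the whole server process dies mid-restore.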

Backup: cortex-ark-total-daily-20190222193143

Namespaces:
  Included:  *
  Excluded:

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, restores.ark.heptio.com
  Cluster-scoped:  auto

Namespace mappings:

Label selector:

Restore PVs:  auto

Phase:  InProgress

Validation errors:

Warnings:
Errors:

Restic Restores (specify --details for more information):
  Completed:  1

* `ark restore logs <restorename>`

An error occurred: unable to retrieve logs because restore is not complete



**Anything else you would like to add:**
[Miscellaneous information that will assist in solving the issue.]

**Environment:**

- Velero version (use `ark version`):  *v0.10.1*
- Kubernetes version (use `kubectl version`): *v1.12.5*
- Kubernetes installer & version: *Azure AKS*
- Cloud provider or hardware configuration: *Microsoft Azure*
- OS (e.g. from `/etc/os-release`): Unknown -- Azure AKS uses ubuntu images, I think?
PeterGrace commented 5 years ago

There may be a correlation between this crash and whether the restoreOnly site already has ResticRepositories defined. If there are no ResticRepositories defined, the crash occurs. When I checked ResticRepositories after the crash, there were two repositories defined:

bash-4.4# kubectl get resticrepositories -n heptio-ark
NAME                    AGE
default-default-gl95w   9m
default-default-hl89m   9m
bash-4.4#
skriss commented 5 years ago

Interesting. How many restic backups are in this backup?

PeterGrace commented 5 years ago

@skriss Five restic backups total, for five different pods (one restore per pod). After turning on debug, we see this:

time="2019-02-22T22:09:05Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222220043 logSource="pkg/restore/restore.go:764" restore=heptio-ark/letsrestore
time="2019-02-22T22:09:05Z" level=info msg="Restoring ClusterRoleBinding: system:controller:replicaset-controller" backup=cortex-ark-total-daily-20190222220043 logSource="pkg/restore/restore.go:796" restore=heptio-ark/letsrestore
time="2019-02-22T22:09:05Z" level=debug msg="Ran restic command" command="restic check --repo=azure:ark:/restic/default --password-file=/tmp/ark-restic-credentials-default409359053" logSource="pkg/restic/repository_manager.go:238" repository=default stderr= stdout="using temporary cache in /tmp/restic-check-cache-236045888\ncreated new cache in /tmp/restic-check-cache-236045888\ncreate exclusive lock for repository\nload indexes\ncheck all packs\ncheck snapshots, trees and blobs\nno errors were found\n"
E0222 22:09:05.826883       1 runtime.go:66] Observed a panic: "send on closed channel" (send on closed channel)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
...
skriss commented 5 years ago

When you get two repositories created, can you provide `kubectl get resticrepositories -n heptio-ark -o yaml`?

PeterGrace commented 5 years ago
apiVersion: v1
items:
- apiVersion: ark.heptio.com/v1
  kind: ResticRepository
  metadata:
    creationTimestamp: "2019-02-22T22:09:00Z"
    generateName: default-default-
    generation: 1
    labels:
      ark.heptio.com/storage-location: default
      ark.heptio.com/volume-namespace: default
    name: default-default-s979d
    namespace: heptio-ark
    resourceVersion: "4644"
    selfLink: /apis/ark.heptio.com/v1/namespaces/heptio-ark/resticrepositories/default-default-s979d
    uid: 751ab75f-36ee-11e9-8e6b-7ea983cb7289
  spec:
    backupStorageLocation: default
    maintenanceFrequency: 24h0m0s
    resticIdentifier: azure:ark:/restic/default
    volumeNamespace: default
  status:
    lastMaintenanceTime: "2019-02-22T22:09:03Z"
    message: ""
    phase: Ready
- apiVersion: ark.heptio.com/v1
  kind: ResticRepository
  metadata:
    creationTimestamp: "2019-02-22T22:09:00Z"
    generateName: default-default-
    generation: 1
    labels:
      ark.heptio.com/storage-location: default
      ark.heptio.com/volume-namespace: default
    name: default-default-xfkxn
    namespace: heptio-ark
    resourceVersion: "4740"
    selfLink: /apis/ark.heptio.com/v1/namespaces/heptio-ark/resticrepositories/default-default-xfkxn
    uid: 7525ffe8-36ee-11e9-8e6b-7ea983cb7289
  spec:
    backupStorageLocation: default
    maintenanceFrequency: 24h0m0s
    resticIdentifier: azure:ark:/restic/default
    volumeNamespace: default
  status:
    lastMaintenanceTime: "2019-02-22T22:09:05Z"
    message: ""
    phase: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
skriss commented 5 years ago

If you don't mind providing the full server log, I'd like to take a look at it (preferably at debug level).

I think I've spotted half of the issue, but I'm not sure why you're getting two repos created, so I'm trying to figure that out.
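One plausible (unconfirmed — this is a guess, not verified against Ark's source) mechanism for the duplicate repositories is a non-atomic check-then-create: if two controllers both list ResticRepositories before either Create call lands, each sees none and each issues a Create with `generateName: default-default-`, so the API server mints two distinct names for the same storage-location/namespace pair. A deterministic Go sketch of that interleaving, with a hypothetical `fakeAPI` standing in for the API server:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeAPI is a hypothetical stand-in for the Kubernetes API server,
// used only to illustrate a check-then-create race.
type fakeAPI struct {
	mu    sync.Mutex
	repos []string
	n     int
}

func (a *fakeAPI) list() int {
	a.mu.Lock()
	defer a.mu.Unlock()
	return len(a.repos)
}

func (a *fakeAPI) create(generateName string) string {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.n++ // a counter instead of the API server's random suffix
	name := fmt.Sprintf("%s%d", generateName, a.n)
	a.repos = append(a.repos, name)
	return name
}

func main() {
	api := &fakeAPI{}

	// Both callers run their existence check before either creates.
	backupSawNone := api.list() == 0
	restoreSawNone := api.list() == 0

	if backupSawNone {
		api.create("default-default-")
	}
	if restoreSawNone {
		api.create("default-default-")
	}
	fmt.Println(api.list(), "repositories") // 2 repositories
}
```

The individual list and create calls are each serialized (mutex here, etcd in the real system); it is the gap between check and create that lets two repositories through.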

skriss commented 5 years ago

Ah. Is there a backup running at the same time as the restore by any chance?

PeterGrace commented 5 years ago

This latest message with the debug log was from a new test. At the time of the debug log shown in https://github.com/heptio/velero/issues/1233#issuecomment-466565601, there were other schedules, but they were all Disabled. Only the total backup ran (and completed) before I did the restore.