There may be a correlation between this crash and whether or not the restoreOnly site already has ResticRepositories defined. If there are no ResticRepositories defined, the crash occurs. When I checked ResticRepositories after the crash, there were two repositories defined:
bash-4.4# kubectl get resticrepositories -n heptio-ark
NAME                    AGE
default-default-gl95w   9m
default-default-hl89m   9m
bash-4.4#
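For illustration only, here is a minimal sketch of how two repositories could appear: not the actual Ark source, and ensureRepo plus the in-memory slice are hypothetical stand-ins for Ark's repository ensurer and the API server. If two concurrent "ensure repository" calls each do a check-then-create with generateName, both can observe zero repositories and both create one, matching the two default-default-* objects above.

package main

import (
	"fmt"
	"sync"
	"time"
)

type repo struct{ name string }

var (
	mu    sync.Mutex
	repos []repo
)

// ensureRepo mimics a "list, then create if absent" flow. Because the check
// and the create are not one atomic step, two concurrent callers can both
// see zero repositories and both create one (generateName never collides,
// so nothing stops the duplicate).
func ensureRepo(id int) {
	mu.Lock()
	existing := len(repos)
	mu.Unlock()

	// Gap between the check and the create, widened here so the race is easy to hit.
	time.Sleep(time.Millisecond)

	if existing == 0 {
		mu.Lock()
		repos = append(repos, repo{name: fmt.Sprintf("default-default-%d", id)})
		mu.Unlock()
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ensureRepo(i)
		}(i)
	}
	wg.Wait()
	fmt.Println("repositories created:", len(repos)) // prints 2, like the output above
}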
Interesting. How many restic backups are in this backup?
@skriss Five restic backups total, for 5 different pods (one restore per pod). After turning on debug, we see this:
time="2019-02-22T22:09:05Z" level=info msg="Executing item action for clusterrolebindings.rbac.authorization.k8s.io" backup=cortex-ark-total-daily-20190222220043 logSource=
"pkg/restore/restore.go:764" restore=heptio-ark/letsrestore
time="2019-02-22T22:09:05Z" level=info msg="Restoring ClusterRoleBinding: system:controller:replicaset-controller" backup=cortex-ark-total-daily-20190222220043 logSource="p
kg/restore/restore.go:796" restore=heptio-ark/letsrestore
time="2019-02-22T22:09:05Z" level=debug msg="Ran restic command" command="restic check --repo=azure:ark:/restic/default --password-file=/tmp/ark-restic-credentials-default4
09359053" logSource="pkg/restic/repository_manager.go:238" repository=default stderr= stdout="using temporary cache in /tmp/restic-check-cache-236045888\ncreated new cache
in /tmp/restic-check-cache-236045888\ncreate exclusive lock for repository\nload indexes\ncheck all packs\ncheck snapshots, trees and blobs\nno errors were found\n"
E0222 22:09:05.826883 1 runtime.go:66] Observed a panic: "send on closed channel" (send on closed channel)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
...
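For reference, a minimal, self-contained sketch of this failure class (not Ark's code): in Go, sending on a channel that another goroutine has already closed always panics with exactly this message. The sketch recovers the panic and prints it the way apimachinery's runtime.HandleCrash reports it in the trace above.

package main

import (
	"fmt"
	"time"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			// Mirrors the "Observed a panic: ..." line emitted by
			// k8s.io/apimachinery's runtime.HandleCrash.
			fmt.Println("Observed a panic:", r)
		}
	}()

	ready := make(chan struct{})

	// One goroutine decides the work is done and closes the channel...
	go func() {
		close(ready)
	}()

	time.Sleep(10 * time.Millisecond)

	// ...while another still believes it is open and tries to signal on it.
	ready <- struct{}{} // panic: send on closed channel
}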
When you get two repositories created, can you provide kubectl get resticrepositories -n heptio-ark -o yaml?
apiVersion: v1
items:
- apiVersion: ark.heptio.com/v1
  kind: ResticRepository
  metadata:
    creationTimestamp: "2019-02-22T22:09:00Z"
    generateName: default-default-
    generation: 1
    labels:
      ark.heptio.com/storage-location: default
      ark.heptio.com/volume-namespace: default
    name: default-default-s979d
    namespace: heptio-ark
    resourceVersion: "4644"
    selfLink: /apis/ark.heptio.com/v1/namespaces/heptio-ark/resticrepositories/default-default-s979d
    uid: 751ab75f-36ee-11e9-8e6b-7ea983cb7289
  spec:
    backupStorageLocation: default
    maintenanceFrequency: 24h0m0s
    resticIdentifier: azure:ark:/restic/default
    volumeNamespace: default
  status:
    lastMaintenanceTime: "2019-02-22T22:09:03Z"
    message: ""
    phase: Ready
- apiVersion: ark.heptio.com/v1
  kind: ResticRepository
  metadata:
    creationTimestamp: "2019-02-22T22:09:00Z"
    generateName: default-default-
    generation: 1
    labels:
      ark.heptio.com/storage-location: default
      ark.heptio.com/volume-namespace: default
    name: default-default-xfkxn
    namespace: heptio-ark
    resourceVersion: "4740"
    selfLink: /apis/ark.heptio.com/v1/namespaces/heptio-ark/resticrepositories/default-default-xfkxn
    uid: 7525ffe8-36ee-11e9-8e6b-7ea983cb7289
  spec:
    backupStorageLocation: default
    maintenanceFrequency: 24h0m0s
    resticIdentifier: azure:ark:/restic/default
    volumeNamespace: default
  status:
    lastMaintenanceTime: "2019-02-22T22:09:05Z"
    message: ""
    phase: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
If you don't mind providing the full server log, I'd like to look at it (preferably in debug).
I think I've spotted half of the issue, but I'm not sure why you're getting two repos created, so I'm trying to figure that out.
Ah. Is there a backup running at the same time as the restore by any chance?
This latest message with the debug log was from a new test. At the time of the debug log shown in https://github.com/heptio/velero/issues/1233#issuecomment-466565601, there were other schedules, but they were all Disabled. Only the total backup ran (and completed) before I performed the restore.
What steps did you take and what happened: When attempting to restore a backup created in an Azure AKS Kubernetes cluster, using restic for disk backups on v0.10.0, I received the panic log below. The restore then becomes stuck in "InProgress" and will neither resume nor clean itself up when the ark pod is recreated.
What did you expect to happen:
I expected the data to be restored successfully.
The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/ark -n heptio-ark
ark backup describe <backupname>
or kubectl get backup/<backupname> -n heptio-ark -o yaml
ark restore describe <restorename>
or kubectl get restore/<restorename> -n heptio-ark -o yaml
Backup:  cortex-ark-total-daily-20190222193143

Namespaces:
  Included:  *
  Excluded:

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.ark.heptio.com, restores.ark.heptio.com
  Cluster-scoped:  auto

Namespace mappings:

Label selector:

Restore PVs:  auto

Phase:  InProgress

Validation errors:

Warnings:

Errors:

Restic Restores (specify --details for more information):
  Completed:  1

An error occurred: unable to retrieve logs because restore is not complete