Closed: rosskukulinski closed this issue 6 years ago.
@rosskukulinski looks like you probably have our sample plugin configured on there too? Can you dump the results of kubectl get deploy ark -n heptio-ark -o yaml? On first glance this looks unrelated to the DO plugin itself.
Also, if you do in fact have our sample plugin on there, can you run ark plugin remove <sample>, then try a backup & restore again?
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"ark","namespace":"heptio-ark"},"spec":{"replicas":1,"template":{"metadata":{"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"8085","prometheus.io/scrape":"true"},"labels":{"component":"ark"}},"spec":{"containers":[{"args":["server"],"command":["/ark"],"env":[{"name":"AWS_SHARED_CREDENTIALS_FILE","value":"/credentials/cloud"},{"name":"ARK_SCRATCH_DIR","value":"/scratch"},{"name":"DIGITALOCEAN_TOKEN","valueFrom":{"secretKeyRef":{"key":"digitalocean_token","name":"cloud-credentials"}}}],"image":"gcr.io/heptio-images/ark:latest","name":"ark","volumeMounts":[{"mountPath":"/credentials","name":"cloud-credentials"},{"mountPath":"/plugins","name":"plugins"},{"mountPath":"/scratch","name":"scratch"}]}],"restartPolicy":"Always","serviceAccountName":"ark","volumes":[{"name":"cloud-credentials","secret":{"secretName":"cloud-credentials"}},{"emptyDir":{},"name":"plugins"},{"emptyDir":{},"name":"scratch"}]}}}}
  creationTimestamp: 2018-08-20T17:04:50Z
  generation: 3
  labels:
    component: ark
  name: ark
  namespace: heptio-ark
  resourceVersion: "25471"
  selfLink: /apis/extensions/v1beta1/namespaces/heptio-ark/deployments/ark
  uid: 26373992-a49b-11e8-a9bb-d6573ae0af86
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      component: ark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        component: ark
    spec:
      containers:
      - args:
        - server
        command:
        - /ark
        env:
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: ARK_SCRATCH_DIR
          value: /scratch
        - name: DIGITALOCEAN_TOKEN
          valueFrom:
            secretKeyRef:
              key: digitalocean_token
              name: cloud-credentials
        image: gcr.io/heptio-images/ark:0.9.3
        imagePullPolicy: Always
        name: ark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
      dnsPolicy: ClusterFirst
      initContainers:
      - image: quay.io/stackpoint/ark-blockstore-digitalocean:latest
        imagePullPolicy: IfNotPresent
        name: ark-blockstore-digitalocean
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ark
      serviceAccountName: ark
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-08-20T18:36:28Z
    lastUpdateTime: 2018-08-20T18:36:28Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-08-20T21:38:40Z
    lastUpdateTime: 2018-08-20T21:38:40Z
    message: ReplicaSet "ark-5d48ddd4cf" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 1
@skriss I can't seem to find where the sample plugin is loaded or running.
I removed the DO plugin and am still getting the panic:
ark backup create default-no-do --include-namespaces=default
ark restore create testeroo --from-backup=default-no-do
logs:
time="2018-08-21T13:32:10Z" level=info msg="Starting backup" backup=heptio-ark/default-no-do logSource="pkg/controller/backup_controller.go:339"
time="2018-08-21T13:32:12Z" level=info msg="Backup completed" backup=heptio-ark/default-no-do logSource="pkg/controller/backup_controller.go:401"
time="2018-08-21T13:32:38Z" level=info msg="starting restore" backup=default-no-do logSource="pkg/controller/restore_controller.go:430" restore=heptio-ark/testeroo
E0821 13:32:40.663266 1 runtime.go:66] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/hashmap_fast.go:696
/go/src/github.com/heptio/ark/pkg/util/collections/map_utils.go:132
/go/src/github.com/heptio/ark/pkg/restore/merge_service_account.go:63
/go/src/github.com/heptio/ark/pkg/restore/restore.go:805
/go/src/github.com/heptio/ark/pkg/restore/restore.go:482
/go/src/github.com/heptio/ark/pkg/restore/restore.go:355
/go/src/github.com/heptio/ark/pkg/restore/restore.go:257
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:431
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:284
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:116
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:204
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:191
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: assignment to entry in nil map [recovered]
panic: assignment to entry in nil map
goroutine 278 [running]:
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x17ff7a0, 0x1c3d310)
/usr/local/go/src/runtime/panic.go:502 +0x229
github.com/heptio/ark/pkg/util/collections.MergeMaps(0x0, 0xc420a71f50)
/go/src/github.com/heptio/ark/pkg/util/collections/map_utils.go:132 +0x11b
github.com/heptio/ark/pkg/restore.mergeServiceAccounts(0xc420a65480, 0xc420a65430, 0xf, 0x1a9d701, 0xc420a65430)
/go/src/github.com/heptio/ark/pkg/restore/merge_service_account.go:63 +0x3df
github.com/heptio/ark/pkg/restore.(*context).restoreResource(0xc4206d3200, 0xc42039bb00, 0xf, 0xc42038fc34, 0x7, 0xc42038fc80, 0x3b, 0x0, 0x0, 0x0, ...)
/go/src/github.com/heptio/ark/pkg/restore/restore.go:805 +0x233b
github.com/heptio/ark/pkg/restore.(*context).restoreFromDir(0xc4206d3200, 0xc420a562c0, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/heptio/ark/pkg/restore/restore.go:482 +0xd3d
github.com/heptio/ark/pkg/restore.(*context).execute(0xc4206d3200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/heptio/ark/pkg/restore/restore.go:355 +0x451
github.com/heptio/ark/pkg/restore.(*kubernetesRestorer).Restore(0xc4200d24d0, 0xc420181680, 0xc4202cdd40, 0x1c41de0, 0xc42000c488, 0x1c41e00, 0xc42000c4c8, 0xc42020be00, 0x5, 0x8, ...)
/go/src/github.com/heptio/ark/pkg/restore/restore.go:257 +0xbfc
github.com/heptio/ark/pkg/controller.(*restoreController).runRestore(0xc42031c410, 0xc420181680, 0xc4204ea3f0, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:431 +0xaa1
github.com/heptio/ark/pkg/controller.(*restoreController).processRestore(0xc42031c410, 0xc4203ef760, 0x13, 0x18, 0xc42067cdb8)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:284 +0x705
github.com/heptio/ark/pkg/controller.(*restoreController).(github.com/heptio/ark/pkg/controller.processRestore)-fm(0xc4203ef760, 0x13, 0xc4204516c0, 0x16e1560)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:116 +0x3e
github.com/heptio/ark/pkg/controller.(*restoreController).processNextWorkItem(0xc42031c410, 0xc42008a700)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:204 +0xe0
github.com/heptio/ark/pkg/controller.(*restoreController).runWorker(0xc42031c410)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:191 +0x2b
github.com/heptio/ark/pkg/controller.(*restoreController).(github.com/heptio/ark/pkg/controller.runWorker)-fm()
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178 +0x2a
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4202d67a0)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4208a7fa0, 0x3b9aca00, 0x0, 0x34630b8a001, 0xc4200af0e0)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4202d67a0, 0x3b9aca00, 0xc4200af0e0)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/heptio/ark/pkg/controller.(*restoreController).Run.func2(0xc42031c410, 0x1c621e0, 0xc42039e880, 0xc4204ea5f0)
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178 +0x68
created by github.com/heptio/ark/pkg/controller.(*restoreController).Run
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:177 +0x31d
Can you download the backup and provide the service account objects (if any) in it? It looks like there's a service account without annotations that may be triggering a logic bug.
I'm suspecting https://github.com/heptio/ark/blob/v0.9.3/pkg/restore/merge_service_account.go#L37-L38 results in a nil Annotations map from the cluster's copy of the service account, then https://github.com/heptio/ark/blob/v0.9.3/pkg/restore/merge_service_account.go#L63 tries to merge it, and https://github.com/heptio/ark/blob/v0.9.3/pkg/util/collections/map_utils.go#L128 doesn't check for a nil map on the first argument.
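For context, the crash is Go's standard nil-map behavior rather than anything Ark-specific. This small standalone sketch (not Ark code) reproduces the exact panic message from the trace when a service account's Annotations map is nil:

```go
package main

import "fmt"

func main() {
	// A ServiceAccount with no annotations deserializes to a nil map.
	var annotations map[string]string

	// Reading from a nil map is safe in Go and yields the zero value...
	fmt.Println(annotations["ark.heptio.com/my-plugin"]) // prints an empty string

	// ...but writing to a nil map panics with exactly the message seen in
	// the Ark logs: "assignment to entry in nil map".
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	annotations["ark.heptio.com/my-plugin"] = "1" // panics here
}
```

This is consistent with MergeMaps writing into the cluster copy's nil Annotations map without allocating it first.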
@nrb here's a backup tgz:
nginx-example-no-do-data.tar.gz
My sonobuoy scan completed successfully:
Ran 143 of 996 Specs in 3884.585 seconds
SUCCESS! -- 143 Passed | 0 Failed | 0 Pending | 853 Skipped PASS
Ginkgo ran 1 suite in 1h4m45.465635726s
Test Suite Passed
{
  "apiVersion": "v1",
  "kind": "ServiceAccount",
  "metadata": {
    "annotations": {
      "ark.heptio.com/my-plugin": "1"
    },
    "creationTimestamp": "2018-08-20T18:19:05Z",
    "name": "default",
    "namespace": "nginx-example",
    "resourceVersion": "8562",
    "selfLink": "/api/v1/namespaces/nginx-example/serviceaccounts/default",
    "uid": "85845add-a4a5-11e8-a9bb-d6573ae0af86"
  },
  "secrets": [
    {
      "name": "default-token-gbnv8"
    }
  ]
}
Yeah, @nrb that sounds right - and it's getting triggered because at some point @rosskukulinski did have the example plugin running, which put an annotation on the service account, so now that code branch is being exercised.
@rosskukulinski to proceed with DO testing, I would delete and re-create the nginx-example namespace, use only that namespace for testing, and avoid our example plugin anywhere (looks like you're not using it now, but were at some point in the past). That should get you around this issue and let you focus on DO.
If you can get the cluster's service account's YAML before deleting the namespace just to confirm that there are no annotations, that'd be great @rosskukulinski. We'll still need to fix the merge function, but that extra confirmation would be nice.
k -n nginx-example get serviceaccount default -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: 2018-08-21T16:33:42Z
  name: default
  namespace: nginx-example
  resourceVersion: "129563"
  selfLink: /api/v1/namespaces/nginx-example/serviceaccounts/default
  uid: f77b700a-a55f-11e8-a9bb-d6573ae0af86
secrets:
- name: default-token-jz6qk
Closing the loop here:
1) The MergeMaps function will need to be updated so that it checks for a nil map before writing to it; otherwise, annotations and other metadata written by plugins will cause crashes.
2) We found that the StackPointCloud image contained the example plugins as well as their Digital Ocean volume plugin, and the example plugins were writing the annotations.
$ docker run --entrypoint=/bin/sh --rm -ti quay.io/stackpoint/ark-blockstore-digitalocean:latest
~ $ ls
bin dev etc home lib media mnt plugins proc root run sbin srv sys tmp usr var
~ $ cd plugins/
/plugins $ ls
ark-backupitemaction-log-and-annotate ark-blockstore-digitalocean ark-objectstore-file ark-restoreitemaction-log-and-annotate
I'll work on a test and fix for 1.
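A nil-safe version of the merge might look roughly like the sketch below; the function name mergeMaps and the keep-existing-keys semantics are my assumptions based on the stack trace, not the actual Ark patch:

```go
package main

import "fmt"

// mergeMaps is a hypothetical nil-safe variant of Ark's collections.MergeMaps:
// if the destination map is nil (e.g. a ServiceAccount with no annotations),
// allocate it before copying entries instead of panicking.
func mergeMaps(dest, source map[string]string) map[string]string {
	if dest == nil {
		dest = make(map[string]string, len(source))
	}
	for k, v := range source {
		// Assumed semantics: keep entries already present in dest and
		// only copy keys it doesn't have yet.
		if _, ok := dest[k]; !ok {
			dest[k] = v
		}
	}
	return dest
}

func main() {
	var clusterAnnotations map[string]string // nil, as on a fresh service account
	backup := map[string]string{"ark.heptio.com/my-plugin": "1"}
	merged := mergeMaps(clusterAnnotations, backup)
	fmt.Println(merged["ark.heptio.com/my-plugin"]) // prints: 1
}
```

The key point is only the nil check; whichever precedence rule the real MergeMaps uses can stay unchanged.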
I opened an issue for (2) against the StackPointCloud digitalocean plugin: https://github.com/StackPointCloud/ark-plugin-digitalocean/issues/2
Thanks @rosskukulinski - I saw the issue (2) opened and am working on that now.
Issue 2 has been resolved. I pushed a new container with the examples removed.
Thanks for that @edevenport!
What steps did you take and what happened:
Attempting to restore a backup of the example-nginx Pod & PV/PVC using the DigitalOcean plugin results in an Ark panic.
What did you expect to happen:
Ark should not panic - the restore should complete.
The output of the following commands will help us better understand what's going on:
Logs can be found here: https://gist.github.com/rosskukulinski/d3655715348abc3d2ea6dca1033db418
Environment:
Ark version (use ark version):
Kubernetes version (use kubectl version):
Kubernetes installer & version: StackPointCloud on DigitalOcean
Cloud provider or hardware configuration: Digital Ocean
OS (e.g. from /etc/os-release): CoreOS Stable