vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Ark panics during restore using DO plugin #777

Closed: rosskukulinski closed this issue 6 years ago

rosskukulinski commented 6 years ago

What steps did you take and what happened:

Attempting to restore a backup of the nginx-example Pod & PV/PVC using the DigitalOcean plugin results in an Ark panic.

ark backup create namespace-only --include-namespaces=nginx-example
kubectl delete ns nginx-example --force --grace-period=0
ark restore create restore-ns --from-backup namespace-only

What did you expect to happen:

Ark should not panic; the restore should complete.

The output of the following commands will help us better understand what's going on:

Logs can be found here: https://gist.github.com/rosskukulinski/d3655715348abc3d2ea6dca1033db418

Environment:

StackPointCloud on DigitalOcean


skriss commented 6 years ago

@rosskukulinski it looks like you probably have our sample plugin configured there too. Can you dump the results of kubectl get deploy ark -n heptio-ark -o yaml? At first glance this looks unrelated to the DO plugin itself.

skriss commented 6 years ago

Also, if you do in fact have our sample plugin on there, can you run ark plugin remove <sample>, then try a backup & restore again?

rosskukulinski commented 6 years ago
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1beta1","kind":"Deployment","metadata":{"annotations":{},"name":"ark","namespace":"heptio-ark"},"spec":{"replicas":1,"template":{"metadata":{"annotations":{"prometheus.io/path":"/metrics","prometheus.io/port":"8085","prometheus.io/scrape":"true"},"labels":{"component":"ark"}},"spec":{"containers":[{"args":["server"],"command":["/ark"],"env":[{"name":"AWS_SHARED_CREDENTIALS_FILE","value":"/credentials/cloud"},{"name":"ARK_SCRATCH_DIR","value":"/scratch"},{"name":"DIGITALOCEAN_TOKEN","valueFrom":{"secretKeyRef":{"key":"digitalocean_token","name":"cloud-credentials"}}}],"image":"gcr.io/heptio-images/ark:latest","name":"ark","volumeMounts":[{"mountPath":"/credentials","name":"cloud-credentials"},{"mountPath":"/plugins","name":"plugins"},{"mountPath":"/scratch","name":"scratch"}]}],"restartPolicy":"Always","serviceAccountName":"ark","volumes":[{"name":"cloud-credentials","secret":{"secretName":"cloud-credentials"}},{"emptyDir":{},"name":"plugins"},{"emptyDir":{},"name":"scratch"}]}}}}
  creationTimestamp: 2018-08-20T17:04:50Z
  generation: 3
  labels:
    component: ark
  name: ark
  namespace: heptio-ark
  resourceVersion: "25471"
  selfLink: /apis/extensions/v1beta1/namespaces/heptio-ark/deployments/ark
  uid: 26373992-a49b-11e8-a9bb-d6573ae0af86
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      component: ark
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        component: ark
    spec:
      containers:
      - args:
        - server
        command:
        - /ark
        env:
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: ARK_SCRATCH_DIR
          value: /scratch
        - name: DIGITALOCEAN_TOKEN
          valueFrom:
            secretKeyRef:
              key: digitalocean_token
              name: cloud-credentials
        image: gcr.io/heptio-images/ark:0.9.3
        imagePullPolicy: Always
        name: ark
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /plugins
          name: plugins
        - mountPath: /scratch
          name: scratch
      dnsPolicy: ClusterFirst
      initContainers:
      - image: quay.io/stackpoint/ark-blockstore-digitalocean:latest
        imagePullPolicy: IfNotPresent
        name: ark-blockstore-digitalocean
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ark
      serviceAccountName: ark
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-08-20T18:36:28Z
    lastUpdateTime: 2018-08-20T18:36:28Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-08-20T21:38:40Z
    lastUpdateTime: 2018-08-20T21:38:40Z
    message: ReplicaSet "ark-5d48ddd4cf" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 2
  unavailableReplicas: 1
  updatedReplicas: 1
rosskukulinski commented 6 years ago

@skriss I can't seem to find where the sample plugin is loaded or running.

I removed the DO plugin and am still getting the panic:

ark backup create default-no-do --include-namespaces=default
ark restore create testeroo --from-backup=default-no-do

logs:

time="2018-08-21T13:32:10Z" level=info msg="Starting backup" backup=heptio-ark/default-no-do logSource="pkg/controller/backup_controller.go:339"
time="2018-08-21T13:32:12Z" level=info msg="Backup completed" backup=heptio-ark/default-no-do logSource="pkg/controller/backup_controller.go:401"
time="2018-08-21T13:32:38Z" level=info msg="starting restore" backup=default-no-do logSource="pkg/controller/restore_controller.go:430" restore=heptio-ark/testeroo
E0821 13:32:40.663266       1 runtime.go:66] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/hashmap_fast.go:696
/go/src/github.com/heptio/ark/pkg/util/collections/map_utils.go:132
/go/src/github.com/heptio/ark/pkg/restore/merge_service_account.go:63
/go/src/github.com/heptio/ark/pkg/restore/restore.go:805
/go/src/github.com/heptio/ark/pkg/restore/restore.go:482
/go/src/github.com/heptio/ark/pkg/restore/restore.go:355
/go/src/github.com/heptio/ark/pkg/restore/restore.go:257
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:431
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:284
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:116
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:204
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:191
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: assignment to entry in nil map [recovered]
        panic: assignment to entry in nil map

goroutine 278 [running]:
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x17ff7a0, 0x1c3d310)
        /usr/local/go/src/runtime/panic.go:502 +0x229
github.com/heptio/ark/pkg/util/collections.MergeMaps(0x0, 0xc420a71f50)
        /go/src/github.com/heptio/ark/pkg/util/collections/map_utils.go:132 +0x11b
github.com/heptio/ark/pkg/restore.mergeServiceAccounts(0xc420a65480, 0xc420a65430, 0xf, 0x1a9d701, 0xc420a65430)
        /go/src/github.com/heptio/ark/pkg/restore/merge_service_account.go:63 +0x3df
github.com/heptio/ark/pkg/restore.(*context).restoreResource(0xc4206d3200, 0xc42039bb00, 0xf, 0xc42038fc34, 0x7, 0xc42038fc80, 0x3b, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/heptio/ark/pkg/restore/restore.go:805 +0x233b
github.com/heptio/ark/pkg/restore.(*context).restoreFromDir(0xc4206d3200, 0xc420a562c0, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/heptio/ark/pkg/restore/restore.go:482 +0xd3d
github.com/heptio/ark/pkg/restore.(*context).execute(0xc4206d3200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/heptio/ark/pkg/restore/restore.go:355 +0x451
github.com/heptio/ark/pkg/restore.(*kubernetesRestorer).Restore(0xc4200d24d0, 0xc420181680, 0xc4202cdd40, 0x1c41de0, 0xc42000c488, 0x1c41e00, 0xc42000c4c8, 0xc42020be00, 0x5, 0x8, ...)
        /go/src/github.com/heptio/ark/pkg/restore/restore.go:257 +0xbfc
github.com/heptio/ark/pkg/controller.(*restoreController).runRestore(0xc42031c410, 0xc420181680, 0xc4204ea3f0, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:431 +0xaa1
github.com/heptio/ark/pkg/controller.(*restoreController).processRestore(0xc42031c410, 0xc4203ef760, 0x13, 0x18, 0xc42067cdb8)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:284 +0x705
github.com/heptio/ark/pkg/controller.(*restoreController).(github.com/heptio/ark/pkg/controller.processRestore)-fm(0xc4203ef760, 0x13, 0xc4204516c0, 0x16e1560)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:116 +0x3e
github.com/heptio/ark/pkg/controller.(*restoreController).processNextWorkItem(0xc42031c410, 0xc42008a700)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:204 +0xe0
github.com/heptio/ark/pkg/controller.(*restoreController).runWorker(0xc42031c410)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:191 +0x2b
github.com/heptio/ark/pkg/controller.(*restoreController).(github.com/heptio/ark/pkg/controller.runWorker)-fm()
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178 +0x2a
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4202d67a0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4208a7fa0, 0x3b9aca00, 0x0, 0x34630b8a001, 0xc4200af0e0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4202d67a0, 0x3b9aca00, 0xc4200af0e0)
        /go/src/github.com/heptio/ark/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/heptio/ark/pkg/controller.(*restoreController).Run.func2(0xc42031c410, 0x1c621e0, 0xc42039e880, 0xc4204ea5f0)
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:178 +0x68
created by github.com/heptio/ark/pkg/controller.(*restoreController).Run
        /go/src/github.com/heptio/ark/pkg/controller/restore_controller.go:177 +0x31d
nrb commented 6 years ago

Can you download the backup and provide the service account objects (if any) in it? It looks like there's a service account without annotations that may be triggering a logic bug.
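
For context, the panic in the trace is Go's standard failure mode for writing into a map that was never allocated. A minimal, self-contained repro of the same shape (the variable names are illustrative, not Ark's actual code):

package main

func main() {
	// A freshly created ServiceAccount has no annotations, so the
	// corresponding Go field is a nil map.
	var clusterAnnotations map[string]string

	// The backed-up ServiceAccount carries the example plugin's annotation.
	backupAnnotations := map[string]string{"ark.heptio.com/my-plugin": "1"}

	// Reading from a nil map is fine, but writing into one panics with
	// exactly the message seen in the logs above.
	for k, v := range backupAnnotations {
		clusterAnnotations[k] = v // panic: assignment to entry in nil map
	}
}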

nrb commented 6 years ago

I'm suspecting https://github.com/heptio/ark/blob/v0.9.3/pkg/restore/merge_service_account.go#L37-L38 results in a nil Annotations map from the cluster's copy of the service account, then https://github.com/heptio/ark/blob/v0.9.3/pkg/restore/merge_service_account.go#L63 tries to merge it, and https://github.com/heptio/ark/blob/v0.9.3/pkg/util/collections/map_utils.go#L128 doesn't check for a nil map on the first argument.
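
A minimal sketch of the nil-safe fix that diagnosis implies (not the actual Ark patch; the merge semantics here, where existing destination keys win, are an assumption based on the linked source):

package collections

// mergeMaps copies entries from source into destination, allocating the
// destination first if it is nil, so that a ServiceAccount with no
// annotations cannot trigger "assignment to entry in nil map".
func mergeMaps(destination, source map[string]string) map[string]string {
	if destination == nil {
		destination = make(map[string]string, len(source))
	}
	for k, v := range source {
		// Assumed semantics: keep existing destination values and only
		// fill in keys that are missing.
		if _, exists := destination[k]; !exists {
			destination[k] = v
		}
	}
	return destination
}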

rosskukulinski commented 6 years ago

@nrb here's a backup tgz:

nginx-example-no-do-data.tar.gz

My sonobuoy scan completed successfully:

Ran 143 of 996 Specs in 3884.585 seconds
SUCCESS! -- 143 Passed | 0 Failed | 0 Pending | 853 Skipped
PASS

Ginkgo ran 1 suite in 1h4m45.465635726s
Test Suite Passed
rosskukulinski commented 6 years ago
{
  "apiVersion": "v1",
  "kind": "ServiceAccount",
  "metadata": {
    "annotations": {
      "ark.heptio.com/my-plugin": "1"
    },
    "creationTimestamp": "2018-08-20T18:19:05Z",
    "name": "default",
    "namespace": "nginx-example",
    "resourceVersion": "8562",
    "selfLink": "/api/v1/namespaces/nginx-example/serviceaccounts/default",
    "uid": "85845add-a4a5-11e8-a9bb-d6573ae0af86"
  },
  "secrets": [
    {
      "name": "default-token-gbnv8"
    }
  ]
}
skriss commented 6 years ago

Yeah, @nrb, that sounds right. It's getting triggered because at some point @rosskukulinski did have the example plugin running, which put an annotation on the service account, so now that code branch is being exercised.

@rosskukulinski, to proceed with DO testing, I would delete and re-create the nginx-example namespace, use only that namespace for testing, and avoid our example plugin anywhere (it looks like you're not using it now, but were at some point in the past). That should get you around this issue and let you focus on DO.

nrb commented 6 years ago

If you can grab the YAML for the cluster's service account before deleting the namespace, just to confirm that there are no annotations, that'd be great @rosskukulinski. We'll still need to fix the merge function, but that extra confirmation would be nice.

rosskukulinski commented 6 years ago
k -n nginx-example get serviceaccount default -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: 2018-08-21T16:33:42Z
  name: default
  namespace: nginx-example
  resourceVersion: "129563"
  selfLink: /api/v1/namespaces/nginx-example/serviceaccounts/default
  uid: f77b700a-a55f-11e8-a9bb-d6573ae0af86
secrets:
- name: default-token-jz6qk
nrb commented 6 years ago

Closing the loop here:

1. The MergeMaps function will need to be updated so that it checks for a nil map before writing to it; annotations and other metadata written by plugins will cause crashes otherwise.
2. We found that the StackPointCloud image had the example plugins as well as their DigitalOcean volume plugin, which were writing the annotations.

$ docker run --entrypoint=/bin/sh --rm -ti  quay.io/stackpoint/ark-blockstore-digitalocean:latest
~ $ ls
bin      dev      etc      home     lib      media    mnt      plugins  proc     root     run      sbin     srv      sys      tmp      usr      var
~ $ cd plugins/
/plugins $ ls
ark-backupitemaction-log-and-annotate   ark-blockstore-digitalocean             ark-objectstore-file                    ark-restoreitemaction-log-and-annotate

I'll work on a test and fix for 1.
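
A hypothetical shape for that regression test (mergeMaps here refers to the nil-safe sketch earlier in this thread, not the function Ark actually ships):

package collections

import "testing"

// TestMergeMapsNilDestination exercises the nil-destination case from this
// issue: merging a backed-up annotation into a ServiceAccount whose
// Annotations map was never allocated.
func TestMergeMapsNilDestination(t *testing.T) {
	merged := mergeMaps(nil, map[string]string{"ark.heptio.com/my-plugin": "1"})
	if merged == nil || merged["ark.heptio.com/my-plugin"] != "1" {
		t.Fatalf("expected the annotation to be copied into a freshly allocated map, got %v", merged)
	}
}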

rosskukulinski commented 6 years ago

I opened an issue for (2) against the StackPointCloud digitalocean plugin: https://github.com/StackPointCloud/ark-plugin-digitalocean/issues/2

edevenport commented 6 years ago

Thanks @rosskukulinski - I saw the issue for (2) and am working on it now.

edevenport commented 6 years ago

Issue (2) has been resolved. I pushed a new container image with the example plugins removed.

nrb commented 6 years ago

Thanks for that @edevenport!