vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Crash with SIGSEGV while finalizing backup of a PVC with CSI on AWS EKS #5207

Closed: Va1 closed this issue 2 years ago

Va1 commented 2 years ago

What steps did you take and what happened: Velero 1.9.0 is deployed on AWS EKS 1.22 via the official Helm chart v2.31.0. Plugins: AWS v1.5.0, CSI v0.3.0.

During a backup, right after the CSI snapshots are created (VolumeSnapshot and VolumeSnapshotContent both show the proper status, and the EBS snapshot displays as ready in the AWS console) and the backup is about to wrap up, Velero crashes with SIGSEGV. The backup stays in a Failed status.

Retried multiple times and it always ends this way.

What did you expect to happen: Backup succeeds and is restorable.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, refer to velero debug --help

I cannot provide this at the moment.

But here are the logs printed just before the crash:

2022/08/11 16:23:56  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:01  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:06  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:11  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
time="2022-08-11T16:24:12Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-11T16:24:12Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
time="2022-08-11T16:24:12Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:130"
time="2022-08-11T16:24:12Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:115"
2022/08/11 16:24:16  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
2022/08/11 16:24:21  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
I0811 16:24:23.683210       1 request.go:665] Waited for 1.046988495s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/apiextensions.k8s.io/v1?timeout=32s
2022/08/11 16:24:26  info Waiting for CSI driver to reconcile volumesnapshot ohlc/velero-questdb-questdb-0-kkltb. Retrying in 5s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19bfdcd]

goroutine 5971 [running]:
github.com/vmware-tanzu/velero/pkg/controller.(*backupController).deleteVolumeSnapshot.func1(0xc00045f040)
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/backup_controller.go:931 +0xad
created by github.com/vmware-tanzu/velero/pkg/controller.(*backupController).deleteVolumeSnapshot
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/backup_controller.go:927 +0xf7

One of the backups in question, in YAML format:

apiVersion: velero.io/v1
kind: Backup
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-delete-policy: before-hook-creation
    velero.io/source-cluster-k8s-gitversion: v1.22.10-eks-84b4fe6
    velero.io/source-cluster-k8s-major-version: "1"
    velero.io/source-cluster-k8s-minor-version: 22+
  creationTimestamp: "2022-08-11T23:00:39Z"
  generation: 5
  labels:
    app.kubernetes.io/instance: velero
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: velero
    helm.sh/chart: velero-2.31.0
    velero.io/schedule-name: velero-questdb-pvc
    velero.io/storage-location: default
  name: velero-questdb-pvc-20220811230039
  namespace: velero
  resourceVersion: "29774925"
  uid: 6358f885-1184-45a6-922b-9b87b33054c1
spec:
  defaultVolumesToRestic: false
  hooks: {}
  includeClusterResources: true
  includedNamespaces:
  - ohlc
  includedResources:
  - pvc
  - pv
  labelSelector:
    matchLabels:
      app.kubernetes.io/instance: questdb
      app.kubernetes.io/name: questdb
  metadata: {}
  snapshotVolumes: true
  storageLocation: default
  ttl: 168h0m0s
  volumeSnapshotLocations:
  - default
status:
  completionTimestamp: "2022-08-11T23:00:49Z"
  expiration: "2022-08-18T23:00:39Z"
  failureReason: get a backup with status "InProgress" during the server starting,
    mark it as "Failed"
  formatVersion: 1.1.0
  phase: Failed
  progress:
    itemsBackedUp: 2
    totalItems: 2
  startTimestamp: "2022-08-11T23:00:39Z"
  version: 1

Describe output of a VolumeSnapshot created by the backup (one of):

Name:         velero-questdb-questdb-0-x84zb
Namespace:    ohlc
Labels:       velero.io/backup-name=velero-questdb-pvc-20220811230039
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshot
Metadata:
  Creation Timestamp:  2022-08-11T23:00:39Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
    snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  Generate Name:  velero-questdb-questdb-0-
  Generation:     1
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection":
          v:"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection":
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-11T23:00:39Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
        f:labels:
          .:
          f:velero.io/backup-name:
      f:spec:
        .:
        f:source:
          .:
          f:persistentVolumeClaimName:
        f:volumeSnapshotClassName:
    Manager:      velero-plugin-for-csi
    Operation:    Update
    Time:         2022-08-11T23:00:39Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:boundVolumeSnapshotContentName:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
    Manager:         Go-http-client
    Operation:       Update
    Subresource:     status
    Time:            2022-08-11T23:00:40Z
  Resource Version:  29774856
  UID:               56d87f8f-5a15-4c36-9930-35359c2c23c1
Spec:
  Source:
    Persistent Volume Claim Name:  questdb-questdb-0
  Volume Snapshot Class Name:      questdb-vsc
Status:
  Bound Volume Snapshot Content Name:  snapcontent-56d87f8f-5a15-4c36-9930-35359c2c23c1
  Creation Time:                       2022-08-11T23:00:40Z
  Ready To Use:                        true
  Restore Size:                        50Gi
Events:                                <none>

Describe output of a VolumeSnapshotContent created by the backup (one of):

Name:         snapcontent-56d87f8f-5a15-4c36-9930-35359c2c23c1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  snapshot.storage.k8s.io/v1
Kind:         VolumeSnapshotContent
Metadata:
  Creation Timestamp:  2022-08-11T23:00:39Z
  Finalizers:
    snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection
  Generation:  1
  Managed Fields:
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"snapshot.storage.kubernetes.io/volumesnapshotcontent-bound-protection":
      f:spec:
        .:
        f:deletionPolicy:
        f:driver:
        f:source:
          .:
          f:volumeHandle:
        f:volumeSnapshotClassName:
        f:volumeSnapshotRef:
          .:
          f:apiVersion:
          f:kind:
          f:name:
          f:namespace:
          f:resourceVersion:
          f:uid:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-08-11T23:00:40Z
    API Version:  snapshot.storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:creationTime:
        f:readyToUse:
        f:restoreSize:
        f:snapshotHandle:
    Manager:         Go-http-client
    Operation:       Update
    Subresource:     status
    Time:            2022-08-11T23:00:40Z
  Resource Version:  29774845
  UID:               dd15120a-fa73-4a9f-b3d7-28102e169489
Spec:
  Deletion Policy:  Delete
  Driver:           ebs.csi.aws.com
  Source:
    Volume Handle:             vol-069935c75bcc9a2db
  Volume Snapshot Class Name:  questdb-vsc
  Volume Snapshot Ref:
    API Version:       snapshot.storage.k8s.io/v1
    Kind:              VolumeSnapshot
    Name:              velero-questdb-questdb-0-x84zb
    Namespace:         ohlc
    Resource Version:  29774811
    UID:               56d87f8f-5a15-4c36-9930-35359c2c23c1
Status:
  Creation Time:    1660258840065000000
  Ready To Use:     true
  Restore Size:     53687091200
  Snapshot Handle:  snap-08a0e7632dac36f3f
Events:             <none>

Chart values overrides:

configuration:
  features: EnableCSI
  provider: aws
  backupStorageLocation:
    name: default
    provider: aws
    bucket: ***-velero-backup-storage
    config:
      region: eu-central-1
  volumeSnapshotLocation:
    name: default
    provider: aws
    config:
      region: eu-central-1

credentials:
  useSecret: false

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.5.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
  - name: velero-plugin-for-csi
    image: velero/velero-plugin-for-csi:v0.3.0
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

schedules:
  questdb-pvc:
    disabled: false
    schedule: "0 23 * * 1,2,3,4,5"
    csiSnapshotTimeout: 60m
    template:
      ttl: "168h"
      includedNamespaces:
        - ohlc
      includedResources:
        - pvc
        - pv
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: questdb
          app.kubernetes.io/instance: questdb
      includeClusterResources: true
      snapshotVolumes: true
      storageLocation: default
      volumeSnapshotLocations:
        - default

serviceAccount:
  server:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::***:role/***-velero

blackpiglet commented 2 years ago

@Va1 I think this failure is caused by the backup including only the PVC and PV resources. The Velero CSI plugin uses BackupItemActions to implement its logic for PVC, VolumeSnapshot, VolumeSnapshotContent and VolumeSnapshotClass, so when the backup leaves out the last three, the plugin's code for VolumeSnapshot, VolumeSnapshotContent and VolumeSnapshotClass is never run.

The SIGSEGV happens because the step in which the CSI plugin waits for the VolumeSnapshotContent to be created and to receive a snapshot handle lives in the VolumeSnapshot BackupItemAction. Since VolumeSnapshot is not included in the backup, this code does not run. Right after the VolumeSnapshot is created, its Status is still nil. Even after checkVolumeSnapshotReadyToUse has run, the original volumeSnapshots array is not updated, so its Status section is still nil.

I think I can change the code to make it more robust, but a restore still needs VolumeSnapshot, VolumeSnapshotContent and VolumeSnapshotClass to be included in the backup in order to work.
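
The kind of robustness change mentioned above could look roughly like the following Go sketch. It is not Velero's actual code: VolumeSnapshot and VolumeSnapshotStatus are simplified stand-ins for the snapshot API types, boundContentName and errNotReconciled are hypothetical names, and the sketch assumes Status is a pointer that stays nil until the CSI driver has reconciled the snapshot.

```go
package main

import (
	"errors"
	"fmt"
)

// VolumeSnapshotStatus is a simplified stand-in (assumption) that keeps only
// the field relevant to this sketch.
type VolumeSnapshotStatus struct {
	BoundVolumeSnapshotContentName *string
}

// VolumeSnapshot is a simplified stand-in (assumption). Status is nil until
// the snapshot has been reconciled.
type VolumeSnapshot struct {
	Name   string
	Status *VolumeSnapshotStatus
}

// errNotReconciled is a hypothetical sentinel error for snapshots whose
// status has not been populated yet.
var errNotReconciled = errors.New("volume snapshot has no bound content yet")

// boundContentName returns the name of the bound VolumeSnapshotContent.
// It returns an error instead of dereferencing a nil pointer when the
// snapshot, its Status, or the bound name is missing.
func boundContentName(vs *VolumeSnapshot) (string, error) {
	if vs == nil {
		return "", fmt.Errorf("volumesnapshot <nil>: %w", errNotReconciled)
	}
	if vs.Status == nil || vs.Status.BoundVolumeSnapshotContentName == nil {
		return "", fmt.Errorf("volumesnapshot %s: %w", vs.Name, errNotReconciled)
	}
	return *vs.Status.BoundVolumeSnapshotContentName, nil
}
```

A caller that iterates over the snapshots could log this error and skip the affected snapshot rather than letting the whole server process panic.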

I suggest creating the backup with this command: velero backup create csi-test --include-namespaces=ohlc

sseago commented 2 years ago

@blackpiglet That seems reasonable. You're right -- those other resources must also be included in the backup, but Velero should fail gracefully with a useful error message rather than crashing like that.
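
One way to fail gracefully is to validate the resource filter before any snapshot is taken. The Go sketch below only illustrates the idea and is not taken from Velero: validateCSIResources and csiResources are hypothetical names, the lowercase plural resource spellings are an assumption, and plain string matching stands in for the API discovery a real implementation would need in order to resolve shortcuts and fully qualified names. An empty list is treated as "include everything", which matches how an unspecified includedResources field behaves.

```go
package main

import (
	"fmt"
	"strings"
)

// csiResources lists the resources that the thread says must be part of a
// CSI backup. The exact spellings are an assumption for this sketch.
var csiResources = []string{
	"volumesnapshotclasses",
	"volumesnapshots",
	"volumesnapshotcontents",
}

// validateCSIResources returns an error naming every CSI resource that is
// missing from includedResources. An empty list or a "*" entry means that
// all resources are included, so nothing can be missing.
func validateCSIResources(includedResources []string) error {
	if len(includedResources) == 0 {
		return nil
	}
	included := make(map[string]bool, len(includedResources))
	for _, r := range includedResources {
		if r == "*" {
			return nil
		}
		included[strings.ToLower(r)] = true
	}
	var missing []string
	for _, r := range csiResources {
		if !included[r] {
			missing = append(missing, r)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("CSI snapshots need these resources in the backup: %s",
			strings.Join(missing, ", "))
	}
	return nil
}
```

With the filter from this issue (pvc and pv only), the function reports all three CSI resources as missing, which is the kind of message that would have pointed to the misconfiguration directly.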

Va1 commented 2 years ago

@blackpiglet adding VolumeSnapshotClass, VolumeSnapshot and VolumeSnapshotContent to backup included resources resolved the issue, thank you.

As outlined by @sseago, I agree that this should ideally be covered by both the documentation and the error message.

Is this something I can help with by submitting a pull request with a fix?

blackpiglet commented 2 years ago

@Va1 Sure, contributions are welcome.