vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Restore with CSI volume failed with crash #7874

Closed blackpiglet closed 3 months ago

blackpiglet commented 3 months ago

What steps did you take and what happened:

A restore that included a CSI volume crashed with a panic. The stack trace is:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x73b0ae]

goroutine 415 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x281c120?, 0x4727010?})
        /usr/local/go/src/runtime/panic.go:770 +0x132
k8s.io/apimachinery/pkg/api/resource.(*Quantity).ScaledValue(0xc0013de900?, 0x864640?)
        /go/pkg/mod/k8s.io/apimachinery@v0.29.0/pkg/api/resource/quantity.go:768 +0xe
k8s.io/apimachinery/pkg/api/resource.(*Quantity).Value(0xc0013de900?)
        /go/pkg/mod/k8s.io/apimachinery@v0.29.0/pkg/api/resource/quantity.go:754 +0x15
github.com/vmware-tanzu/velero/internal/volume.(*RestoreVolumeInfoTracker).Result(0xc001678690)
        /go/src/github.com/vmware-tanzu/velero/internal/volume/volumes_information.go:779 +0xb05
github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).runValidatedRestore(0xc000756180, 0xc0010ac2c8, {0xc000862388?, 0xc000bc8a80?}, 0x0)
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:655 +0x2225
github.com/vmware-tanzu/velero/pkg/controller.(*restoreReconciler).Reconcile(0xc000756180, {0x3145040, 0xc0017ceb40}, {{{0xc0014ce696?, 0x0?}, {0xc001800390?, 0xc0017ceb40?}}})
        /go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:264 +0xb45
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x314d1f8?, {0x3145040?, 0xc0017ceb40?}, {{{0xc0014ce696?, 0xb?}, {0xc001800390?, 0x0?}}})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005f0780, {0x3145078, 0xc00086e6e0}, {0x297d680, 0xc00032abc0})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005f0780, {0x3145078, 0xc00086e6e0})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 259
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:223 +0x50c

What did you expect to happen: The restore should complete without error.

The following information will help us better understand what's going on: The code here should check whether the status and the restore size are nil before using them. https://github.com/vmware-tanzu/velero/blob/21366795d147348b66a31a9e905de202222c9492/internal/volume/volumes_information.go#L779
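A minimal sketch of the kind of guard being asked for, under stated assumptions: Quantity, VolumeSnapshotStatus, and VolumeSnapshot below are simplified stand-ins for the real Kubernetes types, and restoreSizeBytes is a hypothetical helper, not Velero's actual code. It only illustrates checking both pointers before calling Value().

```go
package main

import "fmt"

// Quantity is a hypothetical stand-in for resource.Quantity from
// k8s.io/apimachinery; only what the sketch needs is modeled.
type Quantity struct{ bytes int64 }

// Value dereferences its receiver, so calling it on a nil *Quantity
// panics with a nil pointer dereference.
func (q *Quantity) Value() int64 { return q.bytes }

// VolumeSnapshotStatus and VolumeSnapshot mirror only the optional
// fields relevant here; both pointers may legitimately be nil.
type VolumeSnapshotStatus struct{ RestoreSize *Quantity }

type VolumeSnapshot struct{ Status *VolumeSnapshotStatus }

// restoreSizeBytes (hypothetical helper) returns the restore size and
// true only when both Status and RestoreSize are set.
func restoreSizeBytes(vs *VolumeSnapshot) (int64, bool) {
	if vs == nil || vs.Status == nil || vs.Status.RestoreSize == nil {
		return 0, false
	}
	return vs.Status.RestoreSize.Value(), true
}

func main() {
	snapshots := []*VolumeSnapshot{
		{},                                // Status not populated
		{Status: &VolumeSnapshotStatus{}}, // RestoreSize unknown
		{Status: &VolumeSnapshotStatus{RestoreSize: &Quantity{bytes: 1 << 30}}},
	}
	for _, vs := range snapshots {
		size, ok := restoreSizeBytes(vs)
		fmt.Println(size, ok)
	}
}
```

Returning a boolean alongside the size lets the caller distinguish "unknown" from a reported size of 0, which may matter when deciding what to record in the volume info.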

reasonerjt commented 3 months ago

The testbed was destroyed, so I didn't get a chance to check in which cases restoreSize may be nil, but I'll add some nil checks where the volumeInfo is generated.

msfrucht commented 3 months ago

@reasonerjt @yuanqijing The fact that VolumeSnapshot.Status.RestoreSize can be 0 or nil comes from the CSI spec:

// Information about a specific snapshot.
message Snapshot {
  // This is the complete size of the snapshot in bytes. The purpose of
  // this field is to give CO guidance on how much space is needed to
  // create a volume from this snapshot. The size of the volume MUST NOT
  // be less than the size of the source snapshot. This field is
  // OPTIONAL. If this field is not set, it indicates that this size is
  // unknown. The value of this field MUST NOT be negative and a size of
  // zero means it is unspecified.
  int64 size_bytes = 1;

Typically in these situations my organization has reused the PVC size on restore when the restore size is unavailable or 0. So far we have had no issues with that.
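The fallback described above could be sketched roughly as follows. This is an illustration only: effectiveRestoreSize is a hypothetical helper, sizes are modeled as plain byte counts rather than resource.Quantity values, and using the snapshot's size unchanged whenever it is reported is an assumption of the sketch, not something stated in this thread.

```go
package main

import "fmt"

// effectiveRestoreSize (hypothetical helper) picks the size in bytes to
// use on restore. A nil or non-positive snapshot restore size is treated
// as unknown, in which case the PVC's requested size is reused.
func effectiveRestoreSize(snapshotRestoreSize *int64, pvcRequestBytes int64) int64 {
	if snapshotRestoreSize == nil || *snapshotRestoreSize <= 0 {
		return pvcRequestBytes
	}
	// Assumption of this sketch: a reported size is used as is.
	return *snapshotRestoreSize
}

func main() {
	pvcRequest := int64(10 << 30) // 10 GiB requested by the PVC
	zero := int64(0)
	reported := int64(20 << 30)

	fmt.Println(effectiveRestoreSize(nil, pvcRequest))       // unknown: falls back to the PVC size
	fmt.Println(effectiveRestoreSize(&zero, pvcRequest))     // unspecified: falls back to the PVC size
	fmt.Println(effectiveRestoreSize(&reported, pvcRequest)) // reported size is used
}
```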

It has been a long time since I've seen VolumeSnapshot.Status.RestoreSize set to 0, and at the time that was due to a bug in early versions of the IBM Spectrum Scale CSI driver.

The PVC size request turns into the CapacityRange.required_bytes field. That field is supposedly optional, but I'm not aware of how well most drivers deal with a size request of 0.