
[BUG] Longhorn gives 500 error when trying to provision a volume created using a snapshot. #4785

Closed masteryyh closed 10 months ago

masteryyh commented 1 year ago

Describe the bug

In Harvester, use a snapshot to create a volume and, while the volume is still in the Pending state, immediately attach it to a VM. Start the VM: the VM gets stuck in the Starting state. SSH into one of the nodes and run kubectl describe pvc <volume-name> to see the following events from Longhorn:

Events:
  Type     Reason                Age                From                                                                                      Message
  ----     ------                ----               ----                                                                                      -------
  Warning  ProvisioningFailed    23s (x4 over 37s)  driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running, code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  8s (x5 over 37s)   persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
  Normal   Provisioning          8s (x8 over 37s)   driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  External provisioner is provisioning volume for claim "default/restored"
  Warning  ProvisioningFailed    8s (x4 over 37s)   driver.longhorn.io_csi-provisioner-77b757f445-6gvqc_f518389e-c9b8-4d09-abd4-8e143c33965e  failed to provision volume with StorageClass "longhorn-image-8rtv9": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running] from [http://longhorn-backend:9500/v1/volumes]

To Reproduce

Reproduce steps here (see the full reproduction steps quoted in ChanYiLin's comment below).

Expected behavior

The volume should be attached to the VM after it is provisioned by Longhorn, and the VM should boot up without problems.

Log or Support bundle

longhorn-support-bundle_902bc133-4666-44c4-8e51-093f4093bfdf_2022-10-27T01-38-29Z.zip

Environment

guangbochen commented 1 year ago

@innobead we may need to backport this issue to the LH v1.4.x milestone. For Harvester, the planned release date is April/04 with v1.2.0. Can you please help double-check whether this is possible? Thanks.

innobead commented 1 year ago

@guangbochen It's planned for 1.5.0, so it will naturally be backported to 1.4.x.

innobead commented 1 year ago

@PhanLe1010 Please help check this first to see the cause. Thanks.

PhanLe1010 commented 1 year ago

It looks like the source snapshot is on a detached volume: unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running. Could we verify that we don't have this problem when the source volume is in the attached state? @masteryyh @weizhe0422

Btw, provisioning a new volume from a snapshot of a detached volume will require the enhancement https://github.com/longhorn/longhorn-manager/pull/1541. This is a big feature, so I think it is not possible to backport it to 1.4.x. cc @innobead
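
If needed, the state of the source volume can be checked directly on the Longhorn volume CR (a minimal sketch, assuming Longhorn is installed in the longhorn-system namespace; the volume name is taken from the error message above):

  # Print the current state of the source volume; provisioning from its
  # snapshot only works here while this reports "attached"
  kubectl -n longhorn-system get volumes.longhorn.io \
    pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a \
    -o jsonpath='{.status.state}'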

innobead commented 1 year ago

It looks like the source snapshot is on a detached volume: unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running. Could we verify that we don't have this problem when the source volume is in the attached state? @masteryyh @weizhe0422

@mantissahz Could you help check this part first? I assume there should be no issues with the attached volume. @masteryyh will also help clarify the reproduce steps and update here.

innobead commented 1 year ago

Btw, provisioning a new volume from a snapshot of a detached volume will require the enhancement longhorn/longhorn-manager#1541. This is a big feature, so I think it is not possible to backport it to 1.4.x. cc @innobead

YES, this new behavior is only available in 1.5.0. Currently, we only need to check whether the existing behavior works as expected, i.e. creating a volume from a snapshot of a running volume should work.

mantissahz commented 1 year ago

Result:

There are no issues with the attached volume: a PVC could be created normally from a VolumeSnapshot CR of the attached volume. After scaling the deployment down to 0, the volume is detached, and creating a PVC from a VolumeSnapshot CR of the detached volume then fails.
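
The detach step above can be reproduced by scaling the workload down (a sketch; the deployment name mysql is an assumption, inferred from the Longhorn example's mysql-pvc used below):

  # Scale the example deployment to 0 replicas; Longhorn then detaches the volume
  kubectl scale deployment mysql --replicas=0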

Steps:

  1. Enable CSI snapshot support.
  2. Use the Longhorn deployment example to create a PVC, a PV, and a deployment.
  3. Create a VolumeSnapshotClass with the manifest:

     kind: VolumeSnapshotClass
     apiVersion: snapshot.storage.k8s.io/v1
     metadata:
       name: longhorn-snapshot-vsc
     driver: driver.longhorn.io
     deletionPolicy: Delete
     parameters:
       type: snap

  4. Create the VolumeSnapshot with the manifest:

     apiVersion: snapshot.storage.k8s.io/v1
     kind: VolumeSnapshot
     metadata:
       name: test-csi-volume-snapshot-longhorn-snapshot
     spec:
       volumeSnapshotClassName: longhorn-snapshot-vsc
       source:
         persistentVolumeClaimName: mysql-pvc

  5. Create the PVC with the manifest (a verification sketch follows these steps):

     apiVersion: v1
     kind: PersistentVolumeClaim
     metadata:
       name: restore-from-csi-snapshot-pvc
     spec:
       storageClassName: longhorn
       dataSource:
         name: test-csi-volume-snapshot-longhorn-snapshot
         kind: VolumeSnapshot
         apiGroup: snapshot.storage.k8s.io
       accessModes:
         - ReadWriteOnce
       resources:
         requests:
           storage: 2Gi
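
A way to verify each stage (a sketch using standard kubectl commands and the object names from the manifests above):

  # The snapshot must report readyToUse=true before the restore can succeed
  kubectl get volumesnapshot test-csi-volume-snapshot-longhorn-snapshot \
    -o jsonpath='{.status.readyToUse}'

  # The restored PVC stays Pending when provisioning fails; its describe
  # output then contains 500-error events like those at the top of this issue
  kubectl get pvc restore-from-csi-snapshot-pvc
  kubectl describe pvc restore-from-csi-snapshot-pvc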

As mentioned in this docs section: csi-volume-snapshot-associated-with-longhorn-snapshot/#current-limitation

innobead commented 1 year ago

Thanks @mantissahz.

@masteryyh If this is the same case for you, i.e. you can only create a volume from a snapshot of a running volume, then it's the current behavior.

As for creating a volume from an inactive snapshot of a detached volume, this will be improved in 1.5.0.

cc @guangbochen

masteryyh commented 1 year ago

I'm trying to bump the Longhorn version in Harvester to v1.4.1 and repeat the reproduce steps here. The VM still gets stuck in the Starting phase, and this appears in the VM status:

volumeSnapshotStatuses:
    - enabled: false
      name: disk-1
      reason: 2 matching VolumeSnapshotClasses for longhorn-image-4m88v
    - enabled: false
      name: cloudinitdisk
      reason: Snapshot is not supported for this volumeSource type [cloudinitdisk]

The volume can be provisioned successfully, though.

UPDATE: Updated Longhorn to v1.4.1 and snapshot-controller to v6.2.1, and the problem still exists :(
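
The 2 matching VolumeSnapshotClasses for longhorn-image-4m88v reason suggests that more than one VolumeSnapshotClass matches the driver. A quick way to inspect this (a sketch with standard kubectl; the annotation shown is the upstream default-class marker):

  # List all snapshot classes with their drivers; two entries for
  # driver.longhorn.io would explain the "2 matching" message
  kubectl get volumesnapshotclass \
    -o custom-columns=NAME:.metadata.name,DRIVER:.driver

  # One class can be marked as the default with:
  #   snapshot.storage.kubernetes.io/is-default-class: "true"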

hunghvu commented 1 year ago

In regard to this bug, does this mean Harvester snapshot restore is unusable as of v1.1.2? In my case, snapshot restoration simply does not work.

[screenshots of the failing snapshot restore attached]

innobead commented 1 year ago

@hunghvu Your case looks different: the error is due to more than one engine existing, which just means the source volume could be a migrating volume. I suggest creating an issue in the Harvester GitHub repo instead to clarify the cause there.
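
One way to check how many engines a volume currently has (a sketch, assuming the longhorn-system namespace and Longhorn's longhornvolume label on engine CRs; <volume-name> is a placeholder):

  # A healthy non-migrating volume has exactly one engine; a migrating
  # volume temporarily has two
  kubectl -n longhorn-system get engines.longhorn.io \
    -l longhornvolume=<volume-name>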

innobead commented 1 year ago

It looks like the source snapshot is on a detached volume: unable to create volume: unable to create volume pvc-bf336afd-ad19-484d-bac8-60fbacedbfd6: failed to verify data source: cannot get client for volume pvc-6b2fd344-3843-48e8-9fff-bc4e5a090b3a: engine is not running. Could we verify that we don't have this problem when the source volume is in the attached state? @masteryyh @weizhe0422

@masteryyh Can you help answer the questions from @PhanLe1010? Thanks.

masteryyh commented 1 year ago

If I understand this correctly, the snapshot is on the attached volume 🤔 When I was testing this, the volume was already attached.

FrankYang0529 commented 10 months ago

I can't reproduce the issue in Harvester v1.2.1 with LH v1.4.3 with the following steps:

  1. Create a VM.
  2. Create a snapshot for the volume in the VM.
  3. After the snapshot is finished, stop the VM. The volume is detached automatically.
  4. Create another PVC from the snapshot and update the VM to use the new PVC. (We cannot use a pending PVC in the GUI, so I updated the VM YAML directly.)
  5. Before the new PVC is bound, start the VM.
  6. The VM comes up without error.

PhanLe1010 commented 10 months ago

@FrankYang0529

Create another PVC from the snapshot and update the VM to use the new PVC. (We cannot use a pending PVC in the GUI, so I updated the VM YAML directly.)

How did you update the VM? Did you keep both old PVC and new PVC?

FrankYang0529 commented 10 months ago

Create another PVC from the snapshot and update the VM to use the new PVC. (We cannot use a pending PVC in the GUI, so I updated the VM YAML directly.)

How did you update the VM? Did you keep both old PVC and new PVC?

Yes, I used kubectl edit to update the VM and kept both PVCs.
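
For clarity, a minimal sketch of what the edited section of the VM spec could look like, assuming a KubeVirt-style VirtualMachine; the disk and claim names are hypothetical:

  # Excerpt of spec.template.spec after `kubectl edit`, keeping both PVCs
  domain:
    devices:
      disks:
        - name: old-disk          # existing volume
          disk:
            bus: virtio
        - name: restored-disk     # volume restored from the snapshot
          disk:
            bus: virtio
  volumes:
    - name: old-disk
      persistentVolumeClaim:
        claimName: original-pvc
    - name: restored-disk
      persistentVolumeClaim:
        claimName: restored-from-snapshot-pvc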

ChanYiLin commented 10 months ago

I have discussed this with @FrankYang0529, and we both agree the original reproduction steps were a bit unusual here.

Reproduce steps:

  1. Install a Harvester 1.1.0-rc3 environment, with the snapshot-controller image replaced with k8s.gcr.io/sig-storage/snapshot-controller:v5.0.1 and the CRDs edited according to here;
  2. Create a VM with a 10Gi (or other size) volume;
  3. Create a snapshot of the volume;
  4. Use the snapshot to create a volume;
  5. While the volume is not yet provisioned by Longhorn, replace the VM's volume with the restored volume;
  6. Try to boot up the VM; the VM gets stuck in the Starting step;
  7. SSH into one of the nodes; kubectl describe pvc shows the warning message given by Longhorn.

In step 5.,

ChanYiLin commented 10 months ago

cc @innobead I think we can close the issue for now, since @FrankYang0529 has tested it and it works as expected now.

innobead commented 10 months ago

@ChanYiLin Let's add the wontfix label as well.