vultr / vultr-csi

Container Storage Interface (CSI) Driver for Vultr Block Storage
Apache License 2.0

[BUG] - unix /var/lib/kubelet/plugins/block.csi.vultr.com/csi.sock not accessible (for rook.io) #88

Open · defaultbranch opened this issue 2 years ago

defaultbranch commented 2 years ago

**Describe the bug**

Using rook.io, the rook-ceph-osd-prepare-* pods fail to set up a PersistentVolumeClaim.

"describe pod" finally reports the event (warning) "MapVolume.SetUpDevice failed for volume "pvc-c869a0057b0c4904" : kubernetes.io/csi: blockMapper.stageVolumeForBlock failed to check STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/block.csi.vultr.com/csi.sock: connect: connection refused"" from the kubelet.

**To Reproduce**

Steps to reproduce the behavior (NOTE: this setup works fine on Azure AKS, with only the storageClassName adjusted):

  1. Create a fresh Kubernetes cluster (one worker node is probably sufficient for reproduction).

  2. For the basic Rook setup: from the files in https://github.com/rook/rook/tree/master/deploy/examples, run kubectl apply -f for crds.yaml, common.yaml, and operator.yaml; this creates the CRDs, the Roles, and a rook-ceph-operator deployment/pod.

  3. Run kubectl apply -f for the following CephCluster YAML:

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: rook-ceph
      namespace: rook-ceph
    spec:
      cephVersion:
        # NOTE: see cluster.yaml in https://github.com/rook/rook.git for an up-to-date image version
        image: quay.io/ceph/ceph:v17.2.1
        allowUnsupported: false
      dataDirHostPath: /var/lib/rook
      skipUpgradeChecks: false
      continueUpgradeAfterChecksEvenIfNotHealthy: false
      waitTimeoutForHealthyOSDInMinutes: 10
      mon:
        count: 3
        allowMultiplePerNode: false
      mgr:
        count: 2
        allowMultiplePerNode: false
        modules:
          - name: pg_autoscaler
            enabled: true
      dashboard:
        enabled: true
        ssl: true
      storage:
        storageClassDeviceSets:
          - name: set1
            # NOTE: change this to the number of nodes that should host an OSD
            count: 1
            portable: false
            tuneDeviceClass: false
            encrypted: false
            volumeClaimTemplates:
              - metadata:
                  name: data
                spec:
                  storageClassName: vultr-block-storage-hdd
                  accessModes:
                    - ReadWriteOnce
                  # NOTE: rook seems to expect a raw, unmounted device "volumeMode: Block"
                  volumeMode: Block
                  resources:
                    requests:
                      storage: 40Gi
  4. See the events in kubectl -n rook-ceph describe pod rook-ceph-osd-prepare-* once the pod(s) get stuck.

**Expected behavior**

The rook-ceph-osd-prepare-* pods should disappear after a short time, and corresponding rook-ceph-osd-* pods (without -prepare-) should remain in their place.

**Additional context**

I was using VKE with Kubernetes 1.23.x.

ddymko commented 2 years ago

@defaultbranch

> NOTE: rook seems to expect a raw, unmounted device "volumeMode: Block"
>
> volumeMode: Block

IIRC, vultr-csi doesn't support a raw, unmounted device:

https://github.com/vultr/vultr-csi/blob/master/driver/mounter.go#L113
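For context, the CSI spec models this in the VolumeCapability that the kubelet sends with each node RPC. Below is a minimal sketch (not vultr-csi code) of the branch a node plugin takes, using the standard container-storage-interface Go bindings; the helper name describeCapability is hypothetical:

    // Minimal sketch (not vultr-csi code): how a CSI node plugin tells a
    // raw-block request apart from a filesystem request. The access type of a
    // CSI VolumeCapability is either Block or Mount; a driver that only
    // implements the Mount path (format + mount) will fail PVCs that use
    // volumeMode: Block.
    package driver

    import (
        "fmt"

        csi "github.com/container-storage-interface/spec/lib/go/csi"
    )

    // describeCapability is a hypothetical helper showing the branch a driver
    // takes in NodeStageVolume/NodePublishVolume.
    func describeCapability(cap *csi.VolumeCapability) string {
        switch {
        case cap.GetBlock() != nil:
            // volumeMode: Block -- expose the raw device node at the
            // target path; do not format it.
            return "raw block volume"
        case cap.GetMount() != nil:
            // volumeMode: Filesystem -- format (if needed) and mount.
            return fmt.Sprintf("filesystem volume (fstype=%s)", cap.GetMount().GetFsType())
        default:
            return "unknown access type"
        }
    }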

kaznak commented 8 months ago

Hi, I have encountered the same situation, and I have a couple of questions about this matter.

  1. Do you plan to support volumeMode: Block in the future?
  2. Is it possible to support volumeMode: Block by changing this CSI implementation?

Thank you in advance.

cuppett commented 8 months ago

> I have a couple of questions about this matter.
>
> 1. Do you plan to support volumeMode: Block in the future?
> 2. Is it possible to support volumeMode: Block by changing this CSI implementation?

Unofficially: I'd expect this should be possible (with changes). When you attach a block device to a Vultr instance normally, you get the raw device and can do whatever you like with it (create LVM volumes, basic filesystems, use LUKS, etc.).
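To make that concrete: on the node side, supporting volumeMode: Block mostly means bind-mounting the attached device node onto the target path instead of formatting and mounting it. Here is a rough, hypothetical sketch (assuming golang.org/x/sys/unix; the function name publishBlockVolume and its error handling are illustrative only, not a vultr-csi patch):

    // Hypothetical sketch: for volumeMode: Block the node plugin bind-mounts
    // the attached device node (e.g. /dev/vdb) onto the target path it
    // receives in NodePublishVolume, rather than formatting the device.
    package driver

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    // publishBlockVolume is an illustrative name; error handling is minimal.
    func publishBlockVolume(devicePath, targetPath string) error {
        // For a block volume the mount target must be an ordinary file,
        // not a directory.
        f, err := os.OpenFile(targetPath, os.O_CREATE, 0o660)
        if err != nil {
            return fmt.Errorf("create target file %s: %w", targetPath, err)
        }
        f.Close()

        // Bind-mount the device node onto the target file so the workload
        // sees the raw, unformatted device.
        if err := unix.Mount(devicePath, targetPath, "", unix.MS_BIND, ""); err != nil {
            return fmt.Errorf("bind mount %s onto %s: %w", devicePath, targetPath, err)
        }
        return nil
    }

A real change would also need the driver to advertise raw-block support in its capabilities and to skip the filesystem staging step for block volumes, so treat this as a starting point rather than a patch.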