CSI: Node.ExpandVolume gets wrong staging path when volume is in use

dani commented 3 hours ago

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

AlmaLinux 9.4 Using Nomad from pre-built Linux AMD64 binaries Ceph CSI 3.12.2

Issue

Most operations with Ceph RBD volumes are working (so I guess my setup is correct), except for one thing: resizing a volume while it's in use (by altering min_capacity and max_capacity, then registering the volume again with nomad volume register volume.hcl). For example, if I try to resize the postgres-data[1] volume in the "poc" namespace:

Error registering volume: Unexpected response code: 500 (rpc error: unable to update volume: 1 error occurred:
    * CSI.NodeExpandVolume error: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Failed as missing stash (internal open /local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json: no such file or directory))

Logs from the corresponding ceph-csi node show the same error:

2024-11-15 13:47:29.000 E1115 13:47:29.863468       1 utils.go:245] ID: 7874 Req-ID: 0001-0024-cbfda0a8-461a-4577-9be1-e229acb2bac5-0000000000000006-1db44304-55c6-4200-854f-d315a86375db GRPC error: rpc error: code = Internal desc = Failed as missing stash (internal open /local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json: no such file or directory)
2024-11-15 13:47:29.000 E1115 13:47:29.863457       1 nodeserver.go:1136] ID: 7874 Req-ID: 0001-0024-cbfda0a8-461a-4577-9be1-e229acb2bac5-0000000000000006-1db44304-55c6-4200-854f-d315a86375db failed to find image metadata: Failed as missing stash (internal open /local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json: no such file or directory)

The problem is that the CSI node gets the staging path as /local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/ but the real staging path is /local/csi/staging/poc/postgres-data[1]/rw-file-system-single-node-writer/ (the name of the namespace the volume is registered in is missing).

Inside the Ceph RBD node

sh-5.1# ls -l /local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json
ls: cannot access '/local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json': No such file or directory
sh-5.1# ls -l /local/csi/staging/poc/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json
-rw-------. 1 root root 210 Nov 15 13:32 '/local/csi/staging/poc/postgres-data[1]/rw-file-system-single-node-writer/image-meta.json'
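
To make the mismatch concrete, here is a minimal Go sketch. stagingPath is a hypothetical helper, not Nomad's actual code; it only assumes that the staging directory is built by joining the staging root, the namespace, the volume ID, and the usage string, which matches the two paths observed above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// stagingPath joins the pieces of a per-volume staging directory.
// Hypothetical helper for illustration only; not Nomad's implementation.
// filepath.Join ignores empty elements, so an empty namespace silently
// yields a path with no namespace component.
func stagingPath(root, namespace, volumeID, usage string) string {
	return filepath.Join(root, namespace, volumeID, usage)
}

func main() {
	const (
		root  = "/local/csi/staging"
		volID = "postgres-data[1]"
		usage = "rw-file-system-single-node-writer"
	)
	// Path the plugin received in NodeExpandVolume (namespace missing).
	fmt.Println("sent to plugin:", stagingPath(root, "", volID, usage))
	// Path where image-meta.json actually lives.
	fmt.Println("actual on disk:", stagingPath(root, "poc", volID, usage))
}
```

If the namespace is empty at the point where the path is assembled, the result is exactly the path reported in the error, which is why I suspect the namespace is being dropped somewhere on the expand code path.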

I can resize correctly when the volume is not in use. The issue might be related to the fix for this bug

Other CSI plugins may also be affected, but I can only reproduce it with Ceph (I tried democratic-csi iSCSI against a TrueNAS server with no issue).

Reproduction steps

Create a ceph RBD volume in a specific namespace
Run a job using this volume
Try to resize the volume while it's in use

Expected Result

The volume should be resized

Actual Result

The Ceph CSI node fails because it gets an incorrect staging path (from Nomad? I'm not a CSI expert).

The only workaround is to stop the job, do the resize, then start the job again.

tgross commented 2 hours ago

Hi @dani! I was a little surprised to discover that online resize was supported at all! But it looks like Nomad's CSI library version is older than the addition of the Capabilities for VolumeExpansion where plugins can define their capability for online vs offline resize.

What you're seeing is definitely weird, but I can't quite tell at a glance what the issue is. The code in the Nomad client that sends the RPC to the plugin is here in (volumeManager).ExpandVolume. The RPC call that's sent from the server to the client is created here in (CSIVolume).NodeExpand. Not much in the way of logic here.

"poc" is the namespace, right? The only two ways I could see that missing here are:

The volume object in the state store is missing its namespace somehow, so it's empty when we assign it here. You could diagnose that via nomad volume status -namespace poc 'postgres-data[1]' to verify the namespace field is set.
The staging point isn't visible to Nomad here. But I don't see a way for that to happen without Nomad never having been able to mount the volume in the first place!

So this will definitely need more investigation. I'll mark it for a closer look.

dani commented 2 hours ago

Indeed, poc is the namespace where the volume (and the job using it) is created. The namespace is correctly populated when creating the volume


[dbd@laptop-103 ~]$ nomad volume status -namespace poc 'postgres-data[1]'
ID                   = postgres-data[1]
Name                 = postgres-data-1
Namespace            = poc
External ID          = 0001-0024-cbfda0a8-461a-4577-9be1-e229acb2bac5-0000000000000006-1db44304-55c6-4200-854f-d315a86375db
Plugin ID            = rbd.ceph-csi
Provider             = rbd.csi.ceph.com
Version              = v3.12.2
Capacity             = 47 GiB
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 6
Nodes Expected       = 6
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: xfs flags: [REDACTED]
Namespace            = poc

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
73ba6c96  4ae656ea  server      31       run      running  1h58m ago  1h57m ago
