Open dani opened 3 hours ago
Hi @dani! I was a little surprised to discover that online resize was supported at all! But it looks like Nomad's CSI library version is older than the addition of the Capabilities for VolumeExpansion, where plugins can define their capability for online vs offline resize.
What you're seeing is definitely weird, but I can't quite tell at a glance what the issue is. The code in the Nomad client that sends the RPC to the plugin is here in (volumeManager).ExpandVolume. The RPC call that's sent from the server to the client is created here in (CSIVolume).NodeExpand. Not much in the way of logic here.
"poc" is the namespace, right? The only two ways I could see that missing here are:
nomad volume status -namespace poc 'postgres-data[1]'
to verify the namespace field is set.
So this will definitely need more investigation. I'll mark it for a closer look.
Indeed, poc is the namespace where the volume (and the job using it) is created. The namespace is correctly populated when creating the volume:
[dbd@laptop-103 ~]$ nomad volume status -namespace poc 'postgres-data[1]'
ID = postgres-data[1]
Name = postgres-data-1
Namespace = poc
External ID = 0001-0024-cbfda0a8-461a-4577-9be1-e229acb2bac5-0000000000000006-1db44304-55c6-4200-854f-d315a86375db
Plugin ID = rbd.ceph-csi
Provider = rbd.csi.ceph.com
Version = v3.12.2
Capacity = 47 GiB
Schedulable = true
Controllers Healthy = 1
Controllers Expected = 1
Nodes Healthy = 6
Nodes Expected = 6
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = fs_type: xfs flags: [REDACTED]
Namespace = poc
Allocations
ID Node ID Task Group Version Desired Status Created Modified
73ba6c96 4ae656ea server 31 run running 1h58m ago 1h57m ago
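For anyone who wants to script the namespace check, here is a minimal sketch that parses the `Key = Value` lines of the status output above. The `parse_status` helper is hypothetical (it is not part of Nomad), and the sample text is copied from the output in this comment rather than fetched from a live cluster.

```python
# Hypothetical helper, not part of Nomad: parses the "Key = Value" lines
# printed by `nomad volume status` into a dict.
def parse_status(text):
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "=", e.g. the allocations table
            fields[key.strip()] = value.strip()
    return fields


# Sample copied from the output quoted in this issue (shortened).
SAMPLE = """\
ID = postgres-data[1]
Name = postgres-data-1
Namespace = poc
Plugin ID = rbd.ceph-csi
Access Mode = single-node-writer
Attachment Mode = file-system
Allocations
ID Node ID Task Group Version Desired Status Created Modified
"""

fields = parse_status(SAMPLE)
print(fields["ID"], "is registered in namespace", fields["Namespace"])
```

Duplicate keys keep the last value seen, which is fine for a quick check like this one.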
Nomad version
Operating system and Environment details
AlmaLinux 9.4
Using Nomad from pre-built Linux AMD64 binaries
Ceph CSI 3.12.2
Issue
Most operations with Ceph RBD volumes are working (so I guess my setup is correct), except for one thing: trying to resize a volume while it's in use (by altering min_capacity and max_capacity, then registering the volume again with nomad volume register volume.hcl). For example, if I try to resize the postgres-data[1] volume in the "poc" namespace:
Logs from the corresponding ceph-csi node show the same error
The problem is that the CSI node gets the staging path as
/local/csi/staging/postgres-data[1]/rw-file-system-single-node-writer/
but the real staging path is
/local/csi/staging/poc/postgres-data[1]/rw-file-system-single-node-writer/
(the name of the namespace the volume is registered in is missing).
Inside the Ceph RBD node
I can resize correctly when the volume is not in use. The issue might be related to the fix for this bug
Maybe other CSI plugins are also affected, but I can reproduce it only with Ceph (I tried democratic-csi iSCSI against a TrueNAS server with no issue).
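To make the staging path mismatch described above concrete, here is a small sketch that builds both paths and reports which segment is missing. This is not Nomad's code: `staging_path` and `missing_segments` are hypothetical helpers, and the layout is only inferred from the two paths quoted in this issue.

```python
from pathlib import PurePosixPath

ROOT = "/local/csi/staging"
USAGE = "rw-file-system-single-node-writer"


# Hypothetical helper, NOT Nomad's implementation: it only mirrors the
# layout of the two paths quoted in this issue.
def staging_path(root, volume_id, usage, namespace=None):
    parts = [root]
    if namespace:
        parts.append(namespace)
    parts += [volume_id, usage]
    return str(PurePosixPath(*parts)) + "/"


# Hypothetical helper: path segments present in `real` but absent from `sent`.
def missing_segments(sent, real):
    sent_parts = PurePosixPath(sent).parts
    return [p for p in PurePosixPath(real).parts if p not in sent_parts]


# Path the CSI node receives during the online resize (no namespace).
sent = staging_path(ROOT, "postgres-data[1]", USAGE)
# Path where the volume is actually staged (namespace included).
real = staging_path(ROOT, "postgres-data[1]", USAGE, namespace="poc")

print(sent)
print(real)
print("missing:", missing_segments(sent, real))
```

The only difference between the two is the `poc` segment, which matches the theory that the namespace is dropped somewhere on the expand path.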
Reproduction steps
Expected Result
The volume should be resized
Actual Result
Ceph CSI node fails as it gets an incorrect staging path (from Nomad? I'm not a CSI expert).
The only workaround is to stop the job, do the resize, then start the job again.
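The workaround above could be scripted roughly as follows. This is a dry-run sketch that only prints the commands. The job name `postgres` and the job file `postgres.nomad.hcl` are placeholders (they are not from my setup), and it assumes the job file and volume.hcl both set the namespace themselves.

```python
import shlex


# Hypothetical helper: returns the CLI calls for the stop / resize / start
# workaround, in order. It does not run anything.
def offline_resize_commands(namespace, job_name, job_file, volume_file):
    return [
        # 1. Stop the job so the volume is no longer in use.
        ["nomad", "job", "stop", "-namespace", namespace, job_name],
        # 2. Register the volume again with the new min/max capacity.
        ["nomad", "volume", "register", volume_file],
        # 3. Start the job again (namespace assumed to be set in the job file).
        ["nomad", "job", "run", job_file],
    ]


# Placeholder job name and job file, used for illustration only.
commands = offline_resize_commands(
    "poc", "postgres", "postgres.nomad.hcl", "volume.hcl"
)
for cmd in commands:
    print(shlex.join(cmd))
```

In practice you would probably want to wait until the allocation has fully stopped before running the register step.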