Open plano-fwinkler opened 2 days ago
The issue you posted doesn't include any relevant details: the performance numbers, how you set things up, etc.
Ceph is a complicated subject, and setting it up properly is not trivial.
We have a Proxmox cluster with 5 nodes and a Ceph cluster running on Proxmox. The Ceph cluster uses a 100G NIC.
If I test with kubestr fio against the local-path StorageClass `openebs-hostpath`:

```
./kubestr fio -s openebs-hostpath
PVC created kubestr-fio-pvc-qqb7w
Pod created kubestr-fio-pod-4z7zc
Running FIO test (default-fio) on StorageClass (openebs-hostpath) with a PVC of Size (100Gi)
Elapsed time- 28.089900025s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=49767.750000 BW(KiB/s)=199087
  iops: min=41961 max=61272 avg=49501.585938
  bw(KiB/s): min=167847 max=245088 avg=198006.484375

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=21245.320312 BW(KiB/s)=84993
  iops: min=9028 max=39728 avg=35385.707031
  bw(KiB/s): min=36112 max=158912 avg=141543.125000

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=36891.605469 BW(KiB/s)=4722663
  iops: min=31849 max=45298 avg=36709.964844
  bw(KiB/s): min=4076761 max=5798144 avg=4698881.500000

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=33320.179688 BW(KiB/s)=4265520
  iops: min=17652 max=40996 avg=33119.656250
  bw(KiB/s): min=2259456 max=5247488 avg=4239321.500000

Disk stats (read/write):
  sda: ios=1454972/1046364 merge=0/22 ticks=1907168/1466570 in_queue=3393654, util=29.229431%
```
And with the Ceph block StorageClass `ceph-block` (provisioner `rbd.csi.ceph.com`):

```
./kubestr fio -s ceph-block
PVC created kubestr-fio-pvc-n7m9z
Pod created kubestr-fio-pod-4jnqw
Running FIO test (default-fio) on StorageClass (ceph-block) with a PVC of Size (100Gi)
Elapsed time- 27.566283667s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=242.109741 BW(KiB/s)=983
  iops: min=98 max=496 avg=257.322571
  bw(KiB/s): min=392 max=1987 avg=1030.129028

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=224.676819 BW(KiB/s)=914
  iops: min=2 max=768 avg=264.464294
  bw(KiB/s): min=8 max=3072 avg=1058.357178

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=213.964386 BW(KiB/s)=27884
  iops: min=90 max=462 avg=223.967743
  bw(KiB/s): min=11520 max=59254 avg=28694.708984

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=219.214661 BW(KiB/s)=28548
  iops: min=4 max=704 avg=258.035706
  bw(KiB/s): min=512 max=90112 avg=33048.785156

Disk stats (read/write):
  rbd2: ios=8696/8655 merge=0/267 ticks=2245425/1975831 in_queue=4221257, util=99.504547%
```
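For reference, the `ceph-block` StorageClass is a fairly standard ceph-csi RBD class, roughly like the sketch below (the clusterID, pool and secret names here are placeholders, not the exact values from our cluster):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>        # Ceph fsid as listed in the ceph-csi ConfigMap (placeholder)
  pool: <rbd-pool>               # RBD pool backing the PVCs (placeholder)
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-csi
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Unless a `mounter` parameter is set, ceph-csi maps the images with the in-kernel RBD client (krbd), which is the part that depends on the node's kernel.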
The Talos machine has two NICs; one is used only for communicating with the Ceph monitors (machine config sketched below).
It is working, but I think it is too slow.
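The second NIC is defined in the Talos machine config roughly as follows (a sketch; interface names, addresses and MTU are placeholders, not our exact values):

```yaml
machine:
  network:
    interfaces:
      - interface: eth0          # Kubernetes / pod traffic
        dhcp: true
      - interface: eth1          # dedicated to the Ceph public network (monitors/OSDs)
        dhcp: false
        addresses:
          - 10.10.10.21/24
        mtu: 9000                # only if the Ceph network is jumbo-frame clean end to end
```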
Then you need to dig further to understand why and find the bottleneck. Ceph block storage is certainly expected to be slower than local storage, since it goes over the network, does replication, etc.
You can watch resource utilization to identify the bottleneck.
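Roughly, something like the following can narrow it down (pool name and namespace are placeholders; `kubectl top` requires metrics-server):

```
# Kubernetes side: CPU/memory pressure on the nodes and on the ceph-csi pods
kubectl top nodes
kubectl top pods -n <ceph-csi-namespace>

# Ceph side (on a Proxmox node): health, per-OSD latency, which networks Ceph actually uses
ceph -s
ceph osd perf
grep -E 'public_network|cluster_network' /etc/pve/ceph.conf

# Raw cluster throughput, bypassing Kubernetes/CSI entirely
rados bench -p <rbd-pool> 30 write --no-cleanup
rados bench -p <rbd-pool> 30 rand
rados -p <rbd-pool> cleanup
```

If `rados bench` on the Proxmox side is already slow, the problem is in the Ceph cluster itself (network, OSD media, replication settings); if it is fast there but slow from the pod, the VM networking path between the Talos guest and the Ceph public network is the more likely suspect.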
We are not aware of anything missing from the Talos side, and we do use Ceph a lot ourselves with Talos.
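If you want to rule out the kernel side on your nodes, something along these lines should show whether the RBD client is loaded and whether mapping produces errors (node IP is a placeholder; if rbd is built into the kernel rather than built as a module, it will not appear in /proc/modules):

```
# list loaded modules on a Talos node and look for the kernel RBD client
talosctl -n <node-ip> read /proc/modules | grep -E 'rbd|libceph'

# kernel log lines from RBD map attempts / libceph connection issues
talosctl -n <node-ip> dmesg | grep -iE 'rbd|libceph'
```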
Proxmox with Ceph, and Talos as a VM with Ceph CSI, is much slower than openebs-hostpath. Are there any kernel modules missing?
Environment