Closed pratclot closed 7 months ago
Hi, something's definitely wrong but it's not easy to tell at once what exactly :) at least I get ~600 MB/s write on a single consumer NVMe on my laptop with 3 OSDs and 2 replicas in tests... First things to check that come to my mind:

1) Check the "avg latency" numbers in /var/log/vitastor/osd.log during the benchmark: "subop write_stable" = disk latency + 1/2 network latency between the primary and secondary OSD (1/2 in your case because 1/2 of writes are handled locally on the primary OSD); "op write_stable" = just disk latency. "op sync" and "subop sync" behave similarly.
2) Try to create a pool with only one OSD/disk and run the benchmark on it alone.
Your numbers are too low even on your hardware and I suspect that it's the network latency to blame, the checks above can confirm it. But I have no idea why Ceph handles it better, if it's the case ))
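As a side note, pulling those "avg latency" numbers out of osd.log by hand gets tedious; here is a minimal sketch of a parser, assuming only the log line format visible in the samples quoted later in this thread (the helper name `parse_latencies` is mine, not part of Vitastor):

```python
import re

# Matches lines like:
#   [OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s
#   [OSD 1] avg latency for subop 3 (write_stable): 5473 us
LINE = re.compile(
    r"\[OSD (\d+)\] avg latency for (op|subop) \d+ \((\w+)\): (\d+) us"
)

def parse_latencies(log_text):
    """Return {(kind, name): usec} for every 'avg latency' line found."""
    out = {}
    for m in LINE.finditer(log_text):
        _osd, kind, name, usec = m.groups()
        out[(kind, name)] = int(usec)
    return out

sample = """\
[OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s
[OSD 1] avg latency for subop 3 (write_stable): 5473 us
"""
lat = parse_latencies(sample)
print(lat[("op", "write_stable")], lat[("subop", "write_stable")])
```

With the op/subop pairs extracted like this, comparing disk latency against disk + network latency across benchmark runs becomes a one-liner.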
Hey Vitaliy, thank you so much for the response!
I did not mention before that I did not disable power saving on the CPU, not sure if this could affect something (the goal is to keep the computers as calm as possible while outperforming Ceph).
The storage "network" is a Thunderbolt link, iperf reports a bandwidth of ~20 Gbps, sockperf results depend on MTU:

network | MTU | reported latency, usec | 90p | 99.999p
---|---|---|---|---
TB | 1500 | 120 | 160 | 230
TB | 9000 | 70 | 80 | 180
eth | 1500 | 380 | 420 | 450
I briefly tried to run `etcd` over the 2.5G eth network, but it did not seem to improve anything.
Now the OSD logs, this is what it shows at rest (the ping value is anything up to 1ms):
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 13 (primary_sync): 7 us<br>[OSD 1] avg latency for subop 15 (ping): 928 us | [OSD 1] avg latency for op 13 (primary_sync): 7 us<br>[OSD 1] avg latency for subop 15 (ping): 650 us
This appears when I run the 4M Q1 test (I tried to add scrollbars to table cells, but failed :P ):
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 1 us<br>[OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s<br>[OSD 1] avg latency for op 4 (sync): 420 us<br>[OSD 1] avg latency for op 12 (primary_write): 5487 us, B/W: 77.62 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 157448 us<br>[OSD 1] avg latency for subop 3 (write_stable): 5473 us<br>[OSD 1] avg latency for subop 4 (sync): 88773 us<br>[OSD 1] avg latency for subop 15 (ping): 855 us | [OSD 1] avg latency for op 1 (read): 1 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 3 (write_stable): 798 us, B/W: 66.50 MB/s<br>[OSD 1] avg latency for op 4 (sync): 1022 us<br>[OSD 1] avg latency for op 12 (primary_write): 5007 us, B/W: 82.83 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 156388 us<br>[OSD 1] avg latency for subop 3 (write_stable): 4965 us<br>[OSD 1] avg latency for subop 4 (sync): 90586 us<br>[OSD 1] avg latency for subop 15 (ping): 632 us
The same for 4M Q16:
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 1 us<br>[OSD 1] avg latency for op 3 (write_stable): 910 us, B/W: 82.58 MB/s<br>[OSD 1] avg latency for op 4 (sync): 4686 us<br>[OSD 1] avg latency for op 12 (primary_write): 15097 us, B/W: 99.88 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 138312 us<br>[OSD 1] avg latency for subop 3 (write_stable): 15086 us<br>[OSD 1] avg latency for subop 4 (sync): 72925 us<br>[OSD 1] avg latency for subop 15 (ping): 218 us | [OSD 1] avg latency for op 1 (read): 2 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 3 (write_stable): 1700 us, B/W: 80.75 MB/s<br>[OSD 1] avg latency for op 4 (sync): 6642 us<br>[OSD 1] avg latency for op 12 (primary_write): 26122 us, B/W: 100.58 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 121532 us<br>[OSD 1] avg latency for subop 3 (write_stable): 26060 us<br>[OSD 1] avg latency for subop 4 (sync): 67118 us<br>[OSD 1] avg latency for subop 15 (ping): 355 us<br>-------------- once in a while it starts to crawl --------------<br>[OSD 1] avg latency for op 1 (read): 3 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 3 (write_stable): 1991 us, B/W: 4.75 MB/s<br>[OSD 1] avg latency for op 4 (sync): 2 us<br>[OSD 1] avg latency for op 12 (primary_write): 283384 us, B/W: 5.92 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 3700118 us<br>[OSD 1] avg latency for subop 3 (write_stable): 283375 us<br>[OSD 1] avg latency for subop 4 (sync): 1916428 us<br>[OSD 1] avg latency for subop 15 (ping): 762 us
Now with one OSD, `fio` runs on the node without it, 4M Q1:
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 0 us<br>[OSD 1] avg latency for op 3 (write_stable): 825 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 2497 us<br>[OSD 1] avg latency for op 12 (primary_write): 827 us, B/W: 801.38 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 4862 us | [OSD 1] avg latency for op 1 (read): 0 us<br>[OSD 1] avg latency for op 3 (write_stable): 775 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 2034 us<br>[OSD 1] avg latency for op 12 (primary_write): 777 us, B/W: 783.88 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 3417 us
4M Q16:
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 1 us<br>[OSD 1] avg latency for op 3 (write_stable): 2954 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 8072 us<br>[OSD 1] avg latency for op 12 (primary_write): 2956 us, B/W: 625.75 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 38833 us | [OSD 1] avg latency for op 1 (read): 1 us<br>[OSD 1] avg latency for op 3 (write_stable): 2252 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 6199 us<br>[OSD 1] avg latency for op 12 (primary_write): 2255 us, B/W: 736.62 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 27965 us
And finally `fio` runs on the node with the OSD, 4M Q1:
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 0 us<br>[OSD 1] avg latency for op 3 (write_stable): 952 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 2320 us<br>[OSD 1] avg latency for op 12 (primary_write): 955 us, B/W: 952.42 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 4724 us | [OSD 1] avg latency for op 1 (read): 1 us<br>[OSD 1] avg latency for op 3 (write_stable): 979 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 2992 us<br>[OSD 1] avg latency for op 12 (primary_write): 982 us, B/W: 951.33 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 6680 us
4M Q16:
MTU 1500 | MTU 9000
---|---
[OSD 1] avg latency for op 1 (read): 2 us<br>[OSD 1] avg latency for op 3 (write_stable): 2690 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 6030 us<br>[OSD 1] avg latency for op 12 (primary_write): 2695 us, B/W: 725.33 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 30820 us | [OSD 1] avg latency for op 1 (read): 2 us<br>[OSD 1] avg latency for op 3 (write_stable): 3044 us, B/W: 0.00 KB/s<br>[OSD 1] avg latency for op 4 (sync): 7995 us<br>[OSD 1] avg latency for op 12 (primary_write): 3048 us, B/W: 588.38 MB/s<br>[OSD 1] avg latency for op 13 (primary_sync): 42606 us
`dmesg` seems to be clean during testing, one of the `etcd`s spams the journal with:
```
{"level":"warn","ts":"2024-02-02T14:02:35.966165+0100","caller":"embed/serve.go:331","msg":"error reading websocket message:websocket: close 1006 (abnormal closure): unexpected EOF"}
```
> [OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s

absolutely OK for 4M

> [OSD 1] avg latency for subop 3 (write_stable): 5473 us

not OK, network latency is like 10 ms (1/2 of write_stables are handled locally and also counted in this number)

> [OSD 1] avg latency for op 4 (sync): 420 us

again fine

> [OSD 1] avg latency for subop 4 (sync): 88773 us

real shit, network latency is 2*88 ms?!!
TLDR: better try to run OSDs through 2.5G.
I have no idea how Thunderbolt networking is implemented, but maybe it has some bugs in conjunction with io_uring which I use, for example:
> once in a while it starts to crawl

```
[OSD 1] avg latency for op 1 (read): 3 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 3 (write_stable): 1991 us, B/W: 4.75 MB/s
[OSD 1] avg latency for op 4 (sync): 2 us
[OSD 1] avg latency for op 12 (primary_write): 283384 us, B/W: 5.92 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 3700118 us
[OSD 1] avg latency for subop 3 (write_stable): 283375 us
[OSD 1] avg latency for subop 4 (sync): 1916428 us
[OSD 1] avg latency for subop 15 (ping): 762 us
```
Also not sure what it means, but again `op write_stable` is 2 ms and `subop write_stable` is 284 ms. This is the network again.
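To make the arithmetic behind these observations explicit, here is a sketch using the model stated earlier in the thread (half of the subops are handled locally, so the average subop latency is roughly disk + net/2):

```python
# Numbers taken from the 4M Q1 logs above, in microseconds.
op_write_stable = 588      # "op write_stable": pure disk latency
subop_write_stable = 5473  # "subop write_stable": disk + 1/2 network latency

# With half of the subops local, avg subop = disk + net/2,
# so the network round trip is roughly:
net_write = 2 * (subop_write_stable - op_write_stable)
print(net_write)  # 9770 us, i.e. the "like 10 ms" above

op_sync = 420
subop_sync = 88773
net_sync = 2 * (subop_sync - op_sync)
print(net_sync)  # 176706 us, i.e. the "2*88 ms" above
```

This is only an estimate under that 50/50 local-versus-remote assumption, but it reproduces both of the figures quoted in the discussion.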
Vitaliy, thanks again for the analysis! I tried 2.5G network with both Ceph and Vitastor, the speeds were about 2 times lower than with Thunderbolt for both (for all test cases). I guess there is something that affects both network connections in a similar way (not sure how to find it though).
Hi, in fact I accidentally found a funny bug which may negatively impact non-immediate_commit performance :D The blockstore issues too many automatic syncs :-) I'll release 1.4.2 soon and ask you to retest :)
(I'm still not sure if it will help you because network latency numbers are really high in your case and it may still be a thunderbolt bug, but it may also happen to be my bug too :))
OK I released 1.4.2, please update and retest
Hey Vitaliy, thank you for your relentless improvements! I can see that 4M Q1 writes improved (ran with 1.4.1, then 1.4.2, then 1.4.1 again and then 1.4.2):
```
[OSD 2] avg latency for op 1 (read): 1 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 358 us, B/W: 140.54 MB/s
[OSD 2] avg latency for op 4 (sync): 1318 us
[OSD 2] avg latency for op 12 (primary_write): 6088 us, B/W: 142.38 MB/s
[OSD 2] avg latency for op 13 (primary_sync): 24208 us
[OSD 2] avg latency for subop 3 (write_stable): 6024 us
[OSD 2] avg latency for subop 4 (sync): 18950 us
[OSD 2] avg latency for subop 15 (ping): 426 us
```
For Q16:
```
[OSD 2] avg latency for op 1 (read): 2 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 617 us, B/W: 79.38 MB/s
[OSD 2] avg latency for op 4 (sync): 549 us
[OSD 2] avg latency for op 12 (primary_write): 37438 us, B/W: 84.12 MB/s
[OSD 2] avg latency for op 13 (primary_sync): 122871 us
[OSD 2] avg latency for subop 3 (write_stable): 37431 us
[OSD 2] avg latency for subop 4 (sync): 65245 us
[OSD 2] avg latency for subop 15 (ping): 874 us
```
4k Q128 seems to be 50% faster (it was originally about 7 times slower than Ceph), 4k Q1 is about 10% better (2.5 times better than Ceph).
The bottom line is the same of course.
Can you post the new results in a table for easier comparison?
What's the overall result? Did it become faster than Ceph or not? :-)
Hey Vitaliy, sorry for the late reply. Ceph is still faster for writes with iodepth > 1, but slower for everything else. I went with Ceph this time, so to test something again I will need new disks, I believe.
Okay... You could leave 50 GB partitions for testing on both nodes :) that's always a good idea with Ceph & SSDs too, by the way :)
Hi I finally did some testing with desktop nvmes, in fact I've never done it before :-)
The main result I found was that Vitastor requires special settings for desktop SSD/NVMes:
```
vitastor-disk prepare --min_flusher_count 32 --max_flusher_count 256 --disable_data_fsync 0 /dev/nvme0n1
```
It increases T1Q256 write performance from 6000 iops to 38000 iops per 1 disk. I tested it on just 2 hosts * 1 drive (Samsung 970 EVO) each. 1 OSD per disk is sufficient, OSD only eats ~80% of 1 CPU core.
Also the performance increases a bit more if you also set `--autosync_writes 1024`, but to do that you currently have to edit the superblock:

```
vitastor-disk update-sb --autosync_writes 1024 /dev/nvme0n1p1
```
After this change T1Q256 writes reach 44000 iops. :-)
Running 2 OSDs per disk almost doesn't improve anything. T1Q256 becomes 46000 iops, linear write also increases a bit, like 1.1 GB/s -> 1.2 GB/s, but that's all. For linear writes it's better to use larger --block_size (256k or 512k).
And given that WA=4 I think these ~45000 iops per disk are actually somewhere near the limit of the drive.
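A rough back-of-the-envelope check of that claim (my own arithmetic, assuming WA=4 means each 4k client write turns into roughly 4 writes on the drive):

```python
client_iops = 45000  # 4k random write iops seen by the client (from above)
wa = 4               # write amplification stated above

# Drive-side load implied by the client-side iops:
drive_iops = client_iops * wa
drive_mib_s = drive_iops * 4096 / (1024 * 1024)
print(drive_iops, drive_mib_s)  # 180000 iops, 703.125 MiB/s hitting the drive
```

~180k sustained 4k write iops is indeed in the neighborhood of what a consumer NVMe can keep up once its SLC cache is exhausted, which supports the "near the limit of the drive" conclusion.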
P.S. I'll consider making these options the default in vitastor-disk for desktop NVMe/SSDs :-)
Released in 1.4.5. I'm closing this issue for now.
Hello! I probably misconfigured something and this is the explanation for my issue, but I am not sure what exactly :)
I am trying to use Vitastor at home with 2 Intel NUCs, 2 consumer NVMe SSDs and Proxmox. I ran the benchmarks with Ceph (default Proxmox config) as suggested here, removed Ceph, installed Vitastor (`--disable_data_fsync false` and `--immediate_commit none`) and ran the same tests again (both times I imported the same Debian qcow2 so as not to test an empty image).

Everything was faster, except for writes with `iodepth` greater than 1. I do not know a lot about storage benchmarks, but I noticed that importing disks to Vitastor in Proxmox is significantly slower than with Ceph, and this seems to be well reflected by the tests too. Here are some details:

Can you please help me understand what I am doing wrong here (except for using cheap hardware, hehe)?
The commands just in case I tested something irrelevant :)
```
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
fio -ioengine=rbd -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=pool1 -runtime=60 -rbdname=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4k -direct=1 -iodepth=128 -rw=randwrite -image=testimg
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=pool1 -runtime=60 -rbdname=testimg
```