vitalif / vitastor

Simplified distributed block and file storage with strong consistency, like in Ceph (repository mirror)
https://vitastor.io

Lower write speeds than Ceph #57

Closed pratclot closed 7 months ago

pratclot commented 7 months ago

Hello! I probably misconfigured something and this is the explanation for my issue, but I am not sure what exactly :)

I am trying to use Vitastor at home with 2 Intel NUCs, 2 consumer NVMe SSDs and Proxmox. I ran the benchmarks with Ceph (default Proxmox config) as suggested here, removed Ceph, installed Vitastor (--disable_data_fsync false and --immediate_commit none) and ran the same tests again (both times I imported the same Debian qcow2 so as not to test an empty image).
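For reference, this is roughly how those two options were applied (a sketch from memory; the config file path and its JSON form are my assumptions rather than an exact copy of my setup):

# OSDs prepared with data fsync enabled, since these consumer NVMes have no power-loss protection
vitastor-disk prepare --disable_data_fsync false /dev/nvme0n1
# and immediate commit disabled, e.g. in /etc/vitastor/vitastor.conf:
# { "immediate_commit": "none" }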

Everything was faster, except for writes with iodepth greater than 1. I do not know a lot about storage benchmarks, but I noticed that importing disks into Vitastor in Proxmox is significantly slower than with Ceph, and this seems to be well reflected by the tests too. Here are some details:

iodepth  rw         bs  iops (Vitastor)  iops (Ceph)
16       write      4M  40.38            194.85
1        write      4M  34.38            forgot to record
128      randwrite  4k  1365.33          9126.15
1        randwrite  4k  282.48           122.82

Can you please help me understand what I am doing wrong here (except for using cheap hardware, hehe)?

The commands just in case I tested something irrelevant :)

fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -image=testimg
fio -ioengine=rbd -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=pool1 -runtime=60 -rbdname=testimg
fio -thread -ioengine=libfio_vitastor.so -name=test -bs=4k -direct=1 -iodepth=128 -rw=randwrite -image=testimg
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=pool1 -runtime=60 -rbdname=testimg

vitalif commented 7 months ago

Hi, something's definitely wrong but it's not easy to tell at once what exactly :) At least I get ~600 MB/s write on a single consumer NVMe on my laptop with 3 OSDs and 2 replicas in tests... First things to check that come to my mind are:

1) Check the "avg latency" numbers in /var/log/vitastor/osd.log during the benchmark: "subop write_stable" = disk latency + 1/2 network latency between primary and secondary (1/2 in your case because 1/2 of writes are handled locally on the primary OSD); "op write_stable" = just disk latency. "op sync" and "subop sync" are also similar.
2) Try to create a pool with only one OSD/disk and run the benchmark on it alone.
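For example, to watch those numbers live while fio is running, something like this is enough (just a filter over the log file mentioned above):

tail -f /var/log/vitastor/osd.log | grep -E 'write_stable|sync|ping'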

Your numbers are too low even on your hardware, and I suspect the network latency is to blame; the checks above can confirm it. But I have no idea why Ceph handles it better, if that's the case ))

pratclot commented 7 months ago

Hey Vitaliy, thank you so much for the response!

I did not mention before that I did not disable power saving on the CPUs, not sure if this could affect something (the goal is to keep the computers as calm as possible while outperforming Ceph).
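For reference, this is the kind of knob I left untouched (generic Linux commands, nothing Vitastor-specific):

# check the current CPU frequency governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# temporarily force all cores to the performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor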

The storage "network" is a Thunderbolt link, iperf reports bandwidth of ~20 Gbps, sockperf results depend on MTU: network MTU reported latency, usec 90p 99.999p
TB 1500 120 160 230
TB 9000 70 80 180
eth 1500 380 420 450
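The iperf/sockperf numbers above were collected roughly like this (flags reconstructed from memory, so treat them as a sketch rather than the exact invocations):

# bandwidth, server on one node and client on the other
iperf3 -s
iperf3 -c <peer-ip>
# latency
sockperf server -i <local-ip>
sockperf ping-pong -i <peer-ip> -t 10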

I briefly tried to run etcd over 2.5G eth network, but it did not seem to improve anything.

Now the OSD logs; this is what they show at rest (the ping value is anything up to 1 ms):

MTU 1500:
[OSD 1] avg latency for op 13 (primary_sync): 7 us
[OSD 1] avg latency for subop 15 (ping): 928 us

MTU 9000:
[OSD 1] avg latency for op 13 (primary_sync): 7 us
[OSD 1] avg latency for subop 15 (ping): 650 us

This appears when I run the 4M Q1 test (I tried to add scrollbars to the table cells, but failed :P ):

MTU 1500:
[OSD 1] avg latency for op 1 (read): 1 us
[OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s
[OSD 1] avg latency for op 4 (sync): 420 us
[OSD 1] avg latency for op 12 (primary_write): 5487 us, B/W: 77.62 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 157448 us
[OSD 1] avg latency for subop 3 (write_stable): 5473 us
[OSD 1] avg latency for subop 4 (sync): 88773 us
[OSD 1] avg latency for subop 15 (ping): 855 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 1 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 3 (write_stable): 798 us, B/W: 66.50 MB/s
[OSD 1] avg latency for op 4 (sync): 1022 us
[OSD 1] avg latency for op 12 (primary_write): 5007 us, B/W: 82.83 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 156388 us
[OSD 1] avg latency for subop 3 (write_stable): 4965 us
[OSD 1] avg latency for subop 4 (sync): 90586 us
[OSD 1] avg latency for subop 15 (ping): 632 us

The same for 4M Q16:

MTU 1500:
[OSD 1] avg latency for op 1 (read): 1 us
[OSD 1] avg latency for op 3 (write_stable): 910 us, B/W: 82.58 MB/s
[OSD 1] avg latency for op 4 (sync): 4686 us
[OSD 1] avg latency for op 12 (primary_write): 15097 us, B/W: 99.88 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 138312 us
[OSD 1] avg latency for subop 3 (write_stable): 15086 us
[OSD 1] avg latency for subop 4 (sync): 72925 us
[OSD 1] avg latency for subop 15 (ping): 218 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 2 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 3 (write_stable): 1700 us, B/W: 80.75 MB/s
[OSD 1] avg latency for op 4 (sync): 6642 us
[OSD 1] avg latency for op 12 (primary_write): 26122 us, B/W: 100.58 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 121532 us
[OSD 1] avg latency for subop 3 (write_stable): 26060 us
[OSD 1] avg latency for subop 4 (sync): 67118 us
[OSD 1] avg latency for subop 15 (ping): 355 us
--------------
once in a while it starts to crawl:
--------------
[OSD 1] avg latency for op 1 (read): 3 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 3 (write_stable): 1991 us, B/W: 4.75 MB/s
[OSD 1] avg latency for op 4 (sync): 2 us
[OSD 1] avg latency for op 12 (primary_write): 283384 us, B/W: 5.92 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 3700118 us
[OSD 1] avg latency for subop 3 (write_stable): 283375 us
[OSD 1] avg latency for subop 4 (sync): 1916428 us
[OSD 1] avg latency for subop 15 (ping): 762 us

Now with one OSD (fio runs on the node without it), 4M Q1:

MTU 1500:
[OSD 1] avg latency for op 1 (read): 0 us
[OSD 1] avg latency for op 3 (write_stable): 825 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 2497 us
[OSD 1] avg latency for op 12 (primary_write): 827 us, B/W: 801.38 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 4862 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 0 us
[OSD 1] avg latency for op 3 (write_stable): 775 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 2034 us
[OSD 1] avg latency for op 12 (primary_write): 777 us, B/W: 783.88 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 3417 us

4M Q16:

MTU 1500:
[OSD 1] avg latency for op 1 (read): 1 us
[OSD 1] avg latency for op 3 (write_stable): 2954 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 8072 us
[OSD 1] avg latency for op 12 (primary_write): 2956 us, B/W: 625.75 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 38833 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 1 us
[OSD 1] avg latency for op 3 (write_stable): 2252 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 6199 us
[OSD 1] avg latency for op 12 (primary_write): 2255 us, B/W: 736.62 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 27965 us

And finally, fio runs on the node with the OSD, 4M Q1:

MTU 1500:
[OSD 1] avg latency for op 1 (read): 0 us
[OSD 1] avg latency for op 3 (write_stable): 952 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 2320 us
[OSD 1] avg latency for op 12 (primary_write): 955 us, B/W: 952.42 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 4724 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 1 us
[OSD 1] avg latency for op 3 (write_stable): 979 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 2992 us
[OSD 1] avg latency for op 12 (primary_write): 982 us, B/W: 951.33 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 6680 us

4M Q16:

MTU 1500:
[OSD 1] avg latency for op 1 (read): 2 us
[OSD 1] avg latency for op 3 (write_stable): 2690 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 6030 us
[OSD 1] avg latency for op 12 (primary_write): 2695 us, B/W: 725.33 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 30820 us

MTU 9000:
[OSD 1] avg latency for op 1 (read): 2 us
[OSD 1] avg latency for op 3 (write_stable): 3044 us, B/W: 0.00 KB/s
[OSD 1] avg latency for op 4 (sync): 7995 us
[OSD 1] avg latency for op 12 (primary_write): 3048 us, B/W: 588.38 MB/s
[OSD 1] avg latency for op 13 (primary_sync): 42606 us

dmesg seems to be clean during testing; one of the etcds spams the journal with:

{"level":"warn","ts":"2024-02-02T14:02:35.966165+0100","caller":"embed/serve.go:331","msg":"error reading websocket message:websocket: close 1006 (abnormal closure): unexpected EOF"}
vitalif commented 7 months ago

> [OSD 1] avg latency for op 3 (write_stable): 588 us, B/W: 61.75 MB/s

absolutely OK for 4M

> [OSD 1] avg latency for subop 3 (write_stable): 5473 us

not OK, network latency is like 10 ms (1/2 of write_stables is handled locally and also counted in this number)

> [OSD 1] avg latency for op 4 (sync): 420 us

again fine

> [OSD 1] avg latency for subop 4 (sync): 88773 us

real shit, network latency is 2*88 ms?!!
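To spell out where these figures come from (a back-of-the-envelope reading of the Q1 numbers above, not exact math): roughly half of the subops are handled locally at about disk latency, so

avg(subop write_stable) ≈ (local + remote) / 2
5.5 ms ≈ (0.6 ms + remote) / 2   =>   remote ≈ 10.4 ms
network share ≈ remote - disk ≈ 10.4 - 0.6 ≈ 10 ms

The same doubling applied to "subop sync" (88.8 ms average vs 0.4 ms for the local "op sync") is where the 2*88 ms estimate comes from.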

TLDR: better try to run OSDs through 2.5G.

vitalif commented 7 months ago

I have no idea how the Thunderbolt network is implemented, but maybe it has some bugs in conjunction with io_uring, which I use, for example.

vitalif commented 7 months ago

> once in a while it starts to crawl
>
> [OSD 1] avg latency for op 1 (read): 3 us, B/W: 0.00 KB/s
> [OSD 1] avg latency for op 3 (write_stable): 1991 us, B/W: 4.75 MB/s
> [OSD 1] avg latency for op 4 (sync): 2 us
> [OSD 1] avg latency for op 12 (primary_write): 283384 us, B/W: 5.92 MB/s
> [OSD 1] avg latency for op 13 (primary_sync): 3700118 us
> [OSD 1] avg latency for subop 3 (write_stable): 283375 us
> [OSD 1] avg latency for subop 4 (sync): 1916428 us
> [OSD 1] avg latency for subop 15 (ping): 762 us

Also not sure what it means, but again: op write_stable is 2 ms while subop write_stable is 284 ms. This is the network again.

pratclot commented 7 months ago

Vitaliy, thanks again for the analysis! I tried the 2.5G network with both Ceph and Vitastor; the speeds were about 2 times lower than with Thunderbolt for both (for all test cases). I guess there is something that affects both network connections in a similar way (not sure how to find it, though).

vitalif commented 7 months ago

Hi, in fact I accidentally found a funny bug which may negatively impact non-immediate_commit performance :D The blockstore issues too many automatic syncs :-) I'll release 1.4.2 soon and ask you to retest :)

vitalif commented 7 months ago

(I'm still not sure if it will help you because network latency numbers are really high in your case and it may still be a thunderbolt bug, but it may also happen to be my bug too :))

vitalif commented 7 months ago

OK I released 1.4.2, please update and retest
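(On Debian/Proxmox that should just be a package upgrade plus an OSD restart on each node; the unit name below is how vitastor-disk usually registers OSDs, so adjust it if yours differ:)

apt update && apt upgrade
systemctl restart vitastor-osd@1   # repeat for each OSD number on the node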

pratclot commented 7 months ago

Hey Vitaliy, thank you for your relentless improvements! I can see that 4M Q1 writes improved (I ran with 1.4.1, then 1.4.2, then 1.4.1 again, and then 1.4.2):

[OSD 2] avg latency for op 1 (read): 1 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 358 us, B/W: 140.54 MB/s
[OSD 2] avg latency for op 4 (sync): 1318 us
[OSD 2] avg latency for op 12 (primary_write): 6088 us, B/W: 142.38 MB/s
[OSD 2] avg latency for op 13 (primary_sync): 24208 us
[OSD 2] avg latency for subop 3 (write_stable): 6024 us
[OSD 2] avg latency for subop 4 (sync): 18950 us
[OSD 2] avg latency for subop 15 (ping): 426 us

For Q16:

[OSD 2] avg latency for op 1 (read): 2 us, B/W: 0.00 KB/s
[OSD 2] avg latency for op 3 (write_stable): 617 us, B/W: 79.38 MB/s
[OSD 2] avg latency for op 4 (sync): 549 us
[OSD 2] avg latency for op 12 (primary_write): 37438 us, B/W: 84.12 MB/s
[OSD 2] avg latency for op 13 (primary_sync): 122871 us
[OSD 2] avg latency for subop 3 (write_stable): 37431 us
[OSD 2] avg latency for subop 4 (sync): 65245 us
[OSD 2] avg latency for subop 15 (ping): 874 us

4k Q128 seems to be 50% faster (it was originally about 7 times slower than Ceph), and 4k Q1 is about 10% better (2.5 times better than Ceph).

The bottom line is the same of course.

vitalif commented 7 months ago

Can you post the new results in a table for easier comparison?

What's the overall result? Did it become faster than Ceph or not? :-)

pratclot commented 7 months ago

Hey Vitaliy, sorry for the late reply. Ceph is still faster for writes with iodepth > 1, but slower for everything else. I went with it this time, so I believe I will need new disks to test anything again.

vitalif commented 7 months ago

Okay... You could leave 50 GB partitions for testing on both nodes :) That's always a good idea with Ceph & SSDs too, by the way :)
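Something along these lines, assuming the NVMes still have free space and that vitastor-disk prepare is given a bare partition; the device and partition names are placeholders and the prepare flag is the same one discussed earlier in this thread:

# carve out a ~50 GB test partition on each node's NVMe
sgdisk -n 0:0:+50G /dev/nvme0n1
# and hand just that partition to Vitastor
vitastor-disk prepare --disable_data_fsync 0 /dev/nvme0n1p5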

vitalif commented 7 months ago

Hi, I finally did some testing with desktop NVMes, in fact I've never done it before :-) The main result I found was that Vitastor requires special settings for desktop SSDs/NVMes:

vitastor-disk prepare --min_flusher_count 32 --max_flusher_count 256 --disable_data_fsync 0 /dev/nvme0n1

It increases T1Q256 write performance from 6000 iops to 38000 iops per disk. I tested it on just 2 hosts * 1 drive (Samsung 970 EVO) each. 1 OSD per disk is sufficient, the OSD only eats ~80% of 1 CPU core. The performance also increases a bit more if you set --autosync_writes 1024, but to do that you currently have to edit the superblock:

vitastor-disk update-sb --autosync_writes 1024 /dev/nvme0n1p1

After this change T1Q256 writes reach 44000 iops :-) Running 2 OSDs per disk almost doesn't improve anything: T1Q256 becomes 46000 iops and linear write also increases a bit, like 1.1 GB/s -> 1.2 GB/s, but that's all. For linear writes it's better to use a larger --block_size (256k or 512k). And given that WA=4, I think these ~45000 iops per disk are actually somewhere near the limit of the drive.

vitalif commented 7 months ago

P.S. I'll consider adding these options as defaults in vitastor-disk for desktop NVMes/SSDs :-)

vitalif commented 7 months ago

Released in 1.4.5. I'm closing this issue for now.