vitalif / vitastor

Simplified distributed block and file storage with strong consistency, like in Ceph (repository mirror)
https://vitastor.io

[performance] Performance degradation above 32k #22

Open lnsyyj opened 2 years ago

lnsyyj commented 2 years ago

Hi @vitalif , I tested the performance of vitastor without RDMA.

Below 32k (4k, 8k, 16k), performance is higher than Ceph and latency is lower. At 32k and above (32k, 64k, 128k, 256k, 512k, 1m), performance is lower than Ceph and latency is higher. Both random and sequential I/O have been tested, and the results are the same.

Does 32k have a special meaning? Can I adjust some parameters to improve performance at 32k and above?

vitalif commented 2 years ago

I don't think so... There is a special meaning for 128k - it's the block (object) size, i.e. writes smaller than 128k are first written to the journal (so they result in additional write amplification), and writes of 128k and larger are written using redirect-write. 32k-1M performance lower than Ceph is strange, can you provide any details?
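
As a back-of-the-envelope illustration of that extra write amplification (illustrative arithmetic, not a measured figure): a 32k write below the 128k object size is journaled first and later flushed into its place inside the object, so the same data reaches the SSD at least twice plus metadata updates, i.e. roughly 2x write amplification or more; a 128k or larger write is redirect-written once to a fresh location.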

lnsyyj commented 2 years ago

I have 3 physical machines, each with 1 NVMe; the 3 machines form an etcd cluster. The vitastor pool uses 3 replicas (16 PGs). I use libvirt/QEMU to create an Ubuntu 18.04 virtual machine and create two inodes in vitastor (100GB each), one as the system disk and the other as the fio performance test disk. fio is run inside the virtual machine against the test disk.
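
For reference, a replicated pool like this might be defined roughly as follows via the etcd v3 API (a sketch only - it assumes the default /vitastor etcd prefix and the pool-config layout from the vitastor docs; the pool id, name and endpoint are examples):

etcdctl --endpoints=http://192.168.1.217:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":3,"pg_minsize":2,"pg_count":16}}'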

OSD parameters

root@vita-3:~# systemctl cat vitastor-osd3.service 
# /etc/systemd/system/vitastor-osd3.service
[Unit]
Description=Vitastor object storage daemon osd.3
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \
    --etcd_address 192.168.1.217:2379/v3 \
    --bind_address 192.168.1.217 \
    --osd_num 3 \
    --disable_data_fsync 1 \
    --flusher_count 8 \
    --immediate_commit all \
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \
    --journal_no_same_sector_overwrites true \
    --journal_sector_buffer_count 1024 \
    --data_device /dev/disk/by-partuuid/2e2c938b-5f88-4354-8db6-7bac132b5b67 \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 798183424
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-partuuid/2e2c938b-5f88-4354-8db6-7bac132b5b67
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=vitastor.target

fio parameters

root@yujiang:~# cat run_fio_multhread_iscsi.sh 
#!/bin/bash
set -x

#SHELL_FOLDER=$(dirname "$0")
#source ${SHELL_FOLDER}/common_variable_iscsi

IOENGINE="libaio"

FILE_NAME="/dev/vdb"

RW=(randwrite randread)
BS=(4k 8k 16k 32k 64k 128k 256k 512k 1m)

NUMJOBS=16
DIRECT=1
SIZE="100g"
IODEPTH=4
RUNTIME=300

for rw_name in ${RW[@]}
do
  for bs_size in ${BS[@]}
  do
    fio -ioengine=${IOENGINE} -numjobs=${NUMJOBS} -direct=${DIRECT} -size=${SIZE} -iodepth=${IODEPTH} -runtime=${RUNTIME} -rw=${rw_name} -ba=${bs_size} -bs=${bs_size} -filename=${FILE_NAME} -name="iscsi_${IOENGINE}_${IODEPTH}_${rw_name}_${bs_size}" -group_reporting > iscsi_${IOENGINE}_${IODEPTH}_${rw_name}_${bs_size}.log
    sleep 60
  done
done
lnsyyj commented 2 years ago

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 19.7k 80.7MB/s 3243.62 (usec)
100GB 4K randread 40.7k 167MB/s 1566.41 (usec)
100GB 8K randwrite 17.6k 145MB/s 3623.02 (usec)
100GB 8K randread 39.8k 326MB/s 1602.89 (usec)
100GB 16K randwrite 16.2k 266MB/s 3940.96 (usec)
100GB 16K randread 33.4k 548MB/s 1910.04 (usec)
100GB 32K randwrite 14.4k 471MB/s 4451.7 (usec)
100GB 32K randread 28.1k 919MB/s 2276.41 (usec)
100GB 64K randwrite 12.6k 825MB/s 5076.83 (usec)
100GB 64K randread 16.0k 1049MB/s 3993.57 (usec)
100GB 128K randwrite 6919 907MB/s 9243.13 (usec)
100GB 128K randread 8002 1049MB/s 7991.51 (usec)
100GB 256K randwrite 3521 923MB/s 18.17 (msec)
100GB 256K randread 4001 1049MB/s 15987.12 (usec)
100GB 512K randwrite 1741 913MB/s 36.73 (msec)
100GB 512K randread 2000 1049MB/s 31979.61 (usec)
100GB 1M randwrite 871 914MB/s 73.43 (msec)
100GB 1M randread 1000 1049MB/s 63964.36 (usec)

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 30.2k 124MB/s 2113.36 (usec)
100GB 4K randread 46.2k 189MB/s 1379.11 (usec)
100GB 8K randwrite 27.1k 222MB/s 2357.34 (usec)
100GB 8K randread 44.4k 364MB/s 1435.87 (usec)
100GB 16K randwrite 20.6k 337MB/s 3103.98 (usec)
100GB 16K randread 35.9k 589MB/s 1776.66 (usec)
100GB 32K randwrite 10.9k 358MB/s 5851.9 (usec)
100GB 32K randread 24.3k 797MB/s 2621.98 (usec)
100GB 64K randwrite 6020 395MB/s 10618.76 (usec)
100GB 64K randread 14.3k 937MB/s 4466.6 (usec)
100GB 128K randwrite 4971 652MB/s 12861.98 (usec)
100GB 128K randread 7984 1047MB/s 8005.8 (usec)
100GB 256K randwrite 2477 649MB/s 25819.94 (usec)
100GB 256K randread 3556 932MB/s 17988.72 (usec)
100GB 512K randwrite 1223 641MB/s 52.3 (msec)
100GB 512K randread 1733 909MB/s 36920.56 (usec)
100GB 1M randwrite 622 653MB/s 102.72 (msec)
100GB 1M randread 927 973MB/s 68.97 (msec)
vitalif commented 2 years ago

First of all, I think you should remove the --flusher_count option. In new versions it doesn't require manual tuning, because it's auto-tuned between 1 and max_flusher_count, and --flusher_count actually sets max_flusher_count. So can you remove the option and retest?

lnsyyj commented 2 years ago

First of all, I think you should remove the --flusher_count option. In new versions it doesn't require manual tuning, because it's auto-tuned between 1 and max_flusher_count, and --flusher_count actually sets max_flusher_count. So can you remove the option and retest?

Thank you. These results were actually obtained without the flusher_count parameter; it has not been tested with this parameter added.

lnsyyj commented 2 years ago

The following are the test results of T1Q1

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3012 12.3MB/s 327.57 (usec)
100GB 4K randread 2997 12.3MB/s 329.13 (usec)
100GB 8K randwrite 2189 17.9MB/s 451.76 (usec)
100GB 8K randread 2135 17.5MB/s 463.5 (usec)
100GB 16K randwrite 1859 30.5MB/s 532.83 (usec)
100GB 16K randread 1681 27.6MB/s 589.69 (usec)
100GB 32K randwrite 1499 49.1MB/s 661.65 (usec)
100GB 32K randread 1491 48.9MB/s 665.35 (usec)
100GB 64K randwrite 1195 78.3MB/s 831.32 (usec)
100GB 64K randread 1398 91.6MB/s 710.29 (usec)
100GB 128K randwrite 1204 158MB/s 824.82 (usec)
100GB 128K randread 1144 150MB/s 868.99 (usec)
100GB 256K randwrite 1017 267MB/s 977.98 (usec)
100GB 256K randread 1070 281MB/s 928.77 (usec)
100GB 512K randwrite 747 392MB/s 1333.52 (usec)
100GB 512K randread 973 510MB/s 1022.11 (usec)
100GB 1M randwrite 484 509MB/s 2056.96 (usec)
100GB 1M randread 774 812MB/s 1285.82 (usec)

In the README your latency reaches 0.14ms; how can I achieve the same result?

T1Q1 write: 7087 iops (0.14ms latency)
T1Q1 read: 6838 iops (0.145ms latency)

Is the difference between my 0.32ms and your 0.14ms related to RDMA?

vitalif commented 2 years ago

Best latency is achieved when you disable powersaving: cpupower idle-set -D 0 && cpupower frequency-set --governor performance. You can also pin each vitastor OSD to 1 core and disable powersaving only for those cores.
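
A sketch of the per-core variant (the core number, PID selection and systemd drop-in are examples/assumptions, not vitastor-specific settings):

# disable deep C-states and force the performance governor on core 2 only
cpupower -c 2 idle-set -D 0
cpupower -c 2 frequency-set --governor performance
# pin the OSD process to that core (adjust the PID selection if several OSDs run on the host),
# or set CPUAffinity=2 in a [Service] drop-in for the vitastor-osd unit shown above
taskset -pc 2 $(pidof vitastor-osd)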

If it doesn't help then your network may be slow

Regarding writes, I still think your performance is too low. What if you run 1 fio job with rw=write bs=4M iodepth=4?
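
For example, reusing the values from the script above, such a run might look like this (a sketch only; the job name and runtime are arbitrary, /dev/vdb is taken from your setup):

fio -ioengine=libaio -direct=1 -rw=write -bs=4M -iodepth=4 -numjobs=1 -runtime=60 -filename=/dev/vdb -name=seqwrite_4m_q4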

vitalif commented 2 years ago

0.14 was achieved without rdma

lnsyyj commented 2 years ago

After enabling CPU performance mode, the effect is not obvious.

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3317 13.6MB/s 297.46 (usec)
100GB 4K randread 3143 12.9MB/s 313.77 (usec)
100GB 8K randwrite 2207 18.1MB/s 447.97 (usec)
100GB 8K randread 2268 18.6MB/s 435.96 (usec)
100GB 16K randwrite 1894 31.0MB/s 522.65 (usec)
100GB 16K randread 1755 28.8MB/s 564.65 (usec)
100GB 32K randwrite 1558 51.1MB/s 636.63 (usec)
100GB 32K randread 1548 50.7MB/s 640.94 (usec)
100GB 64K randwrite 1233 80.9MB/s 805.39 (usec)
100GB 64K randread 1422 93.3MB/s 697.73 (usec)
100GB 128K randwrite 1226 161MB/s 809.83 (usec)
100GB 128K randread 1170 153MB/s 848.97 (usec)
100GB 256K randwrite 1019 267MB/s 975.74 (usec)
100GB 256K randread 1158 304MB/s 858.06 (usec)
100GB 512K randwrite 768 403MB/s 1295.93 (usec)
100GB 512K randread 981 515MB/s 1013.64 (usec)
100GB 1M randwrite 507 532MB/s 1966.59 (usec)
100GB 1M randread 768 806MB/s 1296.3 (usec)
  1. Could the difference in CPU models explain why my performance is lower than your result? My CPU is
    2x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (10 cores @ 2.80GHz)

    and your CPU is

    2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
  2. Should each NVMe be split into 2 OSDs?
vitalif commented 2 years ago

Yes, 2 OSDs per NVMe should help bandwidth with large block sizes. The Vitastor OSD is single-threaded, so you're probably just hitting the single-thread bandwidth limit due to memory copying. RDMA would also help because it eliminates 1 of the 2 extra memory copies. DPDK could also help, but I haven't found usable TCP (or other) implementations for it, so it's not possible to implement yet.

Regarding latency, 3300 T1Q1 iops is still rather low; I'd expect at least 5000-5500 iops on average hardware. CPU model doesn't affect latency that much. Given that turning CPU powersaving off has no noticeable effect, I think it just doesn't work - something probably overrides that setting and it doesn't get applied. For example, RHEL's tuned may override it. You can also tune interrupts & NUMA, and enable/disable irqbalance and so on. In one of my tests, simply enabling irqbalance raised the bandwidth 3x :-)
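
A rough sketch of the 2-OSDs-per-NVMe split (the device name and 50/50 split are examples; this wipes the drive, so only do it on a disk you are re-initializing):

parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd-a 0% 50%
parted -s /dev/nvme0n1 mkpart osd-b 50% 100%
# then add a second vitastor-osd unit like the one above, pointing --data_device at the second
# partition's /dev/disk/by-partuuid/... link, with its own --osd_num and freshly initialized
# journal/meta/data offsets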

vitalif commented 2 years ago

You can also check your network with sockperf. sockperf sr --tcp on one node, sockperf pp -i SERVER_IP --tcp -m 4096 on the other one. Note that it reports half of the roundtrip (one-way latency).

lnsyyj commented 2 years ago

In fact, IOPS is 3 times higher with CPU performance mode turned on than with it off. I had already turned it on before; both of the results above were measured with CPU performance mode enabled.

What should I do about interrupts, NUMA, and enabling/disabling irqbalance? I am not very familiar with this area.

sockperf result

sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 192.168.1.216   PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=9605; ReceivedMessages=9604
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=4265; ReceivedMessages=4265
sockperf: ====> avg-latency=64.476 (std-dev=1.374)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 64.476 usec
sockperf: Total 4265 observations; each percentile contains 42.65 observations
sockperf: ---> <MAX> observation =   76.196
sockperf: ---> percentile 99.999 =   76.196
sockperf: ---> percentile 99.990 =   76.196
sockperf: ---> percentile 99.900 =   74.508
sockperf: ---> percentile 99.000 =   68.758
sockperf: ---> percentile 90.000 =   65.041
sockperf: ---> percentile 75.000 =   64.607
sockperf: ---> percentile 50.000 =   64.467
sockperf: ---> percentile 25.000 =   64.329
sockperf: ---> <MIN> observation =   53.082
vitalif commented 2 years ago

You have 65 us one-way latency, so your roundtrip time is 130 us. Theoretical performance of vitastor for replicated writes is 2 RTT + 1 disk write = 260 us + disk. Disk is, I think, around 10-40 us. So around 270-300 us in total = 3300-3700 iops T1Q1... That's similar to what you get. :-) What network do you use? An average 10G network should be around twice as fast. :-)

Regarding interrupts etc., here is a good article: https://blog.cloudflare.com/how-to-achieve-low-latency/ The irqbalance service (installed via the package manager) distributes network interrupts over all available cores, which may be worse than manually pinning them to different cores of 1 CPU if the server has multiple CPUs. I.e. interrupts should be handled on the CPU to which the corresponding PCIe lanes are connected; otherwise that CPU will "proxy" memory requests through the other CPU and it will slow things down. On the other hand, if you don't have irqbalance at all then interrupts may not be distributed over multiple CPU cores at all, which will also lower performance...
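
A sketch of manual pinning (the interface name is taken from the ethtool example below; the IRQ and core numbers are placeholders you have to look up on your own host):

cat /sys/class/net/enp3s0f0/device/numa_node   # which NUMA node / CPU the NIC is attached to
grep enp3s0f0 /proc/interrupts                 # list the NIC queue IRQ numbers
echo 4 > /proc/irq/51/smp_affinity             # hex mask 0x4 = core 2; repeat per IRQ, spreading
                                               # queues over cores of the NIC-local CPU
systemctl stop irqbalance                      # if pinning by hand, otherwise it rewrites the masks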

You can also try to disable offloads and interrupt coalescing (in case they're enabled), like ethtool -K enp3s0f0 gro off gso off tso off lro off sg off; ethtool -C enp3s0f0 rx-usecs 0.

lnsyyj commented 2 years ago

Great, I used tuned and installed irqbalance, which reduced the network latency between vitastor hosts to 22 usec.

sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 192.168.1.217   PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=22594; ReceivedMessages=22593
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12442; ReceivedMessages=12442
sockperf: ====> avg-latency=22.069 (std-dev=0.871)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 22.069 usec
sockperf: Total 12442 observations; each percentile contains 124.42 observations
sockperf: ---> <MAX> observation =   74.262
sockperf: ---> percentile 99.999 =   74.262
sockperf: ---> percentile 99.990 =   63.500
sockperf: ---> percentile 99.900 =   27.764
sockperf: ---> percentile 99.000 =   24.001
sockperf: ---> percentile 90.000 =   22.471
sockperf: ---> percentile 75.000 =   22.075
sockperf: ---> percentile 50.000 =   21.939
sockperf: ---> percentile 25.000 =   21.828
sockperf: ---> <MIN> observation =   21.307

After adjusting the network card interrupts with ethtool -K enp3s0f0 gro off gso off tso off lro off sg off; ethtool -C enp3s0f0 rx-usecs 0, 4K randwrite IOPS increased to 3638 (270.60 usec latency).

vitalif commented 2 years ago

Latency from inside the VM doesn't matter, because the vitastor client connects to the OSDs from the host.

lnsyyj commented 2 years ago

Latency from inside the VM doesn't matter, because the vitastor client connects to the OSDs from the host.

Thanks, I get it. Then I am going to split an SSD into 2 OSDs to see if the latency is reduced.

vitalif commented 2 years ago

Q1 latency won't be reduced by the split, but throughput at large iodepth should improve, yeah :-) By the way, what's the Q1T1 latency now, after applying the tuned fixes?

lnsyyj commented 2 years ago

With tuned-adm profile network-latency, latency is reduced, but throughput is also reduced.

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3638 14.9MB/s 270.6 (usec)
100GB 4K randread 4095 16.8MB/s 240.27 (usec)
100GB 8K randwrite 3271 26.8MB/s 301.54 (usec)
100GB 8K randread 3414 27.0MB/s 288.84 (usec)
100GB 16K randwrite 2732 44.8MB/s 361.83 (usec)
100GB 16K randread 2596 42.5MB/s 380.87 (usec)
100GB 32K randwrite 2027 66.4MB/s 488.94 (usec)
100GB 32K randread 2243 73.5MB/s 441.61 (usec)
100GB 64K randwrite 1401 91.8MB/s 709.41 (usec)
100GB 64K randread 1812 119MB/s 547.71 (usec)
100GB 128K randwrite 1134 149MB/s 877.56 (usec)
100GB 128K randread 1384 181MB/s 718.4 (usec)
100GB 256K randwrite 864 227MB/s 1152.79 (usec)
100GB 256K randread 1143 300MB/s 870.33 (usec)
100GB 512K randwrite 581 305MB/s 1716.65 (usec)
100GB 512K randread 905 475MB/s 1100.02 (usec)
100GB 1M randwrite 355 373MB/s 2806.67 (usec)
100GB 1M randread 617 647MB/s 1615.37 (usec)
vitalif commented 2 years ago

3300 -> 3600 iops isn't much of an improvement :-) Something in the network is probably still slow...

vitalif commented 2 years ago

You can check what the OSDs say in their logs about average latency. 'avg latency for op xxx' is the average execution latency for the operation, excluding the roundtrip time from and to the client, but including the latency of sub-operations triggered by the primary OSD on the secondaries. For "secondary" ops like "write_stable" it's just disk latency, because they don't generate subops. 'avg latency for subop xxx' is the average execution latency for the sub-operation, including the roundtrip to the secondary OSD.

vitalif commented 2 years ago

So in your case the 'write_stable' op latency is around 60us (disk), and the 'write_stable' subop latency is 160-200us, so RTT is still 100-140us. Why is it so slow? :-) What network do you use?

lnsyyj commented 2 years ago

My configuration is

switch: S6720-SI Series Multi GE Switches
CPU: 2x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (10 cores @ 2.80GHz)
Network card: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Very strange; I don't know why my network latency is so high. I don't have latency figures for the network card and switch, and I am not sure how much the system itself is wasting.

Does the OSD ping log mean that there is no I/O?

Oct 19 06:38:00 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 115 us
Oct 19 06:38:06 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 117 us
Oct 19 06:38:09 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 110 us
Oct 19 06:38:15 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 124 us
Oct 19 06:38:21 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 113 us
vitalif commented 2 years ago

Idle pings may be slower than 'real' even when CPU powersave is turned off. I get 60 us idle ping time between 2 osds on localhost even with powersave disabled, but it drops to ~35 us when testing T1Q1 randread. But in your case 120 us seems similar to the "normal" ping, i.e. similar to just the network RTT in your case... how high is your ping latency when testing T1Q1 randread?

lnsyyj commented 2 years ago

For T1Q1 4K randread, fio ran for 5 minutes starting at 21:45:04. The ping latency did not increase during the run, but when fio ended, the OSD ping latency on vita-2 increased.

I have uploaded the log to: https://github.com/lnsyyj/vitastor-log/tree/main/vitastor-osd-log/20211020

lnsyyj commented 2 years ago

During 64k randwrite, the following appeared in the OSD log. Is this journal garbage collection?

Oct 20 03:12:55 vita-3 vitastor-osd[1679]: Ran out of journal space (used_start=0086f000, next_free=0084d000, dirty_start=0083c000)
vitalif commented 2 years ago

During 64k randwrite, the following appeared in the OSD log. Is this journal garbage collection?

As long as it doesn't block or noticeably slow down writes, it's OK. It means that the OSD writes journal entries faster than it flushes old entries. You can try to raise the journal size, but it may not remove these messages, because it's the speed that matters, not just the free space. Anyway, some amount of these messages is OK :-)

vitalif commented 2 years ago

I have uploaded the log to: https://github.com/lnsyyj/vitastor-log/tree/main/vitastor-osd-log/20211020

Yes, it seems the ping time is still about 90-100us even when the OSDs are active... so it's still the same; the network problem should be found and fixed to improve latency :-)

I also tested vitastor with a single NVMe and yes, it seems it requires 2-3 OSDs per NVMe to fully utilize linear write speed, just because of memory copying.

lnsyyj commented 2 years ago

I'm thinking about how to reduce the network latency, but maybe my hardware has reached its limit.

vitalif commented 2 years ago

I'm thinking about how to reduce the network latency, but maybe my hardware has reached its limit.

But at the same time you say sockperf shows ~22us? Can you recheck sockperf between OSD nodes and between OSD and client?

lnsyyj commented 2 years ago

I have 3 servers in a vitastor cluster, and a virtual machine runs on one of them.

vita-1: 10.10.0.215
vita-2: 10.10.0.216 (the node running the virtual machine)
vita-3: 10.10.0.217
Ubuntu 18.04 virtual machine: 192.168.122.213

The following are the results of sockperf running between physical nodes:

10.10.0.215 -> 10.10.0.215
root@vita-1:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=66350; ReceivedMessages=66349
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.546 sec; SentMessages=36325; ReceivedMessages=36325
sockperf: ====> avg-latency=7.484 (std-dev=0.157)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 7.484 usec
sockperf: Total 36325 observations; each percentile contains 363.25 observations
sockperf: ---> <MAX> observation =   15.054
sockperf: ---> percentile 99.999 =   15.054
sockperf: ---> percentile 99.990 =   11.008
sockperf: ---> percentile 99.900 =    9.398
sockperf: ---> percentile 99.000 =    7.791
sockperf: ---> percentile 90.000 =    7.603
sockperf: ---> percentile 75.000 =    7.538
sockperf: ---> percentile 50.000 =    7.466
sockperf: ---> percentile 25.000 =    7.407
sockperf: ---> <MIN> observation =    6.336

10.10.0.215 -> 10.10.0.216
root@vita-1:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23199; ReceivedMessages=23198
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12788; ReceivedMessages=12788
sockperf: ====> avg-latency=21.477 (std-dev=0.384)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.477 usec
sockperf: Total 12788 observations; each percentile contains 127.88 observations
sockperf: ---> <MAX> observation =   29.251
sockperf: ---> percentile 99.999 =   29.251
sockperf: ---> percentile 99.990 =   28.595
sockperf: ---> percentile 99.900 =   27.234
sockperf: ---> percentile 99.000 =   22.760
sockperf: ---> percentile 90.000 =   21.633
sockperf: ---> percentile 75.000 =   21.530
sockperf: ---> percentile 50.000 =   21.434
sockperf: ---> percentile 25.000 =   21.349
sockperf: ---> <MIN> observation =   20.915

10.10.0.215 -> 10.10.0.217
root@vita-1:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23040; ReceivedMessages=23039
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12695; ReceivedMessages=12695
sockperf: ====> avg-latency=21.640 (std-dev=1.140)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.640 usec
sockperf: Total 12695 observations; each percentile contains 126.95 observations
sockperf: ---> <MAX> observation =   84.146
sockperf: ---> percentile 99.999 =   84.146
sockperf: ---> percentile 99.990 =   77.598
sockperf: ---> percentile 99.900 =   28.659
sockperf: ---> percentile 99.000 =   24.601
sockperf: ---> percentile 90.000 =   21.860
sockperf: ---> percentile 75.000 =   21.626
sockperf: ---> percentile 50.000 =   21.495
sockperf: ---> percentile 25.000 =   21.386
sockperf: ---> <MIN> observation =   20.845

10.10.0.216 -> 10.10.0.215
root@vita-2:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23230; ReceivedMessages=23229
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12783; ReceivedMessages=12783
sockperf: ====> avg-latency=21.489 (std-dev=0.607)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.489 usec
sockperf: Total 12783 observations; each percentile contains 127.83 observations
sockperf: ---> <MAX> observation =   30.769
sockperf: ---> percentile 99.999 =   30.769
sockperf: ---> percentile 99.990 =   29.548
sockperf: ---> percentile 99.900 =   27.635
sockperf: ---> percentile 99.000 =   24.835
sockperf: ---> percentile 90.000 =   21.720
sockperf: ---> percentile 75.000 =   21.528
sockperf: ---> percentile 50.000 =   21.382
sockperf: ---> percentile 25.000 =   21.271
sockperf: ---> <MIN> observation =   20.840

10.10.0.216 -> 10.10.0.216
root@vita-2:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=48581; ReceivedMessages=48580
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=26678; ReceivedMessages=26678
sockperf: ====> avg-latency=10.271 (std-dev=0.200)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 10.271 usec
sockperf: Total 26678 observations; each percentile contains 266.78 observations
sockperf: ---> <MAX> observation =   17.884
sockperf: ---> percentile 99.999 =   17.884
sockperf: ---> percentile 99.990 =   14.199
sockperf: ---> percentile 99.900 =   12.691
sockperf: ---> percentile 99.000 =   10.743
sockperf: ---> percentile 90.000 =   10.389
sockperf: ---> percentile 75.000 =   10.320
sockperf: ---> percentile 50.000 =   10.251
sockperf: ---> percentile 25.000 =   10.188
sockperf: ---> <MIN> observation =    9.932

10.10.0.216 -> 10.10.0.217
root@vita-2:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=21431; ReceivedMessages=21430
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11795; ReceivedMessages=11795
sockperf: ====> avg-latency=23.300 (std-dev=1.512)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 23.300 usec
sockperf: Total 11795 observations; each percentile contains 117.95 observations
sockperf: ---> <MAX> observation =  126.662
sockperf: ---> percentile 99.999 =  126.662
sockperf: ---> percentile 99.990 =   97.480
sockperf: ---> percentile 99.900 =   29.805
sockperf: ---> percentile 99.000 =   26.100
sockperf: ---> percentile 90.000 =   23.848
sockperf: ---> percentile 75.000 =   23.254
sockperf: ---> percentile 50.000 =   23.089
sockperf: ---> percentile 25.000 =   22.974
sockperf: ---> <MIN> observation =   22.592

10.10.0.217 -> 10.10.0.215
root@vita-3:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23042; ReceivedMessages=23041
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12689; ReceivedMessages=12689
sockperf: ====> avg-latency=21.618 (std-dev=0.493)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.618 usec
sockperf: Total 12689 observations; each percentile contains 126.89 observations
sockperf: ---> <MAX> observation =   29.375
sockperf: ---> percentile 99.999 =   29.375
sockperf: ---> percentile 99.990 =   29.023
sockperf: ---> percentile 99.900 =   27.652
sockperf: ---> percentile 99.000 =   23.687
sockperf: ---> percentile 90.000 =   21.809
sockperf: ---> percentile 75.000 =   21.645
sockperf: ---> percentile 50.000 =   21.529
sockperf: ---> percentile 25.000 =   21.430
sockperf: ---> <MIN> observation =   20.988

10.10.0.217 -> 10.10.0.216
root@vita-3:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=21532; ReceivedMessages=21531
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11849; ReceivedMessages=11849
sockperf: ====> avg-latency=23.176 (std-dev=0.623)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 23.176 usec
sockperf: Total 11849 observations; each percentile contains 118.49 observations
sockperf: ---> <MAX> observation =   34.893
sockperf: ---> percentile 99.999 =   34.893
sockperf: ---> percentile 99.990 =   31.490
sockperf: ---> percentile 99.900 =   29.786
sockperf: ---> percentile 99.000 =   25.420
sockperf: ---> percentile 90.000 =   23.761
sockperf: ---> percentile 75.000 =   23.166
sockperf: ---> percentile 50.000 =   23.002
sockperf: ---> percentile 25.000 =   22.890
sockperf: ---> <MIN> observation =   22.465

10.10.0.217 -> 10.10.0.217
root@vita-3:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=48284; ReceivedMessages=48283
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=26512; ReceivedMessages=26512
sockperf: ====> avg-latency=10.335 (std-dev=2.412)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 10.335 usec
sockperf: Total 26512 observations; each percentile contains 265.12 observations
sockperf: ---> <MAX> observation =  361.655
sockperf: ---> percentile 99.999 =  361.655
sockperf: ---> percentile 99.990 =   25.666
sockperf: ---> percentile 99.900 =   17.767
sockperf: ---> percentile 99.000 =   13.521
sockperf: ---> percentile 90.000 =   10.381
sockperf: ---> percentile 75.000 =   10.298
sockperf: ---> percentile 50.000 =   10.225
sockperf: ---> percentile 25.000 =   10.162
sockperf: ---> <MIN> observation =    9.910

The following are the results of sockperf from inside the virtual machine to the physical nodes:

192.168.122.213 -> 10.10.0.215
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=8929; ReceivedMessages=8928
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=4865; ReceivedMessages=4865
sockperf: ====> avg-latency=56.519 (std-dev=11.084)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 56.519 usec
sockperf: Total 4865 observations; each percentile contains 48.65 observations
sockperf: ---> <MAX> observation =  209.568
sockperf: ---> percentile 99.999 =  209.568
sockperf: ---> percentile 99.990 =  209.568
sockperf: ---> percentile 99.900 =  128.706
sockperf: ---> percentile 99.000 =   80.863
sockperf: ---> percentile 90.000 =   71.991
sockperf: ---> percentile 75.000 =   63.230
sockperf: ---> percentile 50.000 =   55.109
sockperf: ---> percentile 25.000 =   47.207
sockperf: ---> <MIN> observation =   42.518

192.168.122.213 -> 10.10.0.216
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=20648; ReceivedMessages=20647
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11407; ReceivedMessages=11407
sockperf: ====> avg-latency=24.080 (std-dev=0.909)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 24.080 usec
sockperf: Total 11407 observations; each percentile contains 114.07 observations
sockperf: ---> <MAX> observation =   36.168
sockperf: ---> percentile 99.999 =   36.168
sockperf: ---> percentile 99.990 =   35.577
sockperf: ---> percentile 99.900 =   30.959
sockperf: ---> percentile 99.000 =   28.231
sockperf: ---> percentile 90.000 =   24.860
sockperf: ---> percentile 75.000 =   24.321
sockperf: ---> percentile 50.000 =   23.944
sockperf: ---> percentile 25.000 =   23.768
sockperf: ---> <MIN> observation =   21.944

192.168.122.213 -> 10.10.0.217
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10316; ReceivedMessages=10315
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=5950; ReceivedMessages=5950
sockperf: ====> avg-latency=46.208 (std-dev=1.320)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 46.208 usec
sockperf: Total 5950 observations; each percentile contains 59.50 observations
sockperf: ---> <MAX> observation =   62.353
sockperf: ---> percentile 99.999 =   62.353
sockperf: ---> percentile 99.990 =   61.527
sockperf: ---> percentile 99.900 =   57.462
sockperf: ---> percentile 99.000 =   52.314
sockperf: ---> percentile 90.000 =   46.679
sockperf: ---> percentile 75.000 =   46.213
sockperf: ---> percentile 50.000 =   45.972
sockperf: ---> percentile 25.000 =   45.672
sockperf: ---> <MIN> observation =   44.708
lnsyyj commented 2 years ago

The performance of vitastor at T1Q1 is higher than Ceph's, but at T16Q4 it is not as high as Ceph's. Why is this?

vitalif commented 2 years ago

No idea. In my tests it's faster in all modes. T16Q4? How do you test Ceph in T16Q4? fio -ioengine=rbd -numjobs=16 -rbdname=...? I'm asking because this case is particularly slow in Ceph, i.e. -numjobs > 1 against the same RBD image leads to a ~10x performance drop in Ceph.

lnsyyj commented 2 years ago

Both T1Q1 and T16Q4 are tested in a virtual machine with 16 cores and 32GB of memory. The virtual machine uses libaio to read and write the virtual disk (vdb...). The fio librbd engine is not used for testing, because Ceph RBD takes an exclusive lock, which leads to serious degradation of multi-threaded performance.

vitalif commented 2 years ago

Ok, so how much do you get, from both vitastor and ceph?

lnsyyj commented 2 years ago

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 32K randwrite 17.4k 570MB/s 3675.78 (usec)
100GB 32K randread 34.1k 1118MB/s 1871.4 (usec)
100GB 64K randwrite 11.1k 727MB/s 5762.32 (usec)
100GB 64K randread 27.4k 1798MB/s 2328.34 (usec)

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 32K randwrite 11.3k 372MB/s 5639.3 (usec)
100GB 32K randread 25.3k 828MB/s 2529.64 (usec)
100GB 64K randwrite 6214 407MB/s 10293.83 (usec)
100GB 64K randread 14.9k 974MB/s 4303.9 (usec)
vitalif commented 2 years ago

I think that's due to the rather high block size. There are 2 issues:

  1. The Vitastor client is single-threaded, so it may hit bandwidth limits faster (at least without RDMA) - this is probably what impacts random reads.
  2. Vitastor has a double write for all writes < 128K (partial object writes), i.e. all of them are journaled. Ceph on SSDs has bluestore_prefer_deferred_size_ssd=0, so it redirect-writes everything and doesn't have the double write. Which is, again, better for larger block sizes (but not so good for small writes, because it requires clever defragmentation algorithms).

You can recheck with a 128K block size and compare whether it improves.
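
If you want to confirm what the Ceph side is doing, the deferred-write threshold can be inspected with the standard Ceph config commands (the option name is the one mentioned above; the exact value depends on your release and any overrides):

ceph config get osd bluestore_prefer_deferred_size_ssd
ceph config show osd.0 | grep bluestore_prefer_deferred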

vitalif commented 2 years ago

Single-threaded TCP caps out at around 1200 MB/s, plus or minus (depending on the exact CPU model), so it's a noticeable limit. And 90% of the CPU load is just memory copying. I'm even thinking about integrating VPP support to overcome it :-)) It may be interesting even if it just removes the need to copy memory, without improving latency.

lnsyyj commented 2 years ago

That's very interesting. Here are the performance results of the test: same hardware, a 16-core / 32GB virtual machine, 100GB virtual disk, Ubuntu 18.04 guest, fio T16Q4.

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 28.2k 116MB/s 2263.44 (usec)
100GB 4K randread 49.7k 204MB/s 1282.78 (usec)
100GB 8K randwrite 25.2k 206MB/s 2536.37 (usec)
100GB 8K randread 49.4k 405MB/s 1290.51 (usec)
100GB 16K randwrite 19.9k 327MB/s 3205.88 (usec)
100GB 16K randread 37.0k 622MB/s 1681.16 (usec)
100GB 32K randwrite 11.3k 372MB/s 5639.3 (usec)
100GB 32K randread 25.3k 828MB/s 2529.64 (usec)
100GB 64K randwrite 6214 407MB/s 10293.83 (usec)
100GB 64K randread 14.9k 974MB/s 4303.9 (usec)
100GB 128K randwrite 4658 611MB/s 13734.69 (usec)
100GB 128K randread 7814 1024MB/s 8185.45 (usec)
100GB 256K randwrite 2343 614MB/s 27309.35 (usec)
100GB 256K randread 3566 935MB/s 17940.26 (usec)
100GB 512K randwrite 1154 605MB/s 55.45 (msec)
100GB 512K randread 1652 866MB/s 38.72 (msec)
100GB 1M randwrite 548 575MB/s 116.76 (msec)
100GB 1M randread 845 886MB/s 75.7 (msec)

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 21.5k 88.2MB/s 2965.98 (usec)
100GB 4K randread 46.7k 191MB/s 1364.4 (usec)
100GB 8K randwrite 19.8k 162MB/s 3222.98 (usec)
100GB 8K randread 45.2k 370MB/s 1411.31 (usec)
100GB 16K randwrite 19.6k 320MB/s 3268.26 (usec)
100GB 16K randread 39.8k 651MB/s 1605.32 (usec)
100GB 32K randwrite 17.4k 570MB/s 3675.78 (usec)
100GB 32K randread 34.1k 1118MB/s 1871.4 (usec)
100GB 64K randwrite 11.1k 727MB/s 5762.32 (usec)
100GB 64K randread 27.4k 1798MB/s 2328.34 (usec)
100GB 128K randwrite 5649 741MB/s 11322.44 (usec)
100GB 128K randread 18.8k 2469MB/s 3392.48 (usec)
100GB 256K randwrite 2846 746MB/s 22474.51 (usec)
100GB 256K randread 10.1k 2661MB/s 6300.06 (usec)
100GB 512K randwrite 1427 748MB/s 44.82 (msec)
100GB 512K randread 5120 2685MB/s 12493 (usec)
100GB 1M randwrite 708 743MB/s 90.35 (msec)
100GB 1M randread 2567 2692MB/s 24922.74 (usec)
vitalif commented 2 years ago

It seems I found a way to easily improve client throughput: io_uring async execution via worker threads. It makes latency slightly worse, but only slightly, i.e. it only adds 5-10us (a context switch). I'll do some more tests and then I think I'll add options to enable it :-)

lnsyyj commented 2 years ago

That's great. I'm dealing with some other things at the moment, but I believe I will be back soon.

vitalif commented 2 years ago

I finally found an idiotic problem which was more than halving the send speed. :-) I was testing linear reads from a non-existing inode (basically noop reads without actual disk I/O) and I discovered that the speed was only 400-500 MB/s instead of 1200 MB/s (10 Gbit/s) with the real OSD, while with stub_uring_osd it was 1200 MB/s. I dug into the problem for about a week :-) And finally... the problem was listen(listen_fd, <uninitialized variable>) in the real OSD compared to listen(listen_fd, 128) in stub_uring_osd :D And this thing was slowing down socket sends. I don't know HOW. :-) So... I fixed it in master, can you please retest )))
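
One way to sanity-check a daemon's listen() backlog from the outside (assuming ss from iproute2 is installed and you run it as root): for sockets in LISTEN state, the Send-Q column of ss shows the configured backlog.

ss -ltnp | grep vitastor-osd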

lnsyyj commented 2 years ago

Great, cool! This is already great! I will find time to test it.