vitalif / vitastor

Simplified distributed block and file storage with strong consistency, like in Ceph (repository mirror)
https://vitastor.io

[performance] Performance degradation above 32k #22

Open lnsyyj opened 2 years ago

lnsyyj commented 2 years ago

Hi @vitalif , I tested the performance of vitastor without RDMA.

Below 32k (4k, 8k, 16k), performance is higher than Ceph and latency is lower. At 32k and above (32k, 64k, 128k, 256k, 512k, 1m), performance is lower than Ceph and latency is higher. Both random and sequential I/O have been tested, and the results are the same.

Does 32k have a special meaning? Can I adjust some parameters to improve performance at 32k and above?

vitalif commented 2 years ago

I don't think so... There is a special meaning for 128k - it's the block (object) size, i.e. writes smaller than 128k are first written to the journal (so they result in additional write amplification), and writes of 128k and larger are written using redirect-write. 32k-1M performance lower than Ceph is strange, can you provide any details?
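
As a back-of-the-envelope illustration of that extra write amplification (illustrative arithmetic, not a measured figure): a 32k write below the 128k object size is journaled first and later flushed into its place inside the object, so the same data reaches the SSD at least twice plus metadata updates, i.e. roughly 2x write amplification or more; a 128k or larger write is redirect-written once to a fresh location.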

lnsyyj commented 2 years ago

I have 3 physical machines, each with 1 NVMe; the 3 machines form an etcd cluster. The vitastor pool uses 3 replicas (16 PGs). I use libvirt/QEMU to create an Ubuntu 18.04 virtual machine and create two inodes in vitastor (100GB each), one as the system disk and the other as the fio performance test disk. fio is run inside the virtual machine against the test disk.
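
For reference, a replicated pool like this might be defined roughly as follows via the etcd v3 API (a sketch only - it assumes the default /vitastor etcd prefix and the pool-config layout from the vitastor docs; the pool id, name and endpoint are examples):

etcdctl --endpoints=http://192.168.1.217:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":3,"pg_minsize":2,"pg_count":16}}'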

OSD parameters

root@vita-3:~# systemctl cat vitastor-osd3.service 
# /etc/systemd/system/vitastor-osd3.service
[Unit]
Description=Vitastor object storage daemon osd.3
After=network-online.target local-fs.target time-sync.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=vitastor.target

[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
LimitMEMLOCK=infinity
ExecStart=/usr/bin/vitastor-osd \
    --etcd_address 192.168.1.217:2379/v3 \
    --bind_address 192.168.1.217 \
    --osd_num 3 \
    --disable_data_fsync 1 \
    --flusher_count 8 \
    --immediate_commit all \
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \
    --journal_no_same_sector_overwrites true \
    --journal_sector_buffer_count 1024 \
    --data_device /dev/disk/by-partuuid/2e2c938b-5f88-4354-8db6-7bac132b5b67 \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 798183424
WorkingDirectory=/
ExecStartPre=+chown vitastor:vitastor /dev/disk/by-partuuid/2e2c938b-5f88-4354-8db6-7bac132b5b67
User=vitastor
PrivateTmp=false
TasksMax=infinity
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=vitastor.target

fio parameters

root@yujiang:~# cat run_fio_multhread_iscsi.sh 
#!/bin/bash
set -x

#SHELL_FOLDER=$(dirname "$0")
#source ${SHELL_FOLDER}/common_variable_iscsi

IOENGINE="libaio"

FILE_NAME="/dev/vdb"

RW=(randwrite randread)
BS=(4k 8k 16k 32k 64k 128k 256k 512k 1m)

NUMJOBS=16
DIRECT=1
SIZE="100g"
IODEPTH=4
RUNTIME=300

for rw_name in ${RW[@]}
do
  for bs_size in ${BS[@]}
  do
    fio -ioengine=${IOENGINE} -numjobs=${NUMJOBS} -direct=${DIRECT} -size=${SIZE} -iodepth=${IODEPTH} -runtime=${RUNTIME} -rw=${rw_name} -ba=${bs_size} -bs=${bs_size} -filename=${FILE_NAME} -name="iscsi_${IOENGINE}_${IODEPTH}_${rw_name}_${bs_size}" -group_reporting > iscsi_${IOENGINE}_${IODEPTH}_${rw_name}_${bs_size}.log
    sleep 60
  done
done
lnsyyj commented 2 years ago

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 19.7k 80.7MB/s 3243.62 (usec)
100GB 4K randread 40.7k 167MB/s 1566.41 (usec)
100GB 8K randwrite 17.6k 145MB/s 3623.02 (usec)
100GB 8K randread 39.8k 326MB/s 1602.89 (usec)
100GB 16K randwrite 16.2k 266MB/s 3940.96 (usec)
100GB 16K randread 33.4k 548MB/s 1910.04 (usec)
100GB 32K randwrite 14.4k 471MB/s 4451.7 (usec)
100GB 32K randread 28.1k 919MB/s 2276.41 (usec)
100GB 64K randwrite 12.6k 825MB/s 5076.83 (usec)
100GB 64K randread 16.0k 1049MB/s 3993.57 (usec)
100GB 128K randwrite 6919 907MB/s 9243.13 (usec)
100GB 128K randread 8002 1049MB/s 7991.51 (usec)
100GB 256K randwrite 3521 923MB/s 18.17 (msec)
100GB 256K randread 4001 1049MB/s 15987.12 (usec)
100GB 512K randwrite 1741 913MB/s 36.73 (msec)
100GB 512K randread 2000 1049MB/s 31979.61 (usec)
100GB 1M randwrite 871 914MB/s 73.43 (msec)
100GB 1M randread 1000 1049MB/s 63964.36 (usec)

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 30.2k 124MB/s 2113.36 (usec)
100GB 4K randread 46.2k 189MB/s 1379.11 (usec)
100GB 8K randwrite 27.1k 222MB/s 2357.34 (usec)
100GB 8K randread 44.4k 364MB/s 1435.87 (usec)
100GB 16K randwrite 20.6k 337MB/s 3103.98 (usec)
100GB 16K randread 35.9k 589MB/s 1776.66 (usec)
100GB 32K randwrite 10.9k 358MB/s 5851.9 (usec)
100GB 32K randread 24.3k 797MB/s 2621.98 (usec)
100GB 64K randwrite 6020 395MB/s 10618.76 (usec)
100GB 64K randread 14.3k 937MB/s 4466.6 (usec)
100GB 128K randwrite 4971 652MB/s 12861.98 (usec)
100GB 128K randread 7984 1047MB/s 8005.8 (usec)
100GB 256K randwrite 2477 649MB/s 25819.94 (usec)
100GB 256K randread 3556 932MB/s 17988.72 (usec)
100GB 512K randwrite 1223 641MB/s 52.3 (msec)
100GB 512K randread 1733 909MB/s 36920.56 (usec)
100GB 1M randwrite 622 653MB/s 102.72 (msec)
100GB 1M randread 927 973MB/s 68.97 (msec)
vitalif commented 2 years ago

First of all, I think you should remove the --flusher_count option. In new versions it doesn't require manual tuning, because it's auto-tuned between 1 and max_flusher_count, and --flusher_count actually sets max_flusher_count. So can you remove the option and retest?

lnsyyj commented 2 years ago

First of all, I think you should remove the --flusher_count option. In new versions it doesn't require manual tuning, because it's auto-tuned between 1 and max_flusher_count, and --flusher_count actually sets max_flusher_count. So can you remove the option and retest?

Thank you. These results were actually obtained without the flusher_count parameter; it has not been tested with this parameter added.

lnsyyj commented 2 years ago

The following are the test results of T1Q1

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3012 12.3MB/s 327.57 (usec)
100GB 4K randread 2997 12.3MB/s 329.13 (usec)
100GB 8K randwrite 2189 17.9MB/s 451.76 (usec)
100GB 8K randread 2135 17.5MB/s 463.5 (usec)
100GB 16K randwrite 1859 30.5MB/s 532.83 (usec)
100GB 16K randread 1681 27.6MB/s 589.69 (usec)
100GB 32K randwrite 1499 49.1MB/s 661.65 (usec)
100GB 32K randread 1491 48.9MB/s 665.35 (usec)
100GB 64K randwrite 1195 78.3MB/s 831.32 (usec)
100GB 64K randread 1398 91.6MB/s 710.29 (usec)
100GB 128K randwrite 1204 158MB/s 824.82 (usec)
100GB 128K randread 1144 150MB/s 868.99 (usec)
100GB 256K randwrite 1017 267MB/s 977.98 (usec)
100GB 256K randread 1070 281MB/s 928.77 (usec)
100GB 512K randwrite 747 392MB/s 1333.52 (usec)
100GB 512K randread 973 510MB/s 1022.11 (usec)
100GB 1M randwrite 484 509MB/s 2056.96 (usec)
100GB 1M randread 774 812MB/s 1285.82 (usec)

In the README your latency reaches 0.14ms; how can I achieve the same result?

T1Q1 write: 7087 iops (0.14ms latency)
T1Q1 read: 6838 iops (0.145ms latency)

Is the difference between my 0.32ms and your 0.14ms related to RDMA?

vitalif commented 2 years ago

Best latency is achieved when you disable powersaving: cpupower idle-set -D 0 && cpupower frequency-set --governor performance. You can also pin each vitastor OSD to 1 core and disable powersaving only for those cores.
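
A sketch of the per-core variant (the core number, PID selection and systemd drop-in are examples/assumptions, not vitastor-specific settings):

# disable deep C-states and force the performance governor on core 2 only
cpupower -c 2 idle-set -D 0
cpupower -c 2 frequency-set --governor performance
# pin the OSD process to that core (adjust the PID selection if several OSDs run on the host),
# or set CPUAffinity=2 in a [Service] drop-in for the vitastor-osd unit shown above
taskset -pc 2 $(pidof vitastor-osd)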

If it doesn't help then your network may be slow

Regarding writes, I still think your performance is too low. What if you run 1 fio job with rw=write bs=4M iodepth=4?
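
For example, reusing the values from the script above, such a run might look like this (a sketch only; the job name and runtime are arbitrary, /dev/vdb is taken from your setup):

fio -ioengine=libaio -direct=1 -rw=write -bs=4M -iodepth=4 -numjobs=1 -runtime=60 -filename=/dev/vdb -name=seqwrite_4m_q4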

vitalif commented 2 years ago

0.14 was achieved without rdma

lnsyyj commented 2 years ago

After enabling CPU performance mode, the effect is not obvious.

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3317 13.6MB/s 297.46 (usec)
100GB 4K randread 3143 12.9MB/s 313.77 (usec)
100GB 8K randwrite 2207 18.1MB/s 447.97 (usec)
100GB 8K randread 2268 18.6MB/s 435.96 (usec)
100GB 16K randwrite 1894 31.0MB/s 522.65 (usec)
100GB 16K randread 1755 28.8MB/s 564.65 (usec)
100GB 32K randwrite 1558 51.1MB/s 636.63 (usec)
100GB 32K randread 1548 50.7MB/s 640.94 (usec)
100GB 64K randwrite 1233 80.9MB/s 805.39 (usec)
100GB 64K randread 1422 93.3MB/s 697.73 (usec)
100GB 128K randwrite 1226 161MB/s 809.83 (usec)
100GB 128K randread 1170 153MB/s 848.97 (usec)
100GB 256K randwrite 1019 267MB/s 975.74 (usec)
100GB 256K randread 1158 304MB/s 858.06 (usec)
100GB 512K randwrite 768 403MB/s 1295.93 (usec)
100GB 512K randread 981 515MB/s 1013.64 (usec)
100GB 1M randwrite 507 532MB/s 1966.59 (usec)
100GB 1M randread 768 806MB/s 1296.3 (usec)
  1. Could the difference in CPU models explain why my performance is lower than your result? My CPU is
    2x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (10 cores @ 2.80GHz)

    and your CPU is

    2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
  2. Should each NVMe be split into 2 OSDs?
vitalif commented 2 years ago

Yes, 2 OSDs per NVMe should help bandwidth with large block sizes. The Vitastor OSD is single-threaded, so you're probably just hitting the single-thread bandwidth limit due to memory copying. RDMA would also help because it eliminates 1 of the 2 extra memory copies. DPDK could also help, but I haven't found usable TCP (or other) implementations for it, so it's not possible to implement yet.

Regarding latency, 3300 T1Q1 iops is still rather low; I'd expect at least 5000-5500 iops on average hardware. CPU model doesn't affect latency that much. Given that turning CPU powersaving off has no noticeable effect, I think it just doesn't work - something probably overrides that setting and it doesn't get applied. For example, RHEL's tuned may override it. You can also tune interrupts & NUMA, and enable/disable irqbalance and so on. In one of my tests, simply enabling irqbalance raised the bandwidth 3x :-)
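
A rough sketch of the 2-OSDs-per-NVMe split (the device name and 50/50 split are examples; this wipes the drive, so only do it on a disk you are re-initializing):

parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart osd-a 0% 50%
parted -s /dev/nvme0n1 mkpart osd-b 50% 100%
# then add a second vitastor-osd unit like the one above, pointing --data_device at the second
# partition's /dev/disk/by-partuuid/... link, with its own --osd_num and freshly initialized
# journal/meta/data offsets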

vitalif commented 2 years ago

You can also check your network with sockperf. sockperf sr --tcp on one node, sockperf pp -i SERVER_IP --tcp -m 4096 on the other one. Note that it reports half of the roundtrip (one-way latency).

lnsyyj commented 2 years ago

In fact, IOPS is 3 times higher with CPU performance mode turned on than with it off. I had already turned it on before; both of the results above were measured with CPU performance mode enabled.

What should I do about interrupts, NUMA, and enabling/disabling irqbalance? I am not very familiar with this area.

sockperf result

sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 192.168.1.216   PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=9605; ReceivedMessages=9604
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=4265; ReceivedMessages=4265
sockperf: ====> avg-latency=64.476 (std-dev=1.374)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 64.476 usec
sockperf: Total 4265 observations; each percentile contains 42.65 observations
sockperf: ---> <MAX> observation =   76.196
sockperf: ---> percentile 99.999 =   76.196
sockperf: ---> percentile 99.990 =   76.196
sockperf: ---> percentile 99.900 =   74.508
sockperf: ---> percentile 99.000 =   68.758
sockperf: ---> percentile 90.000 =   65.041
sockperf: ---> percentile 75.000 =   64.607
sockperf: ---> percentile 50.000 =   64.467
sockperf: ---> percentile 25.000 =   64.329
sockperf: ---> <MIN> observation =   53.082
vitalif commented 2 years ago

You have 65 us one-way latency, so your roundtrip time is 130 us. Theoretical performance of vitastor for replicated writes is 2 RTT + 1 disk write = 260 us + disk. Disk is, I think, around 10-40 us. So around 270-300 us in total = 3300-3700 iops T1Q1... That's similar to what you get. :-) What network do you use? An average 10G network should be around twice as fast. :-)

Regarding interrupts etc., here is a good article: https://blog.cloudflare.com/how-to-achieve-low-latency/ The irqbalance service (installed via the package manager) distributes network interrupts over all available cores, which may be worse than manually pinning them to different cores of 1 CPU if the server has multiple CPUs. I.e. interrupts should be handled on the CPU to which the corresponding PCIe lanes are connected; otherwise that CPU will "proxy" memory requests through the other CPU and it will slow things down. On the other hand, if you don't have irqbalance at all then interrupts may not be distributed over multiple CPU cores at all, which will also lower performance...
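
A sketch of manual pinning (the interface name is taken from the ethtool example below; the IRQ and core numbers are placeholders you have to look up on your own host):

cat /sys/class/net/enp3s0f0/device/numa_node   # which NUMA node / CPU the NIC is attached to
grep enp3s0f0 /proc/interrupts                 # list the NIC queue IRQ numbers
echo 4 > /proc/irq/51/smp_affinity             # hex mask 0x4 = core 2; repeat per IRQ, spreading
                                               # queues over cores of the NIC-local CPU
systemctl stop irqbalance                      # if pinning by hand, otherwise it rewrites the masks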

You can also try to disable offloads and interrupt coalescing (in case they're enabled), like ethtool -K enp3s0f0 gro off gso off tso off lro off sg off; ethtool -C enp3s0f0 rx-usecs 0.

lnsyyj commented 2 years ago

Great, I used tuned and installed irqbalance, which reduced the network latency between vitastor hosts to 22 usec.

sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 192.168.1.217   PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=22594; ReceivedMessages=22593
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12442; ReceivedMessages=12442
sockperf: ====> avg-latency=22.069 (std-dev=0.871)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 22.069 usec
sockperf: Total 12442 observations; each percentile contains 124.42 observations
sockperf: ---> <MAX> observation =   74.262
sockperf: ---> percentile 99.999 =   74.262
sockperf: ---> percentile 99.990 =   63.500
sockperf: ---> percentile 99.900 =   27.764
sockperf: ---> percentile 99.000 =   24.001
sockperf: ---> percentile 90.000 =   22.471
sockperf: ---> percentile 75.000 =   22.075
sockperf: ---> percentile 50.000 =   21.939
sockperf: ---> percentile 25.000 =   21.828
sockperf: ---> <MIN> observation =   21.307

After adjusting the network card interrupts with ethtool -K enp3s0f0 gro off gso off tso off lro off sg off; ethtool -C enp3s0f0 rx-usecs 0, 4K randwrite IOPS increased to 3638 (270.60 usec latency).

vitalif commented 2 years ago

Latency from inside the VM doesn't matter, because the vitastor client connects to the OSDs from the host.

lnsyyj commented 2 years ago

Latency from inside the VM doesn't matter, because the vitastor client connects to the OSDs from the host.

Thanks, I get it. Then I am going to split an SSD into 2 OSDs to see if the latency is reduced.

vitalif commented 2 years ago

Q1 latency won't be reduced by the split, but throughput at large iodepth should improve, yeah :-) By the way, what's the Q1T1 latency now, after applying the tuned fixes?

lnsyyj commented 2 years ago

With tuned-adm profile network-latency, latency is reduced, but throughput is also reduced.

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 3638 14.9MB/s 270.6 (usec)
100GB 4K randread 4095 16.8MB/s 240.27 (usec)
100GB 8K randwrite 3271 26.8MB/s 301.54 (usec)
100GB 8K randread 3414 27.0MB/s 288.84 (usec)
100GB 16K randwrite 2732 44.8MB/s 361.83 (usec)
100GB 16K randread 2596 42.5MB/s 380.87 (usec)
100GB 32K randwrite 2027 66.4MB/s 488.94 (usec)
100GB 32K randread 2243 73.5MB/s 441.61 (usec)
100GB 64K randwrite 1401 91.8MB/s 709.41 (usec)
100GB 64K randread 1812 119MB/s 547.71 (usec)
100GB 128K randwrite 1134 149MB/s 877.56 (usec)
100GB 128K randread 1384 181MB/s 718.4 (usec)
100GB 256K randwrite 864 227MB/s 1152.79 (usec)
100GB 256K randread 1143 300MB/s 870.33 (usec)
100GB 512K randwrite 581 305MB/s 1716.65 (usec)
100GB 512K randread 905 475MB/s 1100.02 (usec)
100GB 1M randwrite 355 373MB/s 2806.67 (usec)
100GB 1M randread 617 647MB/s 1615.37 (usec)
vitalif commented 2 years ago

3300 -> 3600 iops isn't much of an improvement :-) Something in the network is probably still slow...

vitalif commented 2 years ago

You can check what the OSDs say in their logs about average latency. 'avg latency for op xxx' is the average execution latency for the operation, excluding the roundtrip time from and to the client, but including the latency of sub-operations triggered by the primary OSD on the secondaries. For "secondary" ops like "write_stable" it's just disk latency, because they don't generate subops. 'avg latency for subop xxx' is the average execution latency for the sub-operation, including the roundtrip to the secondary OSD.

vitalif commented 2 years ago

So in your case the 'write_stable' op latency is around 60us (disk), and the 'write_stable' subop latency is 160-200us, so RTT is still 100-140us. Why is it so slow? :-) What network do you use?

lnsyyj commented 2 years ago

My configuration is

switch: S6720-SI Series Multi GE Switches
CPU: 2x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (10 cores @ 2.80GHz)
Network card: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)

Very strange; I don't know why my network latency is so high. I don't have latency figures for the network card and switch, and I am not sure how much the system itself is wasting.

Does the OSD ping log mean that there is no I/O?

Oct 19 06:38:00 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 115 us
Oct 19 06:38:06 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 117 us
Oct 19 06:38:09 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 110 us
Oct 19 06:38:15 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 124 us
Oct 19 06:38:21 vita-3 vitastor-osd[37215]: [OSD 3] avg latency for subop 15 (ping): 113 us
vitalif commented 2 years ago

Idle pings may be slower than 'real' even when CPU powersave is turned off. I get 60 us idle ping time between 2 osds on localhost even with powersave disabled, but it drops to ~35 us when testing T1Q1 randread. But in your case 120 us seems similar to the "normal" ping, i.e. similar to just the network RTT in your case... how high is your ping latency when testing T1Q1 randread?

lnsyyj commented 2 years ago

For T1Q1 4K randread, fio ran for 5 minutes starting at 21:45:04. The ping latency did not increase during the run, but when fio ended, the OSD ping latency on vita-2 increased.

I have uploaded the log to: https://github.com/lnsyyj/vitastor-log/tree/main/vitastor-osd-log/20211020

lnsyyj commented 2 years ago

During 64k randwrite, the following appeared in the OSD log. Is this journal garbage collection?

Oct 20 03:12:55 vita-3 vitastor-osd[1679]: Ran out of journal space (used_start=0086f000, next_free=0084d000, dirty_start=0083c000)
vitalif commented 2 years ago

During 64k randwrite, the following appeared in the OSD log. Is this journal garbage collection?

As long as it doesn't block or noticeably slow down writes, it's OK. It means that the OSD writes journal entries faster than it flushes old entries. You can try to raise the journal size, but it may not remove these messages, because it's the speed that matters, not just the free space. Anyway, some amount of these messages is OK :-)

vitalif commented 2 years ago

I have uploaded the log to: https://github.com/lnsyyj/vitastor-log/tree/main/vitastor-osd-log/20211020

Yes, it seems the ping time is still about 90-100us even when the OSDs are active... so it's still the same; the network problem should be found and fixed to improve latency :-)

I also tested vitastor with a single NVMe and yes, it seems it requires 2-3 OSDs per NVMe to fully utilize linear write speed, just because of memory copying.

lnsyyj commented 2 years ago

I'm thinking about how to reduce the network latency, but maybe my hardware has reached its limit.

vitalif commented 2 years ago

I'm thinking about how to reduce the network latency, but maybe my hardware has reached its limit.

But at the same time you say sockperf shows ~22us? Can you recheck sockperf between OSD nodes and between OSD and client?

lnsyyj commented 2 years ago

I have 3 servers in a vitastor cluster, and a virtual machine runs on one of them.

vita-1: 10.10.0.215
vita-2: 10.10.0.216 (the node running the virtual machine)
vita-3: 10.10.0.217
Ubuntu 18.04 virtual machine: 192.168.122.213

The following are the results of sockperf running between physical nodes:

10.10.0.215 -> 10.10.0.215
root@vita-1:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=66350; ReceivedMessages=66349
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.546 sec; SentMessages=36325; ReceivedMessages=36325
sockperf: ====> avg-latency=7.484 (std-dev=0.157)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 7.484 usec
sockperf: Total 36325 observations; each percentile contains 363.25 observations
sockperf: ---> <MAX> observation =   15.054
sockperf: ---> percentile 99.999 =   15.054
sockperf: ---> percentile 99.990 =   11.008
sockperf: ---> percentile 99.900 =    9.398
sockperf: ---> percentile 99.000 =    7.791
sockperf: ---> percentile 90.000 =    7.603
sockperf: ---> percentile 75.000 =    7.538
sockperf: ---> percentile 50.000 =    7.466
sockperf: ---> percentile 25.000 =    7.407
sockperf: ---> <MIN> observation =    6.336

10.10.0.215 -> 10.10.0.216
root@vita-1:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23199; ReceivedMessages=23198
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12788; ReceivedMessages=12788
sockperf: ====> avg-latency=21.477 (std-dev=0.384)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.477 usec
sockperf: Total 12788 observations; each percentile contains 127.88 observations
sockperf: ---> <MAX> observation =   29.251
sockperf: ---> percentile 99.999 =   29.251
sockperf: ---> percentile 99.990 =   28.595
sockperf: ---> percentile 99.900 =   27.234
sockperf: ---> percentile 99.000 =   22.760
sockperf: ---> percentile 90.000 =   21.633
sockperf: ---> percentile 75.000 =   21.530
sockperf: ---> percentile 50.000 =   21.434
sockperf: ---> percentile 25.000 =   21.349
sockperf: ---> <MIN> observation =   20.915

10.10.0.215 -> 10.10.0.217
root@vita-1:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23040; ReceivedMessages=23039
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12695; ReceivedMessages=12695
sockperf: ====> avg-latency=21.640 (std-dev=1.140)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.640 usec
sockperf: Total 12695 observations; each percentile contains 126.95 observations
sockperf: ---> <MAX> observation =   84.146
sockperf: ---> percentile 99.999 =   84.146
sockperf: ---> percentile 99.990 =   77.598
sockperf: ---> percentile 99.900 =   28.659
sockperf: ---> percentile 99.000 =   24.601
sockperf: ---> percentile 90.000 =   21.860
sockperf: ---> percentile 75.000 =   21.626
sockperf: ---> percentile 50.000 =   21.495
sockperf: ---> percentile 25.000 =   21.386
sockperf: ---> <MIN> observation =   20.845

10.10.0.216 -> 10.10.0.215
root@vita-2:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23230; ReceivedMessages=23229
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12783; ReceivedMessages=12783
sockperf: ====> avg-latency=21.489 (std-dev=0.607)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.489 usec
sockperf: Total 12783 observations; each percentile contains 127.83 observations
sockperf: ---> <MAX> observation =   30.769
sockperf: ---> percentile 99.999 =   30.769
sockperf: ---> percentile 99.990 =   29.548
sockperf: ---> percentile 99.900 =   27.635
sockperf: ---> percentile 99.000 =   24.835
sockperf: ---> percentile 90.000 =   21.720
sockperf: ---> percentile 75.000 =   21.528
sockperf: ---> percentile 50.000 =   21.382
sockperf: ---> percentile 25.000 =   21.271
sockperf: ---> <MIN> observation =   20.840

10.10.0.216 -> 10.10.0.216
root@vita-2:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=48581; ReceivedMessages=48580
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=26678; ReceivedMessages=26678
sockperf: ====> avg-latency=10.271 (std-dev=0.200)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 10.271 usec
sockperf: Total 26678 observations; each percentile contains 266.78 observations
sockperf: ---> <MAX> observation =   17.884
sockperf: ---> percentile 99.999 =   17.884
sockperf: ---> percentile 99.990 =   14.199
sockperf: ---> percentile 99.900 =   12.691
sockperf: ---> percentile 99.000 =   10.743
sockperf: ---> percentile 90.000 =   10.389
sockperf: ---> percentile 75.000 =   10.320
sockperf: ---> percentile 50.000 =   10.251
sockperf: ---> percentile 25.000 =   10.188
sockperf: ---> <MIN> observation =    9.932

10.10.0.216 -> 10.10.0.217
root@vita-2:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=21431; ReceivedMessages=21430
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11795; ReceivedMessages=11795
sockperf: ====> avg-latency=23.300 (std-dev=1.512)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 23.300 usec
sockperf: Total 11795 observations; each percentile contains 117.95 observations
sockperf: ---> <MAX> observation =  126.662
sockperf: ---> percentile 99.999 =  126.662
sockperf: ---> percentile 99.990 =   97.480
sockperf: ---> percentile 99.900 =   29.805
sockperf: ---> percentile 99.000 =   26.100
sockperf: ---> percentile 90.000 =   23.848
sockperf: ---> percentile 75.000 =   23.254
sockperf: ---> percentile 50.000 =   23.089
sockperf: ---> percentile 25.000 =   22.974
sockperf: ---> <MIN> observation =   22.592

10.10.0.217 -> 10.10.0.215
root@vita-3:~# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=23042; ReceivedMessages=23041
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=12689; ReceivedMessages=12689
sockperf: ====> avg-latency=21.618 (std-dev=0.493)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 21.618 usec
sockperf: Total 12689 observations; each percentile contains 126.89 observations
sockperf: ---> <MAX> observation =   29.375
sockperf: ---> percentile 99.999 =   29.375
sockperf: ---> percentile 99.990 =   29.023
sockperf: ---> percentile 99.900 =   27.652
sockperf: ---> percentile 99.000 =   23.687
sockperf: ---> percentile 90.000 =   21.809
sockperf: ---> percentile 75.000 =   21.645
sockperf: ---> percentile 50.000 =   21.529
sockperf: ---> percentile 25.000 =   21.430
sockperf: ---> <MIN> observation =   20.988

10.10.0.217 -> 10.10.0.216
root@vita-3:~# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=21532; ReceivedMessages=21531
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11849; ReceivedMessages=11849
sockperf: ====> avg-latency=23.176 (std-dev=0.623)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 23.176 usec
sockperf: Total 11849 observations; each percentile contains 118.49 observations
sockperf: ---> <MAX> observation =   34.893
sockperf: ---> percentile 99.999 =   34.893
sockperf: ---> percentile 99.990 =   31.490
sockperf: ---> percentile 99.900 =   29.786
sockperf: ---> percentile 99.000 =   25.420
sockperf: ---> percentile 90.000 =   23.761
sockperf: ---> percentile 75.000 =   23.166
sockperf: ---> percentile 50.000 =   23.002
sockperf: ---> percentile 25.000 =   22.890
sockperf: ---> <MIN> observation =   22.465

10.10.0.217 -> 10.10.0.217
root@vita-3:~# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-no.git == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=48284; ReceivedMessages=48283
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=26512; ReceivedMessages=26512
sockperf: ====> avg-latency=10.335 (std-dev=2.412)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 10.335 usec
sockperf: Total 26512 observations; each percentile contains 265.12 observations
sockperf: ---> <MAX> observation =  361.655
sockperf: ---> percentile 99.999 =  361.655
sockperf: ---> percentile 99.990 =   25.666
sockperf: ---> percentile 99.900 =   17.767
sockperf: ---> percentile 99.000 =   13.521
sockperf: ---> percentile 90.000 =   10.381
sockperf: ---> percentile 75.000 =   10.298
sockperf: ---> percentile 50.000 =   10.225
sockperf: ---> percentile 25.000 =   10.162
sockperf: ---> <MIN> observation =    9.910

The following are the results of sockperf from inside the virtual machine to the physical nodes:

192.168.122.213 -> 10.10.0.215
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.215 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.215     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=8929; ReceivedMessages=8928
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=4865; ReceivedMessages=4865
sockperf: ====> avg-latency=56.519 (std-dev=11.084)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 56.519 usec
sockperf: Total 4865 observations; each percentile contains 48.65 observations
sockperf: ---> <MAX> observation =  209.568
sockperf: ---> percentile 99.999 =  209.568
sockperf: ---> percentile 99.990 =  209.568
sockperf: ---> percentile 99.900 =  128.706
sockperf: ---> percentile 99.000 =   80.863
sockperf: ---> percentile 90.000 =   71.991
sockperf: ---> percentile 75.000 =   63.230
sockperf: ---> percentile 50.000 =   55.109
sockperf: ---> percentile 25.000 =   47.207
sockperf: ---> <MIN> observation =   42.518

192.168.122.213 -> 10.10.0.216
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.216 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.216     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=20648; ReceivedMessages=20647
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=11407; ReceivedMessages=11407
sockperf: ====> avg-latency=24.080 (std-dev=0.909)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 24.080 usec
sockperf: Total 11407 observations; each percentile contains 114.07 observations
sockperf: ---> <MAX> observation =   36.168
sockperf: ---> percentile 99.999 =   36.168
sockperf: ---> percentile 99.990 =   35.577
sockperf: ---> percentile 99.900 =   30.959
sockperf: ---> percentile 99.000 =   28.231
sockperf: ---> percentile 90.000 =   24.860
sockperf: ---> percentile 75.000 =   24.321
sockperf: ---> percentile 50.000 =   23.944
sockperf: ---> percentile 25.000 =   23.768
sockperf: ---> <MIN> observation =   21.944

192.168.122.213 -> 10.10.0.217
root@ubuntu:~/sockperf# sockperf pp -i 10.10.0.217 --tcp -m 4096
sockperf: == version #3.6-0.git737d1e3e5576 == 
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.10.0.217     PORT = 11111 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=1.000 sec; Warm up time=400 msec; SentMessages=10316; ReceivedMessages=10315
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=0.550 sec; SentMessages=5950; ReceivedMessages=5950
sockperf: ====> avg-latency=46.208 (std-dev=1.320)
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Latency is 46.208 usec
sockperf: Total 5950 observations; each percentile contains 59.50 observations
sockperf: ---> <MAX> observation =   62.353
sockperf: ---> percentile 99.999 =   62.353
sockperf: ---> percentile 99.990 =   61.527
sockperf: ---> percentile 99.900 =   57.462
sockperf: ---> percentile 99.000 =   52.314
sockperf: ---> percentile 90.000 =   46.679
sockperf: ---> percentile 75.000 =   46.213
sockperf: ---> percentile 50.000 =   45.972
sockperf: ---> percentile 25.000 =   45.672
sockperf: ---> <MIN> observation =   44.708
lnsyyj commented 2 years ago

The performance of vitastor at T1Q1 is higher than Ceph's, but at T16Q4 it is not as high as Ceph's. Why is this?

vitalif commented 2 years ago

No idea. In my tests it's faster in all modes. T16Q4? How do you test Ceph in T16Q4? fio -ioengine=rbd -numjobs=16 -rbdname=...? I'm asking because this case is particularly slow in Ceph, i.e. -numjobs > 1 against the same RBD image leads to a ~10x performance drop in Ceph.

lnsyyj commented 2 years ago

Both T1Q1 and T16Q4 are tested in a virtual machine with 16 cores and 32GB of memory. The virtual machine uses libaio to read and write the virtual disk (vdb...). The fio librbd engine is not used for testing, because Ceph RBD takes an exclusive lock, which leads to serious degradation of multi-threaded performance.

vitalif commented 2 years ago

Ok, so how much do you get, from both vitastor and ceph?

lnsyyj commented 2 years ago

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 32K randwrite 17.4k 570MB/s 3675.78 (usec)
100GB 32K randread 34.1k 1118MB/s 1871.4 (usec)
100GB 64K randwrite 11.1k 727MB/s 5762.32 (usec)
100GB 64K randread 27.4k 1798MB/s 2328.34 (usec)

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 32K randwrite 11.3k 372MB/s 5639.3 (usec)
100GB 32K randread 25.3k 828MB/s 2529.64 (usec)
100GB 64K randwrite 6214 407MB/s 10293.83 (usec)
100GB 64K randread 14.9k 974MB/s 4303.9 (usec)
vitalif commented 2 years ago

I think that's due to the rather high block size. There are 2 issues:

  1. The Vitastor client is single-threaded, so it may hit bandwidth limits faster (at least without RDMA) - this is probably what impacts random reads.
  2. Vitastor has a double write for all writes < 128K (partial object writes), i.e. all of them are journaled. Ceph on SSDs has bluestore_prefer_deferred_size_ssd=0, so it redirect-writes everything and doesn't have the double write. Which is, again, better for larger block sizes (but not so good for small writes, because it requires clever defragmentation algorithms).

You can recheck with a 128K block size and compare whether it improves.
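
If you want to confirm what the Ceph side is doing, the deferred-write threshold can be inspected with the standard Ceph config commands (the option name is the one mentioned above; the exact value depends on your release and any overrides):

ceph config get osd bluestore_prefer_deferred_size_ssd
ceph config show osd.0 | grep bluestore_prefer_deferred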

vitalif commented 2 years ago

Single-threaded TCP caps out at around 1200 MB/s, plus or minus (depending on the exact CPU model), so it's a noticeable limit. And 90% of the CPU load is just memory copying. I'm even thinking about integrating VPP support to overcome it :-)) It may be interesting even if it just removes the need to copy memory, without improving latency.

lnsyyj commented 2 years ago

That's very interesting. Here are the performance results of the test: same hardware, a 16-core / 32GB virtual machine, 100GB virtual disk, Ubuntu 18.04 guest, fio T16Q4.

vitastor

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 28.2k 116MB/s 2263.44 (usec)
100GB 4K randread 49.7k 204MB/s 1282.78 (usec)
100GB 8K randwrite 25.2k 206MB/s 2536.37 (usec)
100GB 8K randread 49.4k 405MB/s 1290.51 (usec)
100GB 16K randwrite 19.9k 327MB/s 3205.88 (usec)
100GB 16K randread 37.0k 622MB/s 1681.16 (usec)
100GB 32K randwrite 11.3k 372MB/s 5639.3 (usec)
100GB 32K randread 25.3k 828MB/s 2529.64 (usec)
100GB 64K randwrite 6214 407MB/s 10293.83 (usec)
100GB 64K randread 14.9k 974MB/s 4303.9 (usec)
100GB 128K randwrite 4658 611MB/s 13734.69 (usec)
100GB 128K randread 7814 1024MB/s 8185.45 (usec)
100GB 256K randwrite 2343 614MB/s 27309.35 (usec)
100GB 256K randread 3566 935MB/s 17940.26 (usec)
100GB 512K randwrite 1154 605MB/s 55.45 (msec)
100GB 512K randread 1652 866MB/s 38.72 (msec)
100GB 1M randwrite 548 575MB/s 116.76 (msec)
100GB 1M randread 845 886MB/s 75.7 (msec)

ceph

Virtual Disk Size I/O Block Size I/O Mode IOPS BW Latency
100GB 4K randwrite 21.5k 88.2MB/s 2965.98 (usec)
100GB 4K randread 46.7k 191MB/s 1364.4 (usec)
100GB 8K randwrite 19.8k 162MB/s 3222.98 (usec)
100GB 8K randread 45.2k 370MB/s 1411.31 (usec)
100GB 16K randwrite 19.6k 320MB/s 3268.26 (usec)
100GB 16K randread 39.8k 651MB/s 1605.32 (usec)
100GB 32K randwrite 17.4k 570MB/s 3675.78 (usec)
100GB 32K randread 34.1k 1118MB/s 1871.4 (usec)
100GB 64K randwrite 11.1k 727MB/s 5762.32 (usec)
100GB 64K randread 27.4k 1798MB/s 2328.34 (usec)
100GB 128K randwrite 5649 741MB/s 11322.44 (usec)
100GB 128K randread 18.8k 2469MB/s 3392.48 (usec)
100GB 256K randwrite 2846 746MB/s 22474.51 (usec)
100GB 256K randread 10.1k 2661MB/s 6300.06 (usec)
100GB 512K randwrite 1427 748MB/s 44.82 (msec)
100GB 512K randread 5120 2685MB/s 12493 (usec)
100GB 1M randwrite 708 743MB/s 90.35 (msec)
100GB 1M randread 2567 2692MB/s 24922.74 (usec)
vitalif commented 2 years ago

It seems I found a way to easily improve client throughput: io_uring async execution via worker threads. It makes latency slightly worse, but only slightly, i.e. it only adds 5-10us (a context switch). I'll do some more tests and then I think I'll add options to enable it :-)

lnsyyj commented 2 years ago

That's great. I'm dealing with some other things at the moment, but I believe I will be back soon.

vitalif commented 2 years ago

I finally found an idiotic problem which was more than halving the send speed. :-) I was testing linear reads from a non-existing inode (basically noop reads without actual disk I/O) and I discovered that the speed was only 400-500 MB/s instead of 1200 MB/s (10 Gbit/s) with the real OSD, while with stub_uring_osd it was 1200 MB/s. I dug into the problem for about a week :-) And finally... the problem was listen(listen_fd, <uninitialized variable>) in the real OSD compared to listen(listen_fd, 128) in stub_uring_osd :D And this thing was slowing down socket sends. I don't know HOW. :-) So... I fixed it in master, can you please retest )))
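
One way to sanity-check a daemon's listen() backlog from the outside (assuming ss from iproute2 is installed and you run it as root): for sockets in LISTEN state, the Send-Q column of ss shows the configured backlog.

ss -ltnp | grep vitastor-osd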

lnsyyj commented 2 years ago

Great, cool! This is already great! I will find time to test it.