scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

iotune: Random IO buffer size is not always correct #1698

Open xemul opened 1 year ago

xemul commented 1 year ago

From https://github.com/scylladb/scylladb/issues/13477 : doing the random writes test with the auto-selected 512 bytes (or even 1k) makes the i4i instance's drive do read-modify-write, thus dropping the resulting IOPS rate. Need to make the random-writes block size larger ... somehow

avikivity commented 1 year ago

There's the physical block size in /sys/block/nvme0n1/queue/physical_block_size.

Note: we still want to write the commitlog with logical block size, there's hope it avoids RMW since it's a stream.

mykaul commented 1 year ago

> There's the physical block size in /sys/block/nvme0n1/queue/physical_block_size.
>
> Note: we still want to write the commitlog with logical block size, there's hope it avoids RMW since it's a stream.

Not optimal_io_size? (assuming it's not 0, which it often is...)

mykaul commented 1 year ago

Finally found what I was looking for. From https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-optimized-instances.html#storage-instances-diskperf :

This decrease in performance is even larger if the write operations are not in multiples of 4,096 bytes or not aligned to a 4,096-byte boundary. If you write a smaller amount of bytes or bytes that are not aligned, the SSD controller must read the surrounding data and store the result in a new location. This pattern results in significantly increased write amplification, increased latency, and dramatically reduced I/O performance.
mykaul commented 1 year ago

Example from GCP, where sector size is 4096:

        "nvme0n1": {
            "holders": [],
            "host": "Non-Volatile memory controller: Google, Inc. Device 001f (rev 01)",
            "links": {
                "ids": [
                    "google-local-nvme-ssd-0",
                    "nvme-nvme.1ae0-6e766d655f63617264-6e766d655f63617264-00000001"
                ],
                "labels": [],
                "masters": [
                    "md0"
                ],
                "uuids": []
            },
            "model": "nvme_card",
            "partitions": {},
            "removable": "0",
            "rotational": "0",
            "sas_address": null,
            "sas_device_handle": null,
            "scheduler_mode": "none",
            "sectors": "786432000",
            "sectorsize": "4096",   <---- THIS
            "size": "375.00 GB",
            "support_discard": "4096",
            "vendor": null,
mykaul commented 6 months ago

Reviving this issue, in the hope it'll make it to Scylla 6.0. Ping @xemul , @avikivity

xemul commented 4 months ago

Collected from i4i.4xl by @pwrobelse: [image attachment]

pwrobelse commented 4 months ago

Hello, please find some experiments with iotune related to the buffer size of random IO. I used the i4i.4xlarge instance type and an AMI with ScyllaDB 5.4.6. The measurements are a follow-up to issue #13477.

To prepare the machine the following steps were performed:

Step 1: stop scylla-server and verify that it is not running.

scyllaadm@ip-172-31-46-190:~$ sudo systemctl stop scylla-server 
scyllaadm@ip-172-31-46-190:~$ ps aux | grep scylla  
scylla       534  0.0  0.0 1240236 11700 ?       Ssl  10:41   0:00 /opt/scylladb/node_exporter/node_exporter --collector.interrupts
root        4015  0.0  0.0  16928 10968 ?        Ss   10:45   0:00 sshd: scyllaadm [priv]
scyllaa+    4032  0.0  0.0  16932  9792 ?        Ss   10:45   0:00 /lib/systemd/systemd --user
scyllaa+    4033  0.0  0.0 169820  4396 ?        S    10:45   0:00 (sd-pam)
scyllaa+    4048  0.0  0.0  17200  7932 ?        S    10:45   0:00 sshd: scyllaadm@pts/0
scyllaa+    4049  0.0  0.0   5048  4108 pts/0    Ss   10:45   0:00 -bash
scyllaa+    4186  0.0  0.0   7484  3320 pts/0    R+   10:46   0:00 ps aux
scyllaa+    4187  0.0  0.0   4024  2000 pts/0    S+   10:46   0:00 grep --color=auto scylla

Step 2: list available disks:

scyllaadm@ip-172-31-46-190:~$ lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1      259:0    0   30G  0 disk 
├─nvme0n1p1  259:1    0 29.9G  0 part /
├─nvme0n1p14 259:2    0    4M  0 part 
└─nvme0n1p15 259:3    0  106M  0 part /boot/efi
nvme1n1      259:4    0  3.4T  0 disk /var/lib/systemd/coredump
                                      /var/lib/scylla

Step 3: run perftune.py.

scyllaadm@ip-172-31-46-190:~$ sudo /opt/scylladb/scripts/perftune.py --nic eth0 --tune-clock --dir /var/lib/scylla --tune disks --tune net --tune system --dev nvme1n1
irqbalance is not running
No non-NVMe disks to tune
Setting NVMe disks: nvme1n1...
Setting mask 00000001 in /proc/irq/24/smp_affinity
Writing 'none' to /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1/queue/nomerges
Setting a physical interface eth0...
Executing: ethtool -L eth0 rx 2
Executing: ethtool -L eth0 combined 2
Distributing IRQs handling Rx and Tx for first 2 channels:
Setting mask 00000001 in /proc/irq/45/smp_affinity
Setting mask 00000100 in /proc/irq/46/smp_affinity
Distributing the rest of IRQs
Setting mask 0000ffff in /sys/class/net/eth0/queues/rx-1/rps_cpus
Setting mask 0000ffff in /sys/class/net/eth0/queues/rx-0/rps_cpus
Setting net.core.rps_sock_flow_entries to 32768
Setting limit 16384 in /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
Setting limit 16384 in /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
Trying to enable ntuple filtering HW offload for eth0...not supported
Setting mask 00000f0f in /sys/class/net/eth0/queues/tx-0/xps_cpus
Setting mask 0000f0f0 in /sys/class/net/eth0/queues/tx-1/xps_cpus
Writing '4096' to /proc/sys/net/core/somaxconn
Writing '4096' to /proc/sys/net/ipv4/tcp_max_syn_backlog
Setting clocksource to tsc

Step 4: run iotune provided by the AMI - low random write IOPS is visible (actual: 91k, expected: 200k).

scyllaadm@ip-172-31-46-190:~$ sudo iotune --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 10:49:02,609 seastar - Reactor backend: linux-aio
INFO  2024-05-09 10:49:02,798 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 10:49:02,799 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 1685 MB/s (deviation 5%)
Measuring sequential read bandwidth: 2960 MB/s (deviation 4%)
Measuring random write IOPS: 91700 IOPS
Measuring random read IOPS: 313992 IOPS
Writing result to /tmp/io_properties.yaml

Step 5: check the block and IO sizes reported by /sys/block/nvme1n1/queue - 4KB is not present:

scyllaadm@ip-172-31-46-190:~$ cat /sys/block/nvme1n1/queue/physical_block_size
512
scyllaadm@ip-172-31-46-190:~$ cat /sys/block/nvme1n1/queue/logical_block_size
512
scyllaadm@ip-172-31-46-190:~$ cat /sys/block/nvme1n1/queue/minimum_io_size
512
scyllaadm@ip-172-31-46-190:~$ cat /sys/block/nvme1n1/queue/optimal_io_size
0

Step 6: build iotune from the latest master and run it with the same command - low random write IOPS is visible (actual: 91k, expected: 200k).

scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo ./build/release/apps/iotune/iotune --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:01:00,909 seastar - Reactor backend: io_uring
INFO  2024-05-09 11:01:01,178 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:01:01,179 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
INFO  2024-05-09 11:01:01,180 [shard  0:main] iotune - Filesystem parameters: read alignment 512, write alignment 1024
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 1683 MB/s (deviation 5%)
Measuring sequential read bandwidth: 2946 MB/s (deviation 4%)
Measuring random write IOPS: 91662 IOPS
Measuring random read IOPS: 314742 IOPS
Writing result to /tmp/io_properties.yaml

Step 7: apply patch from PR#2204. Force buffer size to 4096 for random IO with the new parameter.

Interestingly, the behavior was not consistent. Depending on the machine, I saw different results despite performing exactly the same steps. Random write IOPS was either better (140k instead of 91k, still below the expected 200k) or much worse (47k vs 91k). In total I used 6-7 instances, and the increase/decrease in IOPS seemed to occur randomly - the result was either 47k or 140k.

Machine 1 - degradation:

scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo ./build/release/apps/iotune/iotune --random-io-buffer-size 4096 --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:05:48,858 seastar - Reactor backend: io_uring
INFO  2024-05-09 11:05:49,070 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:05:49,071 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
INFO  2024-05-09 11:05:49,071 [shard  0:main] iotune - Forcing buffer_size=4096 for random IO!
INFO  2024-05-09 11:05:49,072 [shard  0:main] iotune - Filesystem parameters: read alignment 512, write alignment 1024
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 1672 MB/s (deviation 6%)
Measuring sequential read bandwidth: 2945 MB/s (deviation 4%)
Measuring random write IOPS: 47711 IOPS
Measuring random read IOPS: 261883 IOPS
Writing result to /tmp/io_properties.yaml

Machine 2 - improvement:

scyllaadm@ip-172-31-44-9:~/repo/seastar$ sudo ./build/release/apps/iotune/iotune --random-io-buffer-size 4096 --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:56:00,265 seastar - Reactor backend: io_uring
INFO  2024-05-09 11:56:00,473 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:56:00,474 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
INFO  2024-05-09 11:56:00,474 [shard  0:main] iotune - Forcing buffer_size=4096 for random IO!
INFO  2024-05-09 11:56:00,475 [shard  0:main] iotune - Filesystem parameters: read alignment 512, write alignment 1024
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 1870 MB/s (deviation 15%)
Measuring sequential read bandwidth: 2946 MB/s (deviation 4%)
Measuring random write IOPS: 140260 IOPS
Measuring random read IOPS: 275969 IOPS
Writing result to /tmp/io_properties.yaml

Step 8: re-create XFS with BS=4096 instead of BS=1024.

When scylla_raid_setup runs mkfs.xfs, it uses block_size = max(1024, sector_size). The code can be found here. Therefore, xfs_info returned a block size equal to 1024. I tried to create XFS with BS=4096 - that is the default block size for mkfs.xfs (source here).

scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo umount /var/lib/scylla
scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo umount /var/lib/systemd/coredump
scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo mkfs.xfs -f -b size=4096 /dev/nvme1n1 -K
meta-data=/dev/nvme1n1           isize=512    agcount=4, agsize=228881836 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=915527343, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=447034, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo mount /dev/nvme1n1 /var/lib/scylla
scyllaadm@ip-172-31-46-190:~/repo/seastar$ lsblk
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1      259:0    0   30G  0 disk 
├─nvme0n1p1  259:1    0 29.9G  0 part /
├─nvme0n1p14 259:2    0    4M  0 part 
└─nvme0n1p15 259:3    0  106M  0 part /boot/efi
nvme1n1      259:4    0  3.4T  0 disk /var/lib/scylla

Step 9: rerun the test with BS=4096 - random write IOPS increased to 240k on both machines.

Machine 1:

scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo ./build/release/apps/iotune/iotune --random-io-buffer-size 4096 --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:16:52,469 seastar - Reactor backend: io_uring
INFO  2024-05-09 11:16:52,670 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:16:52,671 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
INFO  2024-05-09 11:16:52,671 [shard  0:main] iotune - Forcing buffer_size=4096 for random IO!
INFO  2024-05-09 11:16:52,672 [shard  0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2181 MB/s (deviation 3%)
Measuring sequential read bandwidth: 2950 MB/s (deviation 8%)
Measuring random write IOPS: 242899 IOPS (deviation 12%)
Measuring random read IOPS: 350294 IOPS
Writing result to /tmp/io_properties.yaml

Machine 2:

scyllaadm@ip-172-31-44-9:~/repo/seastar$ sudo ./build/release/apps/iotune/iotune --random-io-buffer-size 4096 --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:59:04,329 seastar - Reactor backend: io_uring
INFO  2024-05-09 11:59:04,533 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:59:04,534 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
INFO  2024-05-09 11:59:04,534 [shard  0:main] iotune - Forcing buffer_size=4096 for random IO!
INFO  2024-05-09 11:59:04,535 [shard  0:main] iotune - Filesystem parameters: read alignment 512, write alignment 4096
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2181 MB/s (deviation 3%)
Measuring sequential read bandwidth: 2945 MB/s (deviation 8%)
Measuring random write IOPS: 239202 IOPS (deviation 12%)
Measuring random read IOPS: 351342 IOPS
Writing result to /tmp/io_properties.yaml

Step 10: rerun iotune from ScyllaDB 5.4.6 AMI.

The actual block size used by iotune in this case was 4096, because it sets buffer_size = std::max(buffer_size, _file.disk_write_dma_alignment()); and the alignment returned by posix_file_impl uses the block size defined for XFS as the write alignment. The result was better - random write IOPS increased from 91k to 240k.

Machine 1:

scyllaadm@ip-172-31-46-190:~/repo/seastar$ sudo iotune --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 11:24:19,329 seastar - Reactor backend: linux-aio
INFO  2024-05-09 11:24:19,514 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 11:24:19,515 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2181 MB/s (deviation 3%)
Measuring sequential read bandwidth: 2969 MB/s (deviation 7%)
Measuring random write IOPS: 240261 IOPS (deviation 12%)
Measuring random read IOPS: 312452 IOPS
Writing result to /tmp/io_properties.yaml

Machine 2:

scyllaadm@ip-172-31-44-9:~/repo/seastar$ sudo iotune --evaluation-directory /var/lib/scylla --properties-file /tmp/io_properties.yaml
INFO  2024-05-09 12:03:04,121 seastar - Reactor backend: linux-aio
INFO  2024-05-09 12:03:04,305 [shard  0:main] iotune - /var/lib/scylla passed sanity checks
INFO  2024-05-09 12:03:04,306 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2181 MB/s (deviation 3%)
Measuring sequential read bandwidth: 2969 MB/s (deviation 7%)
Measuring random write IOPS: 240272 IOPS (deviation 12%)
Measuring random read IOPS: 313831 IOPS
Writing result to /tmp/io_properties.yaml

Summary of the experiments

Despite performing exactly the same steps on different machines of the same instance type (i4i.4xlarge), the results from iotune were inconsistent when the block size was changed only at the iotune level. Using iotune --random-io-buffer-size 4096 either decreased random write IOPS to 47k or increased it to 140k. The outcome appeared random.

When mkfs.xfs was run with block size 4096, then on both machines I saw 240k random write IOPS.

The table below summarizes the obtained results:

| iotune binary and backend | XFS_block_size | iotune_block_size | random_write_IOPS |
|---|---|---|---|
| 5.4.6 AMI + linux-aio | 1024 | 1024 | 91k |
| master + io_uring | 1024 | 1024 | 91k |
| master + io_uring | 1024 | 4096 | Inconsistent - either 47k or 140k |
| 5.4.6 AMI + linux-aio | 4096 | 4096 | 240k (note: I am not sure if re-running mkfs.xfs with -f -K corrupts the results) |
| master + io_uring | 4096 | 4096 | 240k (note: I am not sure if re-running mkfs.xfs with -f -K corrupts the results) |

The values exposed by /sys/block/nvme1n1/queue did not contain 4096. In the case of mkfs.xfs, 4096 is a hard-coded default value.

mykaul commented 4 months ago
  1. Please add the Reactor backend that you've used - in some cases it was io_uring, in some cases linux-aio.
  2. It would be interesting (not now, but in the very near term future) to re-try with Ubuntu LTS 24.04, which is using kernel 6.8.
pwrobelse commented 4 months ago
>   1. Please add the Reactor backend that you've used - in some cases it was io_uring, in some cases linux-aio.
>
>   2. It would be interesting (not now, but in the very near term future) to re-try with Ubuntu LTS 24.04, which is using kernel 6.8.

Regarding the first question - it seems that iotune built from master and run in seastar's repo used io_uring by default, whereas iotune from the ScyllaDB 5.4.6 AMI used linux-aio by default. I updated the summary at the end of the first comment.

The results from both backends looked similar - see the statements below.

  1. The results obtained via iotune from the AMI (backend=linux-aio) with XFS_BS=1024 and iotune_BS=1024 had the same random write IOPS as the results obtained via iotune from the latest master (backend=io_uring) with XFS_BS=1024 and iotune_BS=1024. Please compare the logs from step 4 and step 6.
  2. The results obtained via iotune from the AMI (backend=linux-aio) with XFS_BS=4096 and iotune_BS=4096 had the same random write IOPS as the results obtained via iotune from the latest master (backend=io_uring) with XFS_BS=4096 and iotune_BS=4096. Please compare the logs from step 9 and step 10.

During the next experiments I will specify the backend explicitly.

pwrobelse commented 4 months ago

Hello, please find the results of another experiment related to specifying a different XFS block size during scylla_setup. The goal was to check whether the results seen after re-creating XFS with a greater block size in the previous experiments could be reproduced.

Test scenario

  1. Create i4i.4xlarge instance with Ubuntu 22.04.
  2. Install ScyllaDB from a package according to the official tutorial.
  3. Optional: tweak block size value passed to XFS in /opt/scylladb/scripts/libexec/scylla_raid_setup.
  4. Call scylla_setup, configure RAID0 and XFS and inspect results from iotune.

Used scylla package:

ubuntu@ip-172-31-41-145:~$ scylla --version
5.4.6-0.20240418.10f137e367e3

Execution 1: XFS block size equals 1024 (default value used by scylla_raid_setup)

The first scenario serves as a control sample. The problem with low random write IOPS was reproduced (91k IOPS). Note: after alignment, iotune used BS=1024.

Do you want IOTune to study your disks IO profile and adapt Scylla to it? (*WARNING* Saying NO here means the node will not boot in production mode unless you configure the I/O Subsystem manually!)
Yes - let iotune study my disk(s). Note that this action will take a few minutes. No - skip this step.
[YES/no]
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
INFO  2024-05-09 13:58:44,485 seastar - Reactor backend: linux-aio
INFO  2024-05-09 13:58:44,710 [shard  0:main] iotune - /var/lib/scylla/saved_caches passed sanity checks
INFO  2024-05-09 13:58:44,710 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2180 MB/s
Measuring sequential read bandwidth: 2971 MB/s (deviation 7%)
Measuring random write IOPS: 91687 IOPS
Measuring random read IOPS: 313118 IOPS
Writing result to /etc/scylla.d/io_properties.yaml
Writing result to /etc/scylla.d/io.conf

Execution 2: XFS block size equals 4096

The block size value passed to XFS was changed to 4096 in /opt/scylladb/scripts/libexec/scylla_raid_setup, then scylla_setup was run. The problem did not occur - iotune from ScyllaDB 5.4.6 showed 240k random write IOPS. Note: after alignment, iotune used BS=4096.

Do you want IOTune to study your disks IO profile and adapt Scylla to it? (*WARNING* Saying NO here means the node will not boot in production mode unless you configure the I/O Subsystem manually!)
Yes - let iotune study my disk(s). Note that this action will take a few minutes. No - skip this step.
[YES/no]
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning: /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1/queue/nomerges 2
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
tuning /sys/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
INFO  2024-05-09 14:09:05,996 seastar - Reactor backend: linux-aio
INFO  2024-05-09 14:09:06,217 [shard  0:main] iotune - /var/lib/scylla/saved_caches passed sanity checks
INFO  2024-05-09 14:09:06,218 [shard  0:main] iotune - Disk parameters: max_iodepth=127 disks_per_array=1 minimum_io_size=512
Starting Evaluation. This may take a while...
Measuring sequential write bandwidth: 2179 MB/s
Measuring sequential read bandwidth: 2971 MB/s (deviation 7%)
Measuring random write IOPS: 239977 IOPS (deviation 11%)
Measuring random read IOPS: 311838 IOPS
Writing result to /etc/scylla.d/io_properties.yaml
Writing result to /etc/scylla.d/io.conf

Summary

Given the previous experiments and the current one, it appears that the random write IOPS measured by iotune is correct on i4i.4xlarge when both XFS_BS=4096 and iotune_BS=4096.

mykaul commented 4 months ago

There was a reason to use 1K, something with the RAID stripes... @xemul ?

xemul commented 4 months ago

> There was a reason to use 1K, something with the RAID stripes... @xemul ?

No, it's just due to the way the commitlog works. It needs to write aligned buffers, so with a 4k minimum IO size, segments may grow too fast.