travisdowns opened 6 months ago
Perhaps it is not in fact totally iotune.
I investigated a bit more on i3en.12xlarge, which had the following iotune
results (as in the OP):
Instance       Disks  Read IOPS  Read BW     Write IOPS  Write BW
i3en.6xlarge   n/a    485579     4128679424  402086      2025674368
i3en.12xlarge  n/a    550798     6819085312  496401      4051611392
(6xlarge also shown for reference: one would expect the 12xlarge numbers to all be double the 6x numbers)
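For a rough sense of the gap: doubling the 6xlarge figures gives an expectation of roughly 971k read IOPS and ~8.3 GB/s read bandwidth for 12xlarge, versus the ~551k IOPS and ~6.8 GB/s actually reported above.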
I focused only on read IOPS. Setting iotune aside, I was only able to get about 750k-760k read IOPS using fio, regardless of what configuration I tried. This is still better than the 550k reported by iotune, but much less than the 1000k we expect. This is using the usual md configuration of RAID0 across the 4 drives.
I then tried creating an array of only the first two drives; this would presumably give results identical to the 6xlarge, i.e., ~500k read IOPS. However, it did not: it gave only half the 4-drive output, so the same performance problem was still evident here, in the same proportion as the 4-drive case. The next odd thing is that this effect seems to depend on which drives are paired up in the RAID. If drives 1 and 2 are paired, or 3 and 4 are paired, the result is slow as described (~350k IOPS), but any other pairing results in a fast configuration (~475k IOPS, just slightly shy of the theoretical limit), as shown here by testing each combination:
$ grep IOPS *
nvme1n1,nvme2n1: read: IOPS=376k, BW=1468MiB/s (1539MB/s)(86.0GiB/60003msec)
nvme1n1,nvme3n1: read: IOPS=476k, BW=1858MiB/s (1949MB/s)(109GiB/60005msec)
nvme1n1,nvme4n1: read: IOPS=475k, BW=1857MiB/s (1947MB/s)(109GiB/60004msec)
nvme2n1,nvme3n1: read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60004msec)
nvme2n1,nvme4n1: read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60005msec)
nvme3n1,nvme4n1: read: IOPS=377k, BW=1472MiB/s (1543MB/s)(86.2GiB/60003msec)
This behavior was consistent and quite stable from run to run and confirmed on 2 different machines.
Finally, I confirmed that this doesn't have anything to do with md: the same effect is present if you format each drive individually without using md at all, then run separate benchmarks concurrently on two drives: the aggregate performance is as above (and each drive splits the IOPS equally). If you run just one test on a single drive, you get very close to 250k IOPS, i.e., the expected advertised performance.
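A rough sketch of that concurrent per-drive test (device names, mount points, and job parameters here are illustrative, not the exact commands used):

# Format and mount each drive individually; no md involved.
for dev in nvme1n1 nvme2n1; do
  sudo mkfs.xfs -f /dev/$dev
  sudo mkdir -p /mnt/$dev
  sudo mount /dev/$dev /mnt/$dev
  sudo chmod a+w /mnt/$dev
done
# Run the same random-read job on both drives at once and sum the IOPS.
for dev in nvme1n1 nvme2n1; do
  fio --name=randread-$dev --directory=/mnt/$dev --size=10G \
      --rw=randread --bs=4k --direct=1 --ioengine=libaio \
      --iodepth=128 --numjobs=4 --time_based --runtime=60 \
      --group_reporting --output=/tmp/fio-$dev.log &
done
wait
grep IOPS /tmp/fio-*.log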
fio configuration:
[file1]
name=fio-seq-write
rw=randread
bs=4K
direct=1
numjobs=8
time_based
runtime=5m
size=10GB
ioengine=libaio
iodepth=128
This isn't a special config: performance is similar under many variations of the above parameters. The main thing is that you need enough total queue depth (which is numjobs * iodepth; the above has 8k, but even 4k is probably enough), and you must have enough jobs to avoid saturating a single CPU (numjobs=1 doesn't cut it at this level because it will saturate a CPU, but 4 is generally enough).
I didn't notice any changes in performance with increased file size, runtime, or depth. read (sequential reads) and randread perform similarly as long as merging is disabled at the block layer (otherwise sequential reads may be merged, resulting in many fewer actual IOs and so an inflated IOPS figure).
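For reference, one way to turn off merging is the block layer's nomerges sysfs knob on each member device (a value of 2 disables all merge attempts; device names here are illustrative):

for dev in nvme1n1 nvme2n1 nvme3n1 nvme4n1; do
  # 0 = allow merges, 1 = only simple merges, 2 = no merges at all
  echo 2 | sudo tee /sys/block/$dev/queue/nomerges
done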
Script to rebuild and remount the array, used to test each combination of array members (an example driver loop follows the script):
#!/bin/bash
# Rebuild a RAID0 array from the drives listed in DEVICES, format it with XFS
# and mount it at MOUNT, so each drive combination can be benchmarked in turn.
set -euo pipefail

# Defaults; override via environment, e.g. DEVICES="nvme1n1 nvme3n1" ./rebuild-array.sh
echo "MD=${MD:=md0}"
echo "DEVICES=${DEVICES:=nvme1n1 nvme2n1}"
echo "MOUNT=${MOUNT:=/mnt/xfs}"

IFS=', ' read -r -a DA <<< "$DEVICES"
echo "DEVICE_COUNT=${#DA[@]}"
MDD=/dev/$MD

# Tear down any existing mount/array before recreating it.
sudo umount /mnt/xfs* || true
sudo mdadm --stop /dev/md* || true
# set -x
sudo mdadm --create --run --verbose $MDD --level=0 --raid-devices=${#DA[@]} $(for d in $DEVICES; do echo -n "/dev/$d "; done)
sudo mkfs.xfs -f $MDD
sudo mkdir -p $MOUNT
sudo mount $MDD $MOUNT
sudo chmod a+w $MOUNT
echo "OK - mounted at: $MOUNT"
cat /proc/mdstat
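For illustration, a driver loop over every pair might look like this (the script name rebuild-array.sh and job file name randread.fio are placeholders; the job file is the fio config above with a directory=/mnt/xfs line added):

for pair in "nvme1n1 nvme2n1" "nvme1n1 nvme3n1" "nvme1n1 nvme4n1" \
            "nvme2n1 nvme3n1" "nvme2n1 nvme4n1" "nvme3n1 nvme4n1"; do
  # Rebuild the 2-drive RAID0 array for this pair, then benchmark it,
  # keeping one output file per pair.
  DEVICES="$pair" ./rebuild-array.sh
  fio --output="${pair// /,}" randread.fio
done
grep IOPS *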
Miscellaneous notes:
- CPU use (as reported by fio) in the slower runs was significantly higher than in the faster runs, and since the faster runs are doing more IOPS, the difference is even larger in a CPU-per-IOP sense.
- I have the full fio results for the above runs; I will upload them "at some point" or "on request".
Version & Environment
Redpanda version: 23.3
What went wrong?
When running rpk iotune on large instance types, some results (especially IOPS) are often significantly lower than the vendor-advertised numbers. See i3en for example:
Up to and including 6xlarge, the values closely track the advertised IOPS. However, 12xlarge reports 550k IOPS versus the advertised value of 1000k, and 24xlarge reports ~1000k versus 2000k advertised.
I believe this is measurement/tuning error, not a fundamental hardware limitation.
This applies to other instance types as well, see https://github.com/redpanda-data/redpanda/pull/17220 for details.
What should have happened instead?
iotune produces results reflecting the hardware's capabilities.
How to reproduce the issue?
Run rpk iotune on an i3en.12xlarge instance and observe the output; a minimal sketch of the steps follows below.
JIRA Link: CORE-1915
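For completeness, a minimal reproduction sketch (the output path below is the assumed default and may differ by installation and rpk version):

sudo rpk iotune                      # run against the configured redpanda data directory
cat /etc/redpanda/io-config.yaml     # assumed default output file; compare the reported
                                     # read IOPS against the advertised 1000k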