redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

iotune reports lower-than-expected IOPS on some large instances #17261

Open travisdowns opened 6 months ago

travisdowns commented 6 months ago

Version & Environment

Redpanda version: 23.3

What went wrong?

When running rpk iotune on large instance types, some results (especially IOPS) are often significantly lower than the vendor advertised numbers.

See i3en for example:

        Instance  Disks    Read IOPS          Read BW   Write IOPS         Write BW
      i3en.large    n/a        42705        328001088        32485        162821712
     i3en.xlarge    n/a        85373        659501824        65265        326548864
    i3en.2xlarge    n/a       170723       1318909056       130508        653094592
    i3en.3xlarge    n/a       242725       2065906688       201103       1012843968
    i3en.6xlarge    n/a       485579       4128679424       402086       2025674368
   i3en.12xlarge    n/a       550798       6819085312       496401       4051611392
   i3en.24xlarge    n/a      1086137       8334144000      1005340       8104002048

Up to and including 6xlarge, the values track the advertised IOPS closely. However, 12xlarge reports 550k IOPS versus the advertised 1000k, and 24xlarge reports ~1000k versus the advertised 2000k.

I believe this is measurement/tuning error, not a fundamental hardware limitation.

This applies to other instance types as well, see https://github.com/redpanda-data/redpanda/pull/17220 for details.

What should have happened instead?

iotune should produce results that reflect the hardware's capabilities.

How to reproduce the issue?

  1. Run rpk iotune on an i3en.12xlarge instance and observe the output (example invocation below).
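
For concreteness, an invocation along these lines should do it (the --directories value is an assumption; point it at the data directory on the RAID array):

rpk iotune --directories /var/lib/redpanda/data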

JIRA Link: CORE-1915

travisdowns commented 6 months ago

Perhaps it is not, in fact, entirely an iotune problem.

I investigated a bit more on i3en.12xlarge, which had the following iotune results (as in the OP):

        Instance  Disks    Read IOPS          Read BW   Write IOPS         Write BW
    i3en.6xlarge    n/a       485579       4128679424       402086       2025674368
   i3en.12xlarge    n/a       550798       6819085312       496401       4051611392

(6xlarge is also shown for reference: one would expect the 12xlarge numbers all to be double the 6xlarge numbers)

I focused only on read IOPS. Using fio, I was only able to get about 750k-760k read IOPS regardless of what configuration I tried. This is still better than the 550k reported by iotune, but much less than the 1000k we expect. This is with the usual md configuration: RAID0 across the 4 drives.

I then tried creating an array of only the first two drives, which should presumably give results identical to 6xlarge: ~500k read IOPS. However, it did not: it gave only half the 4-drive output, i.e., the same performance shortfall was still evident here, in the same proportion as in the 4-drive case.

The next odd thing is that this effect seems to depend on which drives are paired up in the RAID. If drives 1 and 2 are paired, or 3 and 4 are paired, they are slow as described (~350k IOPS), but any other pairing results in a fast configuration (~475k IOPS, just slightly shy of the theoretical value), as shown here based on testing each combination:

$ grep IOPS *
nvme1n1,nvme2n1:  read: IOPS=376k, BW=1468MiB/s (1539MB/s)(86.0GiB/60003msec)
nvme1n1,nvme3n1:  read: IOPS=476k, BW=1858MiB/s (1949MB/s)(109GiB/60005msec)
nvme1n1,nvme4n1:  read: IOPS=475k, BW=1857MiB/s (1947MB/s)(109GiB/60004msec)
nvme2n1,nvme3n1:  read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60004msec)
nvme2n1,nvme4n1:  read: IOPS=476k, BW=1858MiB/s (1948MB/s)(109GiB/60005msec)
nvme3n1,nvme4n1:  read: IOPS=377k, BW=1472MiB/s (1543MB/s)(86.2GiB/60003msec)

This behavior was consistent and quite stable from run to run and confirmed on 2 different machines.

Finally, I confirmed that this doesn't have anything to do with md: the same effect is present if you format each drive individually without using md at all and then run separate benchmarks concurrently on two drives. The aggregate performance is as above (and each drive splits the IOPS equally). If you run just one test on a single drive you get very close to 250k IOPS, i.e., the advertised performance.
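
A minimal sketch of that per-drive (no md) variant, assuming the fio job file in the next comment is saved as randread.fio (device names and mount points are illustrative):

# format and mount each drive separately, no md involved
for d in nvme1n1 nvme2n1; do
    sudo mkfs.xfs -f /dev/$d
    sudo mkdir -p /mnt/$d
    sudo mount /dev/$d /mnt/$d
    sudo chmod a+w /mnt/$d
done

# run one fio instance per drive concurrently, then compare aggregate IOPS
for d in nvme1n1 nvme2n1; do
    fio --directory=/mnt/$d --output=$d.log randread.fio &
done
wait
grep IOPS ./*.log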

travisdowns commented 6 months ago

fio configuration:

[file1]
name=fio-seq-write
rw=randread
bs=4K
direct=1
numjobs=8

time_based
runtime=5m
size=10GB
ioengine=libaio
iodepth=128

This isn't a special config: performance is similar under many variations of the above parameters. The main requirements are enough total queue depth (numjobs * iodepth; the above has 8k, but even 4k is probably enough) and enough jobs to avoid saturating a single CPU (numjobs=1 doesn't cut it at this level because it will saturate one core, but 4 is generally enough).

I didn't notice any change in performance with increased file size, runtime, or iodepth. rw=read (sequential reads) and rw=randread perform similarly as long as merging is disabled at the block layer (otherwise sequential reads may be merged, resulting in many fewer actual IOs and an inflated IOPS figure).
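
For reference, disabling merges is just a sysfs toggle on each member device (device names are examples; the value 2 disables all merging for that queue):

for d in nvme1n1 nvme2n1 nvme3n1 nvme4n1; do
    echo 2 | sudo tee /sys/block/$d/queue/nomerges
done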

travisdowns commented 6 months ago

Script to rebuild and remount arrays, used to test every combination of array members:

#!/bin/bash
# Tear down any existing md array, then build a RAID0 array from $DEVICES,
# format it with XFS, and mount it at $MOUNT.
set -euo pipefail

echo "MD=${MD:=md0}"
echo "DEVICES=${DEVICES:=nvme1n1 nvme2n1}"
echo "MOUNT=${MOUNT:=/mnt/xfs}"
IFS=', ' read -r -a DA <<< "$DEVICES"
echo "DEVICE_COUNT=${#DA[@]}"

MDD=/dev/$MD

# clean up any previous mounts and arrays
sudo umount /mnt/xfs* || true
sudo mdadm --stop /dev/md* || true

# set -x
sudo mdadm --create --run --verbose $MDD --level=0 --raid-devices=${#DA[@]} \
    $(for d in "${DA[@]}"; do echo -n "/dev/$d "; done)

sudo mkfs.xfs -f $MDD
sudo mkdir -p $MOUNT
sudo mount $MDD $MOUNT
sudo chmod a+w $MOUNT

echo "OK - mounted at: $MOUNT"

cat /proc/mdstat
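
A driver loop along these lines reproduces the per-pair results above (assuming the script is saved as rebuild-array.sh and the fio job file from the previous comment as randread.fio; output files are named after the pair, matching the grep output earlier):

for pair in "nvme1n1 nvme2n1" "nvme1n1 nvme3n1" "nvme1n1 nvme4n1" \
            "nvme2n1 nvme3n1" "nvme2n1 nvme4n1" "nvme3n1 nvme4n1"; do
    DEVICES="$pair" ./rebuild-array.sh
    fio --directory=/mnt/xfs --output="${pair// /,}" randread.fio
done
grep IOPS nvme*
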
travisdowns commented 6 months ago

Miscellaneous note:

github-actions[bot] commented 2 weeks ago

This issue hasn't seen activity in 3 months. If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.