scylladb / seastar

High performance server-side application framework
http://seastar.io

iotune gets confused if disks get overloaded #593

Open glommer opened 5 years ago

glommer commented 5 years ago

Saw this today:

During Scylla operation the SSDs are reasonably fast and show low latency:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00 10080.50    0.00   384.19     0.00    78.05     2.12    0.21    0.21    0.00   0.08  79.15
sdc               0.00     0.00 10226.50    0.00   389.38     0.00    77.98    47.07    4.61    4.61    0.00   0.10 100.05
sdd               0.00     0.00 10318.00    0.00   389.51     0.00    77.31     2.17    0.21    0.21    0.00   0.08  79.15
sde               0.00     0.00 10180.00    0.00   384.94     0.00    77.44    42.21    4.14    4.14    0.00   0.10 100.00
sdf               0.00     0.00 10247.00    2.00   388.56     0.06    77.66     2.06    0.20    0.20    0.00   0.07  76.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00 49366.50    2.00  1936.64     0.06    80.34     0.00    0.00    0.00    0.00   0.00   0.00

However, iotune produces really bad IOPS results (bandwidth is fine):

Measuring sequential write bandwidth: 5060 MB/s
Measuring sequential read bandwidth: 5302 MB/s
Measuring random write IOPS: 1393 IOPS
Measuring random read IOPS: 5222 IOPS

As you can see, production does far more IOPS than that even at much larger request sizes: each of sdb-sdf sustains ~10,000 r/s at avgrq-sz ~78 sectors (~39 KiB per request), roughly 50,000 read IOPS across the array, yet iotune reports only 5222 random read IOPS.

The output of iostat during the run is enlightening:

Write bandwidth phase:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.40    0.00   73.97   24.63    0.00    0.00
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    5.50    0.00     0.16     0.00    59.64     3.95  772.82  772.82    0.00 181.73  99.95
sdb               0.00     0.00    0.00  291.50     0.00   145.25  1020.49   116.05  419.19    0.00  419.19   3.43  99.90
sdc               0.00     0.00    0.00  263.00     0.00   130.75  1018.16   108.04  412.17    0.00  412.17   3.80  99.90
sdd               0.00     0.00    0.00  290.00     0.00   142.20  1004.25   115.31  398.72    0.00  398.72   3.45  99.95
sde               0.00     0.00    0.00  273.50     0.00   136.55  1022.48   110.80  395.65    0.00  395.65   3.65  99.95
sdf               0.00     0.00    0.00  289.00     0.00   144.25  1022.23   128.68  435.37    0.00  435.37   3.46  99.95
dm-0              0.00     0.00    4.00    0.00     0.06     0.00    32.00     3.95 1062.62 1062.62    0.00 249.87  99.95
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 1408.50     0.00   700.27  1018.21     0.00    0.00    0.00    0.00   0.00   0.00

I suspect that we are firing a bunch of requests and, because we expect their latencies to be low, we don't wait for them to complete. If for some reason the latencies are not low, we taint the following phases (unconfirmed; see the sketch below for the kind of bounding I have in mind). I/O wait is also quite high, which could indicate we are blocking in the kernel by sending more requests than the disk is configured for in nr_requests (but how?).

The disks were trimmed before this run to rule that out.
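
Something along these lines is what I have in mind. This is only a sketch using plain Seastar primitives (semaphore, get_units, file::dma_read), not what iotune actually does today; the names and the 4 kB request size are just for illustration. The point is to never keep more than max_in_flight requests outstanding (which could be tied to the device's /sys/block/<dev>/queue/nr_requests) and to reacquire every semaphore unit before a phase is declared finished, so a slow tail of requests cannot leak into the next phase:

#include <seastar/core/file.hh>
#include <seastar/core/future-util.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/shared_ptr.hh>
#include <chrono>
#include <cstdlib>

// Sketch only: measure random read IOPS with a bounded number of in-flight
// requests, and drain them all before reporting the result.
seastar::future<uint64_t>
measure_random_read_iops(seastar::file f, uint64_t file_size,
                         std::chrono::seconds duration,
                         unsigned max_in_flight) {
    auto sem = seastar::make_lw_shared<seastar::semaphore>(max_in_flight);
    auto completed = seastar::make_lw_shared<uint64_t>(0);
    auto stop_at = std::chrono::steady_clock::now() + duration;
    static constexpr uint64_t req_size = 4096; // 4 kB random reads

    return seastar::do_until(
        [stop_at] { return std::chrono::steady_clock::now() >= stop_at; },
        [f, sem, completed, file_size] {
            // Blocks here once max_in_flight requests are outstanding.
            return seastar::get_units(*sem, 1).then([f, completed, file_size] (auto units) {
                // Simplified offset choice; good enough for a sketch.
                uint64_t pos = (std::rand() % (file_size / req_size)) * req_size;
                // Fire the read without waiting for it in the submission loop;
                // the semaphore unit is only released when the read completes.
                (void)f.dma_read<char>(pos, req_size).then(
                    [completed, units = std::move(units)] (auto) {
                        ++*completed;
                    });
            });
        }).then([sem, max_in_flight] {
            // Drain: wait until every outstanding request has completed, so
            // nothing bleeds into the following measurement phase.
            return sem->wait(max_in_flight);
        }).then([completed, duration] {
            return *completed / uint64_t(duration.count());
        });
}

With a drain step like this, an overloaded disk would show up as the phase taking longer to finish rather than as leftover requests silently tainting the next measurement.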

glommer commented 5 years ago

The read IOPS phase also shows a lot of iowait:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    6.50    0.00     0.08     0.00    25.85     0.00    0.46    0.46    0.00   0.31   0.20
sdb               0.00     0.00 2147.50    6.50  1073.75     0.20  1021.10   130.61   60.62   60.81    0.00   0.46 100.00
sdc               0.00     0.00 2120.50    0.00  1060.25     0.00  1024.00    53.07   25.12   25.12    0.00   0.47  98.80
sdd               0.00     0.00 2079.50    0.00  1039.75     0.00  1024.00    43.39   21.56   21.56    0.00   0.47  97.50
sde               0.00     0.00 2151.00    0.00  1075.50     0.00  1024.00   132.35   61.53   61.53    0.00   0.46 100.00
sdf               0.00     0.00 2144.00    0.00  1072.00     0.00  1024.00    86.82   40.83   40.83    0.00   0.47 100.00
dm-0              0.00     0.00    5.50    0.00     0.07     0.00    24.73     0.00    0.36    0.36    0.00   0.18   0.10
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    1.00    0.00     0.02     0.00    32.00     0.00    1.00    1.00    0.00   1.00   0.10
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00 10537.00    7.00  5268.50     0.22  1023.36     0.00    0.00    0.00    0.00   0.00   0.00
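
A back-of-the-envelope check with Little's law (avgqu-sz ≈ r/s × await) supports the over-queueing theory: for sde above, 2151 r/s × 61.5 ms ≈ 132, matching the reported avgqu-sz of 132.35, and in the write phase sdf gives 289 w/s × 435 ms ≈ 126 vs. 128.68 reported. So in both phases we are keeping on the order of 130 requests queued per disk, which is consistent with submitting requests without waiting for completions.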