oxidecomputer / crucible

A storage service.

Storage performance on London #1291

Closed: mkeeter closed this issue 2 months ago

mkeeter commented 3 months ago

Using iodriver and construct-additional-pylons, we've been doing performance benchmarking on the London mini-rack (2x switches, 4x Gimlets). Most recently, we did a group debugging session with 64x VMs, all doing 4K random writes to their own 64 GiB disk (structured as 4 KiB block size, 16384 blocks/extent, 1024 extents, encrypted).
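(iodriver and construct-additional-pylons are Oxide-internal tools. For readers without them, each guest's workload is roughly what the following fio invocation would produce; the device path and queue depth here are illustrative placeholders, not what the harness actually used.)

$ fio --name=randwrite --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
      --direct=1 --ioengine=libaio --iodepth=8 --time_based --runtime=60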

During the meeting, we observed high CPU load and mediocre IOPS (120-160). Here are my notes from right after the meeting:

  • With 64 VMs doing 4K random writes, we're seeing tremendously high CPU load (< 10% idle on some machines; see the flamegraph below)
  • Inside the VM, we're only seeing 120-160ish IOPS, which is not great
  • The flamegraph and some poking with lockstat shows that we've got a bunch of lock contention when feeding the vdev queue
  • There is tremendous write / IO amplification (no more than a cursory calculation would lead you to expect, but that's already a lot; see the worked arithmetic after these notes)
    • Each 4 KiB write requires reading + writing 2x 128 KiB ZFS records on 3 machines, which is 384x IO amplification.
    • This is basically a worst-case workload
  • Given that write amplification, numbers from iostat and nicstat all seem to be on the right order of magnitude

For more see the recording, transcript, and chat.
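Spelling out the amplification arithmetic from the notes:

  one 4 KiB guest write
    read  2 x 128 KiB ZFS records =  256 KiB
    write 2 x 128 KiB ZFS records =  256 KiB
    per-Downstairs total          =  512 KiB
    x 3 Downstairs                = 1536 KiB
    1536 KiB / 4 KiB              =  384x amplification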

The "flamegraph" discussed in these observations is shown below. It was captured using flamegraph -c'profile-997 { @[stack(100)] = count(); } -p1 (the PID is a dummy value, because we're profiling the kernel here).

flamegraph-24-05-02 (interactive)
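The -c argument is itself a D program: profile-997 fires on every CPU at 997 Hz, and stack(100) records up to 100 frames of kernel stack. As a rough standalone equivalent (the 30-second window here is arbitrary, not part of the original capture):

$ dtrace -n 'profile-997 { @[stack(100)] = count(); } tick-30s { exit(0); }'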

Further investigation outside the scope of the meeting revealed bimodal performance:

(screenshot, 2024-05-03: per-disk IOPS, showing the bimodal split)

(the X axis isn't particularly meaningful here)

We found that one of the sleds had much less IO throughput than the other three – specifically, every Upstairs with a Downstairs on BRM42220058 is slow.

IOPS as a function of Downstairs locations:

{('BRM42220036', 'BRM42220058', 'BRM44220004'): [122.16, 118.26, 96.36, 118.65, 92.33, 110.95, 94.21, 118.56, 117.33, 184.93, 114.36, 94.0, 119.93, 120.68, 91.19, 90.04, 110.16, 115.73, 122.11, 161.48, 90.13, 161.86, 187.51, 161.64],
 ('BRM42220034', 'BRM42220058', 'BRM44220004'): [123.65, 106.53, 95.27, 123.01, 137.78, 95.06, 161.69, 93.09, 122.48, 93.12, 121.03, 135.13, 112.39, 161.49],
 ('BRM42220034', 'BRM42220036', 'BRM44220004'): [570.57, 543.65, 644.43, 536.79, 576.31, 556.7, 606.11, 551.77, 544.76, 534.79, 619.22, 593.58, 572.27, 620.19, 548.27, 557.04, 587.71],
 ('BRM42220034', 'BRM42220036', 'BRM42220058'): [108.05, 133.88, 93.72, 160.39, 109.27, 120.37, 137.68, 92.33, 93.58]}

The split is stark: region sets that avoid BRM42220058 average roughly 575 IOPS, while every set that includes it averages roughly 120. This slow sled happens to be the one from which we took the flamegraph above.

Here's an example of a flamegraph from a non-slow sled:

flamegraph (interactive)

We can see the slower performance just by running iostat -x 1 2, looking at the second sample printed (the first sample reports averages since boot):

Fast Sled
                          extended device statistics
device                      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
blkdev0                  2485.6 7099.2 313938.7 1042271.2  0.0  2.1    0.2   0  92
blkdev1                  2422.2 6083.8 304464.4 1001983.9  0.0  2.0    0.2   0  88
blkdev2                  3024.6 10137.6 383215.6 1184491.2  0.0  2.9    0.2   0  94
blkdev3                  2775.9 7647.1 349211.4 1094838.3  0.0  2.5    0.2   0  90
blkdev4                  1202.7   85.2 127429.4 5880.7  0.0  0.4    0.3   0  32
blkdev5                   797.5   89.2 101188.2 5714.3  0.0  0.3    0.3   0  27
blkdev6                  1282.9 1289.9 156139.3 327475.7  0.0  0.8    0.3   0  50
blkdev7                   838.1 2420.2 102079.8 546523.2  0.0  0.7    0.2   0  47
blkdev8                   140.7    8.9 16695.0  582.5  0.0  0.0    0.3   0   5
blkdev9                     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
blkdev10                    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxi_c12bc4bb-c3ac-4bcb-8    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxi_a4fdbac3-654a-4630-a    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxp_5ce09d79-f262-4904-9 1202.7   85.2 127429.3 5880.7  0.0  0.5    0.4   0  40
oxp_30791221-753b-499f-a 2775.9 7647.1 349211.1 1094837.4 278.2  2.8   27.0  92  95
oxp_bda65ee7-9689-48cc-b 2422.2 6083.8 304464.2 1001983.3 191.1  2.4   22.7  90  95
oxp_68860521-2b30-42d0-b 3024.6 10137.6 383215.5 1184490.8 235.0  3.4   18.1  96  98
oxp_69319eff-2402-4ae5-8 1282.9 1289.9 156139.4 327475.8 56.0  0.9   22.1  25  56
oxp_84161869-9075-40cc-9  140.7    8.9 16695.0  582.5  0.0  0.1    0.4   0   6
oxp_4fb500f1-44a8-4d4a-9  837.1 2420.2 101953.1 546523.5 133.2  0.9   41.2  42  54
oxp_f4f43e5a-596f-417b-b 2485.6 7099.2 313939.2 1042272.8 305.7  2.4   32.2  96  97
oxp_039e60b6-e8f1-4671-b  797.5   89.2 101188.3 5714.3  0.0  0.4    0.4   0  35
ramdisk2                   38.6  161.5   26.7  875.8  0.0  0.0    0.1   0   3
rpool                      38.6  161.5   26.7  875.8  0.5  0.0    2.6   2   3

Slow Sled
                          extended device statistics
device                      r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
blkdev0                   822.1 1769.4 105135.9 305915.1  0.0  1.9    0.7   0  66
blkdev1                   802.3  923.4 102496.5 258668.0  0.0  1.7    1.0   0  57
blkdev2                   859.9  995.9 109571.0 279176.2  0.0  2.5    1.3   0  60
blkdev3                   748.7  320.7 95733.5 109337.1  0.0  0.8    0.7   0  44
blkdev4                   446.8 1206.5 57071.7 258711.4  0.0  0.8    0.5   0  38
blkdev5                   804.3 1225.3 102851.8 242777.1  0.0  1.5    0.7   0  61
blkdev6                   688.1  829.1 88080.8 250038.8  0.0  1.4    0.9   0  51
blkdev7                   967.2 2361.3 123597.7 367472.6  0.0  3.8    1.1   0  77
blkdev8                   272.1 1033.7 34007.7 275350.0  0.0  0.6    0.4   0  31
blkdev9                     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
blkdev10                  683.2  789.4 87200.3 96073.6  0.0  0.6    0.4   0  44
blkdev11                    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxi_8381084c-70da-499b-9    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxi_cf7bd21f-e691-4bff-8    0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
oxp_641dd8ec-4282-4119-9  273.1 1034.7 34135.7 275556.3 152.4  1.0  117.3  96  55
oxp_7cdbf132-7ef9-4218-8  688.2  829.2 88084.3 250048.9 252.5  2.3  167.9 100  78
oxp_3f8b57a7-4144-4923-a  858.0  995.0 109323.4 278319.3 288.4  4.6  158.1  96  99
oxp_a207d42f-ae2f-4213-9  968.2 2363.4 123730.7 367744.4 385.8  5.9  117.6 100 100
oxp_c0d351f6-c580-4b44-9  820.2 1770.6 104891.7 306071.4 390.1  3.1  151.8 100  96
oxp_203178bc-43aa-43b3-b  800.4  924.5 102250.7 259706.0 385.1  3.2  225.1 100  94
oxp_304dcff2-c2d5-4cb4-8  683.2  789.5 87204.6 96078.4 34.9  1.2   24.5  39  71
oxp_0b88d5bd-388b-44cd-9  446.9 1207.5 57076.2 258735.6 148.7  1.4   90.7  96  66
oxp_26604b43-ffc2-47b3-9  749.8  320.8 95869.6 109347.4 130.9  1.4  123.7  42  71
oxp_b9090814-8b0e-465c-8  804.4 1225.4 102861.4 242799.7 254.3  2.5  126.5 100  88
ramdisk2                    0.0  110.2    0.0  483.6  0.0  0.0    0.2   0   2
rpool                       0.0  110.2    0.0  483.6  0.3  0.0    3.3   2   3

There's not a dramatically mismatched distribution of VMs per sled or regions per disk:

VMs per sled
BRM42220058: 14
BRM42220034: 18
BRM42220036: 16
BRM44220004: 16
disk 52bdbb74-b444-4828-9e62-c135a9fc5fab on sled BRM42220034 has 1 regions
disk d483c80a-a691-4bbb-973e-6a289c026e99 on sled BRM42220034 has 2 regions
disk 221b20fe-ae03-436c-a524-2b73d9f01f6f on sled BRM42220034 has 3 regions
disk dd13e616-bede-4d55-a8ff-f0e8086b7a83 on sled BRM42220034 has 4 regions
disk 92d3cbdc-1c4b-4c65-afe3-452d67ef55df on sled BRM42220034 has 5 regions
disk 2cda01c3-4a23-4b47-bb93-a659d41be9a9 on sled BRM42220034 has 6 regions
disk 608194ea-2a49-4384-a171-8a7b8d7585ac on sled BRM42220034 has 6 regions
disk d00898da-85c7-4154-8d5c-4a903e30e6f2 on sled BRM42220034 has 6 regions
disk e851c9e6-92e3-4a9f-85f9-fbc1fe4a3726 on sled BRM42220034 has 7 regions

disk 462b1a93-9dd5-4ef6-98aa-1a6c48932ed1 on sled BRM42220036 has 1 regions
disk eadc7272-f263-4fc9-b332-dc18fab6dc15 on sled BRM42220036 has 3 regions
disk 2b1294a8-92b7-4d37-935a-f0209cac7f1f on sled BRM42220036 has 4 regions
disk c3d1be4d-4e23-405c-9a26-2a01633231a6 on sled BRM42220036 has 4 regions
disk e36db697-c75f-439b-b8c5-acf8727edeaa on sled BRM42220036 has 4 regions
disk 5c377612-b0ab-47f0-9981-ec48019597f0 on sled BRM42220036 has 6 regions
disk 7330364e-1e81-4edf-89d1-9841e41c1c0b on sled BRM42220036 has 6 regions
disk 8caf6be7-86e8-4f83-8c76-9983a695ee24 on sled BRM42220036 has 7 regions
disk c6636c43-e130-409a-ba6e-08eb94af3bd3 on sled BRM42220036 has 7 regions
disk 68cdba16-1cee-4cd9-af12-e218bf08750f on sled BRM42220036 has 8 regions

disk eb9f06a4-be0e-4db5-a00b-e8f0a502b4fd on sled BRM42220058 has 2 regions
disk 08d8e5e0-0bb4-45d3-a49a-533e4349ca4c on sled BRM42220058 has 3 regions
disk ce9a2eb7-bfbe-4ad4-9bd2-ae83dea3592d on sled BRM42220058 has 3 regions
disk 14f39df3-2773-4146-a900-56ff19d30eb6 on sled BRM42220058 has 4 regions
disk 2142f782-d8cc-42ac-9f5d-cc54056b25e3 on sled BRM42220058 has 5 regions
disk 77993721-09db-472c-9fc4-9da21ccc4826 on sled BRM42220058 has 5 regions
disk be38e560-6e6b-4be3-ac6c-7ea8e8e8c848 on sled BRM42220058 has 5 regions
disk 62897702-87c7-4fc3-826b-87183288e98d on sled BRM42220058 has 6 regions
disk 62172e71-9050-463b-ad30-80d76d337351 on sled BRM42220058 has 7 regions
disk 777bc2f2-e8a0-4a2b-8c0a-5ead9ab6ee4f on sled BRM42220058 has 7 regions

disk 2d9cf575-bd19-45f6-a763-9e5ddfaacaef on sled BRM44220004 has 3 regions
disk 5582697e-95c7-4292-9eec-91d63179ed38 on sled BRM44220004 has 3 regions
disk 9b9a3f4f-f3df-4164-8662-35008db0d747 on sled BRM44220004 has 3 regions
disk a7d6083a-23b2-46ab-baf0-3ce6127afddc on sled BRM44220004 has 4 regions
disk d0f699dd-a76b-470a-adf5-91929392a84c on sled BRM44220004 has 5 regions
disk feccdf9c-c242-4996-8b49-33b5aa5c8028 on sled BRM44220004 has 5 regions
disk 35df9dcb-e727-401f-a88f-6ce191dc1fe8 on sled BRM44220004 has 7 regions
disk bf070721-dae8-4d67-b849-01cfe62fc89a on sled BRM44220004 has 8 regions
disk f1939edd-15bd-4cea-9360-46035ec2c268 on sled BRM44220004 has 8 regions
disk 8ad922e6-ad73-400d-8a58-9c9f1838837d on sled BRM44220004 has 9 regions

Using DTrace, I measured NVMe write performance relatively close to the metal. I found that the slow sled has noticeably longer delay between nvme_bd_write and bd_xfer_done:

$ dtrace -n '::nvme_bd_write:entry { start[arg1] = timestamp } ::bd_xfer_done:entry/start[arg0]/ { dt = timestamp - start[arg0]; @ = quantize(dt); start[arg0] = 0 } tick-10s { exit(0); }'
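The same script, expanded with comments for readability (timestamps are in nanoseconds):

::nvme_bd_write:entry
{
        /* key on the bd_xfer_t pointer: arg1 here, arg0 in bd_xfer_done */
        start[arg1] = timestamp;
}

::bd_xfer_done:entry
/start[arg0]/
{
        /* aggregate submit-to-completion latency as a power-of-two histogram */
        @ = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}

/* stop after ten seconds */
tick-10s { exit(0); }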
BRM44220004
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         205
           16384 |@                                        16689
           32768 |@@@@@@@                                  79189
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@                 282801
          131072 |@@@@@@                                   74074
          262144 |@                                        16300
          524288 |                                         5970
         1048576 |                                         2328
         2097152 |                                         1162
         4194304 |                                         342
         8388608 |                                         10
        16777216 |                                         0

BRM42220036
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         323
           16384 |@                                        17005
           32768 |@@@@@@@                                  83036
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@                 288937
          131072 |@@@@@@                                   69510
          262144 |@                                        14278
          524288 |                                         5020
         1048576 |                                         2101
         2097152 |                                         1102
         4194304 |                                         317
         8388608 |                                         6
        16777216 |                                         0

BRM42220034
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         25
           16384 |@                                        9891
           32768 |@@@@@                                    54437
           65536 |@@@@@@@@@@@@@@@@@@@                      189163
          131072 |@@@@@                                    54441
          262144 |@@@                                      30943
          524288 |@@@                                      26833
         1048576 |@@@                                      29047
         2097152 |@                                        10981
         4194304 |                                         1018
         8388608 |                                         77
        16777216 |                                         0

BRM42220058
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         2
           16384 |                                         903
           32768 |@@                                       5193
           65536 |@@@@@@@@                                 27197
          131072 |@@@@@@@@@@                               34524
          262144 |@@@@@@@                                  24571
          524288 |@@@@@@                                   19343
         1048576 |@@@@                                     14432
         2097152 |@@                                       8285
         4194304 |@                                        2048
         8388608 |                                         204
        16777216 |                                         3
        33554432 |                                         0

Weirdly enough, this slowdown isn't confined to a single drive; it's noticeable on every single NVMe drive on BRM42220058. Splitting writes out by NVMe namespace on just the slow sled (the script stashes nvme_bd_write's arg0, the per-namespace driver handle, and aggregates on it), here's what we see:

BRM42220058 # dtrace -q -n '::nvme_bd_write:entry { start[arg1] = timestamp; dev[arg1] = arg0; } ::bd_xfer_done:entry/start[arg0]/ { dt = timestamp - start[arg0]; @[dev[arg0]] = quantize(dt); start[arg0] = 0; dev[arg0] = 0; } tick-10s { printa("\n%8x %@i", @); exit(0); }'

fffffd39b403a000
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         52
           32768 |@@                                       307
           65536 |@@@@@@@@                                 1089
          131072 |@@@@@@@@@@@                              1537
          262144 |@@@@@@@@@@                               1335
          524288 |@@@@@                                    725
         1048576 |@@                                       245
         2097152 |                                         41
         4194304 |                                         49
         8388608 |                                         7
        16777216 |                                         1
        33554432 |                                         0

fffffd39b3d18000
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |@                                        88
           32768 |@@                                       371
           65536 |@@@@@@@@@                                1407
          131072 |@@@@@@@@@@                               1460
          262144 |@@@@@@@                                  972
          524288 |@@@@@@                                   938
         1048576 |@@@@                                     549
         2097152 |@                                        145
         4194304 |                                         29
         8388608 |                                         10
        16777216 |                                         0

fffffd39b413a000
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         137
           32768 |@@@                                      908
           65536 |@@@@@@@@@@@@                             3732
          131072 |@@@@@@@@@@@                              3345
          262144 |@@@@@@@                                  1946
          524288 |@@@@                                     1194
         1048576 |@@                                       513
         2097152 |                                         119
         4194304 |                                         60
         8388608 |                                         9
        16777216 |                                         2
        33554432 |                                         0

fffffd39b3cfa000
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         94
           32768 |@@                                       532
           65536 |@@@@@@@@@@                               3277
          131072 |@@@@@@@@@@@@                             3707
          262144 |@@@@@@@                                  2073
          524288 |@@@@@                                    1702
         1048576 |@@@                                      958
         2097152 |@                                        281
         4194304 |                                         72
         8388608 |                                         28
        16777216 |                                         2
        33554432 |                                         0

fffffd39b39aa000
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         1
           16384 |                                         109
           32768 |@@                                       533
           65536 |@@@@@@@                                  1734
          131072 |@@@@@@@@@                                2173
          262144 |@@@@@@@                                  1793
          524288 |@@@@@@@                                  1680
         1048576 |@@@@@                                    1323
         2097152 |@@                                       575
         4194304 |                                         75
         8388608 |                                         22
        16777216 |                                         1
        33554432 |                                         0

fffffd39b3ed8000
           value  ------------- Distribution ------------- count
            8192 |                                         0
           16384 |                                         76
           32768 |@                                        443
           65536 |@@@@@@@                                  2556
          131072 |@@@@@@@@@@                               3515
          262144 |@@@@@@@                                  2496
          524288 |@@@@@@                                   2136
         1048576 |@@@@@                                    1702
         2097152 |@@                                       793
         4194304 |                                         125
         8388608 |                                         18
        16777216 |                                         0

fffffd39b3e5a000
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         2
           16384 |                                         105
           32768 |@                                        654
           65536 |@@@@@@@@@@                               4420
          131072 |@@@@@@@@@@                               4703
          262144 |@@@@@@                                   2956
          524288 |@@@@@                                    2514
         1048576 |@@@@                                     1936
         2097152 |@@                                       999
         4194304 |                                         132
         8388608 |                                         31
        16777216 |                                         0

fffffd39b3f79000
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         1
           16384 |                                         40
           32768 |@                                        355
           65536 |@@@@@@                                   2037
          131072 |@@@@@@@@@                                2973
          262144 |@@@@@@                                   2142
          524288 |@@@@@@                                   2082
         1048576 |@@@@@@                                   1986
         2097152 |@@@@@                                    1826
         4194304 |@                                        451
         8388608 |                                         34
        16777216 |                                         1
        33554432 |                                         0

fffffd38f349f000
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         1
           16384 |                                         95
           32768 |@@                                       776
           65536 |@@@@@@                                   3123
          131072 |@@@@@@@@@@                               4734
          262144 |@@@@@@@                                  3502
          524288 |@@@@@@                                   2844
         1048576 |@@@@@                                    2432
         2097152 |@@@                                      1590
         4194304 |                                         236
         8388608 |                                         55
        16777216 |                                         2
        33554432 |                                         0

fffffd39b3c18000
           value  ------------- Distribution ------------- count
            4096 |                                         0
            8192 |                                         4
           16384 |                                         72
           32768 |@                                        501
           65536 |@@@@@                                    2956
          131072 |@@@@@@@@@                                5566
          262144 |@@@@@@                                   3817
          524288 |@@@@@@                                   3487
         1048576 |@@@@@                                    3269
         2097152 |@@@@@@                                   3563
         4194304 |@@                                       1142
         8388608 |                                         45
        16777216 |                                         3
        33554432 |                                         0

Taking averages instead of distributions (values in nanoseconds), we can compare the time between nvme_bd_write and bd_xfer_done across all four sleds. The slow sled is a clear outlier.
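(The exact invocation wasn't captured in these notes; presumably it was the per-device script above with quantize() swapped for avg(), along these lines:)

# dtrace -q -n '::nvme_bd_write:entry { start[arg1] = timestamp; dev[arg1] = arg0; } ::bd_xfer_done:entry/start[arg0]/ { @[dev[arg0]] = avg(timestamp - start[arg0]); start[arg0] = 0; dev[arg0] = 0; } tick-10s { printa("%8x %@d\n", @); exit(0); }'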

BRM42220036
    fffffcfa19d7f000 114710
    fffffcfa19bff000 122212
    fffffcfa197f5000 124489
    fffffcfa199bd000 127162
    fffffcfa197eb000 127567
    fffffcfa1995b000 138039
    fffffcfa1999b000 184353
    fffffcfa1968f000 219783
    fffffcf959183000 228126
    fffffcfa19dff000 309553
BRM44220004
    fffffcfa19d7f000 118372
    fffffcfa1983b000 125469
    fffffcfa199df000 130446
    fffffcfa19a9f000 147211
    fffffcfa19b3d000 151543
    fffffcfa1968f000 172503
    fffffcf959183000 181910
    fffffcfa19ddf000 190039
    fffffcfa19a81000 210390
    fffffcfa19cc1000 275047
BRM42220034
    fffffcfa18690000 114492
    fffffcfa18618000 114938
    fffffcfa18bf7000 115243
    fffffcfa18889000 118443
    fffffcfa18847000 136506
    fffffcfa188a3000 154628
    fffffcfa18cb9000 165987
    fffffcfa18c97000 200015
    fffffcfa188ad000 204768
BRM42220058
    fffffd39b413a000 384558
    fffffd39b3cfa000 416939
    fffffd39b3d18000 445947
    fffffd39b403a000 495473
    fffffd39b3e5a000 620066
    fffffd39b3ed8000 622420
    fffffd39b39aa000 655330
    fffffd38f349f000 773759
    fffffd39b3c18000 958747
    fffffd39b3f79000 1054888
mkeeter commented 3 months ago

A few other observations:

BRM42220058 has less RAM than the other machines (917478 MiB versus 1036256 MiB). It's hard to imagine how a 12% difference would cause such a dramatic performance change, though.

The slow sled is also doing slightly more writes per call to nvme_bd_write: looking at avg(args[1]->x_nblks) in ::nvme_bd_write:entry, a fast sled averages 246 blocks per call, versus 304 on the slow sled.
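(A sketch of that measurement, using the bd_xfer_t field name from illumos blkdev; the ten-second window is arbitrary:)

$ dtrace -n '::nvme_bd_write:entry { @ = avg(args[1]->x_nblks); } tick-10s { exit(0); }'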

Many things on the slow sled are slower. For example, running nvmeadm list is about 600 ms slower, and the individual functions that build and submit jobs are slower too: nvme_bd_write, nvme_create_nvm_cmd, and nvme_fill_prp are all roughly 2x slower on the bad sled (measuring from :entry to :return).

jclulow commented 3 months ago

I have replaced BRM42220058 with 0XV1:9130000019:006:BRM42220041, which at least was able to make it through a full programming cycle on zoot.

faithanalog commented 2 months ago

This was solved by https://github.com/oxidecomputer/meta/issues/418