oxidecomputer / crucible

A storage service.

Performance mystery #1100

Closed by mkeeter 8 months ago

mkeeter commented 9 months ago

Last week, James commented that he wasn't seeing the same performance improvements (e.g. for #1089) on his home cluster (the "oxide-ca region").

I tested on the Madrid minirack, and confirmed that performance wasn't nearly as good (10ish MiB/s for 4K random writes, 400-500 MiB/s for 1M and 4M random writes). Some of this is inevitable due to network latency, but I discovered that even basic fio (on a single machine) shows 10x performance differences between fast and slow machines.

(As such, this issue isn't strictly Crucible's fault, but this repo seems like a fine place for now; we can always relocate the issue later).

For these benchmarks, we are writing a 1 GiB file with

dd if=/dev/urandom of=/FILE/PATH/raw.bin bs=1M count=1024

And running the following job using fio (version 3.34):

[global]
filename=/FILE/PATH/raw.bin
time_based
runtime=60
numjobs=1
direct=1
stonewall=1

[randwrite-4K]
bs=4K
rw=randwrite
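
To run it, the job file is saved and handed to fio directly; a minimal sketch (the file name randwrite-4k.fio is our own choice):

```
fio randwrite-4k.fio
```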

We've since done experiments on a range of machines and datasets. Machines are listed below:

| Common name | Serial number | IPv6 address | proxyjump | Software |
| --- | --- | --- | --- | --- |
| C12 | BRM42220012 | fe80::eaea:6aff:fe09:82d2 | cadbury | No control plane |
| Madrid | BRM44220001 | fe80::eaea:6aff:fe09:7f66 | jeeves | Full control plane |
| gimlet-sn05 | EVT22200005 | ? | igor | Some control plane |
| London | BRM42220034 | fe80::eaea:6aff:fe09:8567 | jeeves | Full control plane |

Machines fall into "fast" or "slow" buckets, where fast machines see > 600 MiB/s, and slow machines see 50-ish MiB/s performance.

Here's a quick summary of performance:

| Machine | Unencrypted datasets | Encrypted datasets |
| --- | --- | --- |
| C12 | Fast | Fast |
| Madrid | Slow | Slow |
| gimlet-sn05 | Fast | Slow |
| London | Slow | Slower (2 MiB/s?) |

Details on testing

To test unencrypted performance, we write the 1 GiB file to the root directory of a zpool associated with one of the U.2 drives. On Madrid, this would be (for example) /oxp_1310e803-393e-4183-a075-8cad94db0a54/raw.bin.

The machines running the control plane stack also include encrypted datasets within that zpool. For example, on Madrid:

BRM44220001 # zfs list -r oxp_1310e803-393e-4183-a075-8cad94db0a54
NAME                                                                                                           USED  AVAIL     REFER  MOUNTPOINT
oxp_1310e803-393e-4183-a075-8cad94db0a54                                                                      60.3G  2.76T     58.1G  /oxp_1310e803-393e-4183-a075-8cad94db0a54
oxp_1310e803-393e-4183-a075-8cad94db0a54/crucible                                                             65.5K  2.76T     41.5K  /data
oxp_1310e803-393e-4183-a075-8cad94db0a54/crucible/regions                                                       24K  2.76T       24K  /data/regions
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt                                                                2.13G  2.76T      133K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/debug                                                          1.85G  98.1G     1.85G  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/debug
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone                                                            287M  2.76T      408K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_crucible_94dbf7ea-6019-4813-9407-3e47d9c11903          286M  2.76T      286M  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_crucible_94dbf7ea-6019-4813-9407-3e47d9c11903
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_3244882c-e91f-4645-ab19-f76259d02793   134K  2.76T      134K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_3244882c-e91f-4645-ab19-f76259d02793
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_351b1da3-fc33-4492-95b6-23448d4e5654   135K  2.76T      135K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_351b1da3-fc33-4492-95b6-23448d4e5654
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_4e4aff10-ec7f-49c3-a3b4-a18ae556b0f3   166K  2.76T      166K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_4e4aff10-ec7f-49c3-a3b4-a18ae556b0f3
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_5af23d7e-4e03-4c3f-9371-7f8d7644b949   198K  2.76T      198K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_5af23d7e-4e03-4c3f-9371-7f8d7644b949
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_a68fb129-7a51-44ab-8f4c-b1b7ea74db5f   199K  2.76T      199K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_a68fb129-7a51-44ab-8f4c-b1b7ea74db5f
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_c171d334-9692-451b-8ecb-47a3cf1de442   135K  2.76T      135K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_c171d334-9692-451b-8ecb-47a3cf1de442
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_cec8c182-49f2-43e8-bfae-1c62c15fd1d9   167K  2.76T      167K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_cec8c182-49f2-43e8-bfae-1c62c15fd1d9
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_e12bc966-844a-47fa-9991-a71a0c3cf925   135K  2.76T      135K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_e12bc966-844a-47fa-9991-a71a0c3cf925
oxp_1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_f9648b3c-5677-4e57-b850-4b20e755ba88   135K  2.76T      135K  /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/zone/oxz_propolis-server_f9648b3c-5677-4e57-b850-4b20e755ba88

To test encrypted performance on Madrid, we'd write raw.bin to somewhere in the encrypted dataset, e.g. /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/raw.bin.

It's not (just) encryption

On Alan's bench Gimlet (sn05), we see fast performance for files in the root dataset and slow performance for files in the encrypted dataset. This suggests that encryption is at fault. However, it can't be the root cause: Madrid is slow for both unencrypted and encrypted datasets, and C12 is fast for both.

Testing on C12 requires making our own encrypted dataset, since it's not running the control plane stack:

$ dd if=/dev/urandom of=/var/run/oxide/test_key.bin bs=32 count=1 # make a fake key
$ zfs create -okeyformat=raw -oencryption=aes-256-gcm -okeylocation=file:///var/run/oxide/test_key.bin cru1/crypt
$ dd if=/dev/urandom of=/cru1/crypt/raw.bin bs=1M count=1024

At this point, running the fio script on /cru1/crypt/raw.bin shows fast performance, despite the dataset being encrypted.
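
(As a sanity check that the new dataset really is encrypted and its key is loaded, one can query the standard ZFS properties, e.g.:)

```
zfs get encryption,keystatus,keyformat cru1/crypt
```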

Still, the behavior on Alan's Gimlet is certainly suspicious...

It's not ashift

Testing with ashift=9 versus ashift=12 doesn't seem to make a difference.
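
(For reference, one way to confirm which ashift a pool's vdevs are actually using is to dump the pool configuration with zdb; the pool name here is just an example:)

```
zdb -C cru1 | grep ashift
```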

It's not host OS version

C12 was running an older build (helios-2.0.22117), but updating to the same build as Madrid (helios-2.0.22356) didn't change anything; it was still fast. James did an audit of the changes between those revisions and didn't see anything suspicious.

It's not raw disk IO speed

We investigated raw disk IO speeds using the following DTrace script:

#pragma D option quiet
#pragma D option aggsortkey

/*
 * Notes:
 *   args[1]->dev_statname, blkdev8
 *   args[0]->b_edev,       73014444544
 *   args[1]->dev_pathname, /devices/pci@0,0/pci1de,fff9@3,3/pci1b96,0@0/blkdev@w0014EE81000D2E55,0
 */

io:::start
/args[0]->b_edev == 73014444544/
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        @[args[0]->b_bcount] = avg(timestamp - start[args[0]->b_edev, args[0]->b_blkno]);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

profile:::tick-1sec {
        printa(@);
        trunc(@);
        printf("\n----------------\n");
}
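
For completeness, the hard-coded b_edev above presumably comes from first counting IOs per device; a sketch of both steps (the script file name io_latency.d is our own choice):

```
# Find the busy device's b_edev and statname
dtrace -n 'io:::start { @[args[0]->b_edev, args[1]->dev_statname] = count(); }'

# Then run the latency script above, saved to a file
dtrace -s io_latency.d
```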

Here's a comparison between C12 and Madrid:

Screenshot 2024-01-12 at 1 26 39 PM

There's nothing here that would explain a 10x performance difference.

Disk IO size varies

Alan captured histograms of raw disk IO size, comparing slow versus fast on his bench Gimlet:

slow_histogram

fast_histogram

Notice that the slow system is doing many more small disk IOs.

ZIO size varies

We also see size variations at the zio_write level of the stack. The fast system is doing more record-size writes (128 KiB, 2**17) than the slow system. This was captured with the following DTrace script:

fbt::zio_write:entry
/args[1]->spa_root_vdev == (void*)0xfffffcfa5d39d000/
{
        @ = quantize(arg5);
        /* The spa_root_vdev pointer above is found by eyeballing
           @[(uint64_t)args[1]->spa_root_vdev] = count();
         */
}
profile:::tick-1sec {
        printa(@);
        clear(@);
}
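
(Expanding on the comment in the script: the spa_root_vdev pointer was found by running that aggregation as a one-liner and picking the pointer with the dominant count, roughly:)

```
dtrace -n 'fbt::zio_write:entry { @[(uint64_t)args[1]->spa_root_vdev] = count(); }'
```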

out_lin

Of course, the slow system is doing fewer writes overall. Despite that, it is also making more small calls to zio_write:

out_log

(this is the same data with a log-scale Y axis)

Combining this with Alan's graphs above, it seems like the fast system is doing a better job of aggregating record-size 128 KiB writes into 1 MiB physical writes to the disk, and can therefore do more of them (?).

leftwo commented 9 months ago

Just for the record, the actual IO sizes sent to the disk during the fast (and slow) FIO runs are not always powers of 2. Here is a histogram of the specific sizes sent to the disk during the test, and their counts. Note that no control plane software was running at the time this data was gathered, so most (all?) of the IOs should have been generated by the FIO test.

For the slow run:

   SIZE  0xSIZE   COUNT
    512     200    4081 ### 
   1024     400    2261 ## 
   1536     600     702 # 
   2048     800    1054 # 
   4096    1000     342 # 
   4608    1200     168 # 
  12288    3000     436 # 
  13312    3400     362 # 
  16384    4000     657 # 
  62464    f400     741 # 
 124928   1e800     195 # 
 131072   20000   34392 ################## 
 262144   40000   11655 ###### 
 393216   60000    6482 #### 
 524288   80000    6408 #### 
 655360   a0000    4167 ### 
 786432   c0000    3435 ## 
 917504   e0000    3267 ## 
 999424   f4000     276 # 
1048576  100000  117143 ############################################################ 

For the fast run:

   SIZE  0xSIZE   COUNT
    512     200    2996 ## 
   1024     400    2241 ## 
   1536     600    1171 # 
   2048     800    2458 ## 
   2560     a00      73 # 
   3072     c00      95 # 
   4096    1000     108 # 
  11776    2e00     480 # 
  12288    3000     222 # 
  12800    3200     339 # 
  13312    3400     319 # 
  13824    3600     100 # 
  14336    3800     409 # 
 131072   20000     347 # 
 262144   40000     100 # 
 327680   50000     177 # 
 393216   60000      90 # 
 655360   a0000     107 # 
 786432   c0000      73 # 
1048576  100000  115110 ############################################################ 
jclulow commented 9 months ago

I think it's interesting that C12 is the only machine so far that's apparently always fast, and it's also the only machine where the control plane isn't running at all.

Is the VMM reservoir configured on C12, as it would be on the other systems?

Also, in our testing on Alan's Gimlet, concurrently running both the fast and slow workloads would make both workloads slow: we were hitting an apparent limit of 10 concurrent I/O requests issued to the underlying drive, so it was saturated from a ZFS perspective (though perhaps not from an NVMe perspective).

Have we tested all of the SSDs in each of the three Gimlets to make sure they behave the same way?

mkeeter commented 9 months ago

C12 didn't have the VMM reservoir configured, according to rsrvrctl:

C12:

$ /usr/lib/rsrvrctl -q
Free KiB:       0
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1011881204

Madrid:

Free MiB:       829016
Allocated MiB:  0
Transient Allocated MiB:        0
Size limit MiB: 988165

I configured the reservoir on C12 with /usr/lib/rsrvrctl -s 829016 (which took about 3 minutes to complete!). This didn't have any impact on performance.

I've tested several of the U.2 drives on C12, and they're all performing similarly well.

leftwo commented 9 months ago

I ran this dtrace script on my bench gimlet (sn05) for both a fast and a slow FIO run.

#pragma D option quiet

/*
 * Notes:
 *   args[1]->dev_statname, blkdev8
 *   args[0]->b_edev,       73014444544
 *   args[1]->dev_pathname, /devices/pci@0,0/pci1de,fff9@3,3/pci1b96,0@0/blkdev@w0014EE81000D2E55,0
 */

io:::start
/args[0]->b_edev == 73014444544/
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

/*
 * First column is read or write
 * Second column is IO size
 * Third column is start time in ns for the IO
 * Fourth column is end time in ns for the IO
 * Fifth column is how long the IO took in ns.
 */
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        done = timestamp;
        this->elapsed = done - start[args[0]->b_edev, args[0]->b_blkno];
        printf("%2s %9d %9u %9u %9u\n",
            args[0]->b_flags & B_READ ? "R" : "W",
            args[0]->b_bcount,
            start[args[0]->b_edev, args[0]->b_blkno],
            done,
            this->elapsed
        );
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

I took the resulting data files and made a graph of the first second of IOs. The X axis is the time the IO started (with time 0 as the first IO), Y axis is the time the IO took to finish.

1 second of IOs from the fast run: IO_Lat99_fast_1sec

1 second of IOs from the slow run: IO_Lat99_slow_1sec

mkeeter commented 9 months ago

I like this idea, but am skeptical about getting slow / fast data from SN05 alone: because the "slow" condition has encryption enabled, I think it may be comparing apples and oranges.

Using your DTrace script on C12 versus Madrid (which are both unencrypted), I get some interesting results:

c12_io_speed

madrid_io_speed

This gives credence to @jmpesp's theory that the slow condition is flushing more often – or doing some other form of periodic work!

Python script for plotting:

```python
import matplotlib.pylab as plt
import matplotlib.cm as cm
import sys
import numpy as np

s = open(sys.argv[1]).read().split('\n')
data = np.array([[int(v) for v in line.split()[1:]] for line in s if line])

# We only care about the first second of data
t = np.where((data[:,1] - data[0,1]) > 3e9)[0][0]

plt.figure(figsize=(10, 6), layout='constrained')
plt.scatter(
    (data[:t,1] - data[0,1]) / 1e6,
    data[:t,3] / 1e6,
    c=np.log2(data[:t,0]),
    s=3,
    cmap=cm.turbo)
plt.xlabel('start time (ms)')
plt.ylabel('IO time (ms)')
plt.colorbar(label='IO size (log2 bytes)')
if len(sys.argv) > 2:
    plt.title(sys.argv[2])
else:
    plt.title(sys.argv[1])
plt.savefig(sys.argv[1].split('.')[0] + '.png', bbox_inches='tight', dpi=300)
plt.show()
```
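
Usage is just the captured DTrace output plus an optional title; the script and data file names below are our own choices (picked so that savefig would produce the c12_io_speed.png shown above):

```
python3 plot_io.py c12_io_speed.txt "C12 (unencrypted)"
```
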
mkeeter commented 9 months ago

Modifying the DTrace script to capture when spa_sync is called, it turns out that it lines up perfectly with the periodicity of the write IOs!

c12_io_speed_sync

madrid_io_speed_sync

Scripts:

```dtrace
#pragma D option quiet

/*
 * Notes:
 *   args[1]->dev_statname, blkdev8
 *   args[0]->b_edev,       73014444352
 *   args[1]->dev_pathname, /devices/pci@0,0/pci1de,fff9@3,3/pci1b96,0@0/blkdev@w0014EE81000D2E55,0
 */

io:::start
/args[0]->b_edev == 73014444352/
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

fbt::spa_sync:entry
/(uintptr_t)args[0] == 0xfffffcfa805c3000/
{
        printf("%2s %9d %9u %9u %9u\n", "S", 0, timestamp, 0, 0);
}

/*
 * First column is read or write
 * Second column is time in ns for the IO
 * Third column is size of the IO
 */
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        done = timestamp;
        this->elapsed = done - start[args[0]->b_edev, args[0]->b_blkno];
        printf("%2s %9d %9u %9u %9u\n",
            args[0]->b_flags & B_READ ? "R" : "W",
            args[0]->b_bcount,
            start[args[0]->b_edev, args[0]->b_blkno],
            done,
            this->elapsed
        );
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
```

I found the pointer for `spa_sync` with `dtrace -n 'fbt::spa_sync:entry{ @[args[0]->spa_name, args[0]] = count(); }'`, then looked for the value matching the expected pool name.

```python
import matplotlib.pylab as plt
import matplotlib.cm as cm
import sys
import numpy as np

s = open(sys.argv[1]).read().split('\n')
data = np.array([[int(v) for v in line.split()[1:]]
                 for line in s if line and line.split()[0] != 'S'])
syncs = np.array([int(line.split()[2])
                  for line in s if line and line.split()[0] == 'S'])
syncs -= data[0,1]

# We only care about the first second of data
t = np.where((data[:,1] - data[0,1]) > 3e9)[0][0]

plt.figure(figsize=(10, 6), layout='constrained')
plt.vlines(syncs[syncs < 3e9] / 1e6, -0.5, -0.1, zorder=0)
plt.scatter(
    (data[:t,1] - data[0,1]) / 1e6,
    data[:t,3] / 1e6,
    c=np.log2(data[:t,0]),
    s=3,
    cmap=cm.turbo)
plt.xlabel('start time (ms)')
plt.ylabel('IO time (ms)')
plt.colorbar(label='IO size (log2 bytes)')
plt.ylim(-0.5, 4)
plt.yticks(np.arange(0, 4.1, 0.5))
if len(sys.argv) > 2:
    plt.title(sys.argv[2])
else:
    plt.title(sys.argv[1])
plt.savefig(sys.argv[1].split('.')[0] + '.png', bbox_inches='tight', dpi=300)
plt.show()
```

jclulow commented 9 months ago

I seem to recall, from when Alan and I were looking at the write throttle behavior that was causing the slow path to go to sleep, that before sleeping it would pretty aggressively try to start syncing out the txg. So the frequent syncing may not necessarily be the cause so much as a symptom.

jclulow commented 9 months ago

Also, it would help to fix the I/O time axis on the left so that the two graphs are the same height.

mkeeter commented 9 months ago

I tried adjusting the write throttle by bringing zfs_dirty_data_max from 4 GiB to 16 GiB:

mdb -kwe 'zfs_dirty_data_max/z0x400000000'
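
(To confirm the new value took, it can be read back as an unsigned 64-bit decimal with mdb's E format; a sketch:)

```
mdb -ke 'zfs_dirty_data_max/E'
```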

On C12 (the fast machine), this boosted performance from 650 MiB/s up to about 939 MiB/s. In the IO trace, I see that spa_sync is called much less frequently (notice the X axis is different in this graph!)

c12_io_speed_16G

Running dirty.d from this blog post, I see that we're using about 1 GiB during the fio load condition:

CPU     ID                    FUNCTION:NAME
 76  19129      txg_sync_thread:txg-syncing 1025MB of 16384MB used
103  19129      txg_sync_thread:txg-syncing 1025MB of 16384MB used
120  19129      txg_sync_thread:txg-syncing 1025MB of 16384MB used
 86  19129      txg_sync_thread:txg-syncing 1025MB of 16384MB used
110  19129      txg_sync_thread:txg-syncing 1025MB of 16384MB used
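
(For reference, assuming dirty.d is the unmodified script from that post, it takes the pool name as its argument; on C12 the invocation would look roughly like this:)

```
dtrace -s dirty.d cru1
```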

On Madrid, after making the same adjustment (setting zfs_dirty_data_max to 16 GiB), I see no change in performance – it's still about 50 MiB/s running the fio load in an unencrypted dataset. The spa_sync interval is also the same as before, other than what looks like a longer delay before the first spa_sync:

madrid_io_speed_16G

Running dirty.d, I see that it levels out at a smaller value (which isn't surprising, given that we're finishing transaction groups more frequently):

CPU     ID                    FUNCTION:NAME
 86  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
  1  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
 43  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
 18  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
 19  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
 18  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used
 18  14915      txg_sync_thread:txg-syncing  256MB of 16384MB used

It's not clear to me why Madrid is choosing to sync transaction groups more quickly. I don't see anyone calling fsync every 400-ish milliseconds (checked with dtrace -n 'fbt::zfs_sync:entry'), and we're well below 60% of the dirty data limit.

mkeeter commented 9 months ago

We've arrived at a quasi-root cause: the ARC's target size on slow machines is tiny, causing cache evictions, frequent txg syncs, and general badness.

How we got there: Madrid is syncing transaction groups because dmu_tx_try_assign is failing with ERESTART. Tracking this through the ZFS source, the ERESTART is coming from dsl_dir_tempreserve_space -> arc_tempreserve_space -> this chunk of code:

    /*
     * Throttle writes when the amount of dirty data in the cache
     * gets too large.  We try to keep the cache less than half full
     * of dirty blocks so that our sync times don't grow too large.
     *
     * In the case of one pool being built on another pool, we want
     * to make sure we don't end up throttling the lower (backing)
     * pool when the upper pool is the majority contributor to dirty
     * data. To insure we make forward progress during throttling, we
     * also check the current pool's net dirty data and only throttle
     * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
     * data in the cache.
     *
     * Note: if two requests come in concurrently, we might let them
     * both succeed, when one of them should fail.  Not a huge deal.
     */
    uint64_t total_dirty = reserve + arc_tempreserve + anon_size;
    uint64_t spa_dirty_anon = spa_dirty_data(spa);

    if (total_dirty > arc_c * zfs_arc_dirty_limit_percent / 100 &&
        anon_size > arc_c * zfs_arc_anon_limit_percent / 100 &&
        spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {
        uint64_t meta_esize =
            zfs_refcount_count(
            &arc_anon->arcs_esize[ARC_BUFC_METADATA]);
        uint64_t data_esize =
            zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
        dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
            "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
            arc_tempreserve >> 10, meta_esize >> 10,
            data_esize >> 10, reserve >> 10, arc_c >> 10);
        return (SET_ERROR(ERESTART));
    }

Looking at the ARC stats on London (kstat -m zfs -n arcstats), the issue is that the ARC has a target size (c) of 1 GiB, and an actual size (size) of 50 GiB:

BRM42220034 # kstat -m zfs -n arcstats
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        access_skip                     47
        anon_evictable_data             0
        anon_evictable_metadata         0
        anon_size                       3166208
        arc_meta_limit                  237425426432
        arc_meta_max                    2249081056
        arc_meta_min                    536870912
        arc_meta_used                   1795827616
        async_upgrade_sync              39
        c                               1073741824
        c_max                           949701705728
        c_min                           1073741824
        compressed_size                 26597786624
        crtime                          70.394662000
        data_size                       52631310848
        deleted                         967716
        demand_data_hits                125562715
        demand_data_misses              12808504
        demand_hit_predictive_prefetch  17427
        demand_hit_prescient_prefetch   0
        demand_metadata_hits            1138481485
        demand_metadata_misses          42599983
        evict_l2_cached                 0
        evict_l2_eligible               482089895424
        evict_l2_eligible_mfu           296411142144
        evict_l2_eligible_mru           185678753280
        evict_l2_ineligible             766464
        evict_l2_skip                   0
        evict_not_enough                107171
        evict_skip                      9692
        hash_chain_max                  3
        hash_chains                     230
        hash_collisions                 139875
        hash_elements                   254849
        hash_elements_max               414225
        hdr_size                        96791312
        hits                            1264046167
        ... l2 details elided, since they're all zero
        memory_throttle_count           0
        metadata_size                   1160457216
        mfu_evictable_data              0
        mfu_evictable_metadata          0
        mfu_ghost_evictable_data        816718848
        mfu_ghost_evictable_metadata    74317312
        mfu_ghost_hits                  141
        mfu_ghost_size                  891036160
        mfu_hits                        1167362146
        mfu_size                        2177381376
        misses                          55450820
        mru_evictable_data              0
        mru_evictable_metadata          0
        mru_ghost_evictable_data        0
        mru_ghost_evictable_metadata    0
        mru_ghost_hits                  5
        mru_ghost_size                  0
        mru_hits                        96682060
        mru_size                        51611220480
        mutex_miss                      11799
        other_size                      419793808
        overhead_size                   27193981440
        p                               617493314
        prefetch_data_hits              0
        prefetch_data_misses            25675
        prefetch_metadata_hits          1967
        prefetch_metadata_misses        16658
        size                            54427138464
        snaptime                        694642.204181386
        uncompressed_size               27195558912

On C12, meanwhile, the ARC has a target size of 884 GiB (lol) and an actual size of 2 GiB (makes sense, because we're not doing much with it).
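
(Rather than eyeballing the full arcstats dump, the two counters of interest can also be pulled out selectively using kstat's module:instance:name:statistic form, e.g.:)

```
kstat -p zfs:0:arcstats:c zfs:0:arcstats:size
```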

::memstat on London reveals that we've still got 100 GiB of RAM free:

BRM42220034 # mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   11583280             45247    4%
Boot pages                     13                 0    0%
ZFS File Data            12972040             50672    5%
VMM Memory              212228096            829016   80%
Anon                      1515661              5920    1%
Exec and libs              127864               499    0%
Page cache                  95109               371    0%
Free (cachelist)           439759              1717    0%
Free (freelist)          26321505            102818   10%

Total                   265283327           1036262
Physical                265283325           1036262

It's not clear whether the ARC is behaving as intended here, but it seems to be the root cause of our slowdown compared to C12.


Meanwhile, on C12, after we set the VMM reservoir limits, we still see an ARC target size of 884 GiB, which may also be a bug (or the same bug, with a different manifestation?)

BRM42220012 # mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   10352847             40440    4%
Boot pages                     13                 0    0%
ZFS File Data             1008656              3940    0%
Anon                        32328               126    0%
Exec and libs                7818                30    0%
Page cache                   4306                16    0%
Free (cachelist)             7545                29    0%
Free (freelist)         253868278            991672   96%

Total                   265281791           1036256
Physical                265281789           1036256
BRM42220012 # /usr/lib/rsrvrctl -q
Free KiB:       0
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 1011881204
BRM42220012 # /usr/lib/rsrvrctl -s 829016
BRM42220012 # mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                   10352964             40441    4%
Boot pages                     13                 0    0%
ZFS File Data             1008666              3940    0%
VMM Memory              212228096            829016   80%
Anon                        32341               126    0%
Exec and libs                7818                30    0%
Page cache                   4306                16    0%
Free (cachelist)             7555                29    0%
Free (freelist)          41640032            162656   16%

Total                   265281791           1036256
Physical                265281789           1036256
BRM42220012 # kstat -m zfs -n arcstats
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        access_skip                     8
        anon_evictable_data             0
        anon_evictable_metadata         0
        anon_size                       171008
        arc_meta_limit                  237424050176
        arc_meta_max                    327241928
        arc_meta_min                    59356012544
        arc_meta_used                   139695016
        async_upgrade_sync              0
        c                               949696200704
        c_max                           949696200704
        c_min                           118712025088
        compressed_size                 1088078336
        crtime                          409.405263066
        data_size                       2151949312
        deleted                         86
        demand_data_hits                1734215
        demand_data_misses              160287
        demand_hit_predictive_prefetch  4
        demand_hit_prescient_prefetch   0
        demand_metadata_hits            126546483
        demand_metadata_misses          1033399
        evict_l2_cached                 0
        evict_l2_eligible               1458688
        evict_l2_eligible_mfu           0
        evict_l2_eligible_mru           1458688
        evict_l2_ineligible             12288
        evict_l2_skip                   0
        evict_not_enough                0
        evict_skip                      27
        hash_chain_max                  1
        hash_chains                     0
        hash_collisions                 242
        hash_elements                   9692
        hash_elements_max               9697
        hdr_size                        5906744
        hits                            128280698
        ... l2 details elided, since they're all zero
        memory_throttle_count           0
        metadata_size                   124881408
        mfu_evictable_data              0
        mfu_evictable_metadata          37888
        mfu_ghost_evictable_data        0
        mfu_ghost_evictable_metadata    0
        mfu_ghost_hits                  0
        mfu_ghost_size                  0
        mfu_hits                        123513723
        mfu_size                        13667328
        misses                          1194132
        mru_evictable_data              0
        mru_evictable_metadata          41984
        mru_ghost_evictable_data        0
        mru_ghost_evictable_metadata    0
        mru_ghost_hits                  0
        mru_ghost_size                  0
        mru_hits                        4766975
        mru_size                        2262992384
        mutex_miss                      0
        other_size                      7654064
        overhead_size                   1188752384
        p                               474848100352
        prefetch_data_hits              0
        prefetch_data_misses            0
        prefetch_metadata_hits          0
        prefetch_metadata_misses        446
        size                            2291644328
        snaptime                        29515.764225355
        uncompressed_size               1190047744

https://github.com/oxidecomputer/stlouis/issues/466 may also be relevant!

leftwo commented 9 months ago

To also close the loop on the issue I was seeing on my bench gimlet (sn05): we determined it was slightly different. In my case, I was comparing IO to a file on an encrypted dataset against IO to a file on an unencrypted dataset.

This issue does deserve more analysis, but because we also discovered that Crucible volumes are not on encrypted datasets, I feel the urgency is somewhat lowered as far as Crucible performance is concerned. The actual region where Crucible data is stored does not exhibit the same 10x slowdown I measured on sn05.

mkeeter commented 9 months ago

The Madrid mini-rack was conveniently power-cycled last night. When it came back up, fio performance was fast again.

This lines up with the theory that ARC pressure is causing the bad fio performance: on a "fast" machine, the whole 1 GiB file gets loaded into the ARC, while on slow machines the ARC is under high pressure.

After doing some baseline Crucible tests, I noticed that the ARC was back to clamping arc_c at 1 GiB (on both the Upstairs and Downstairs sleds). This was surprising to me, so I decided to watch the relevant kstats on the Downstairs sled.

kstat -m zfs -n arcstats -s'/^c$|^size$/' 1

The resulting graph is interesting:

Screenshot 2024-01-18 at 11 44 42 AM

arc_c grows along with size to a point, then the two diverge and arc_c falls to its minimum value (1 GiB) while size continues growing until about 54 GiB.


I can also reproduce this on a single sled (C12) after configuring the VMM reservoir, dump, and swap. To reproduce, I created a 64 GiB file with dd if=/dev/urandom of=/cru1/chonk.bin bs=1G count=64 (this took a long time).

Then, doing a long-running (> 120 second) fio job on that file eventually fills up the ARC and shows the same divergence:

[global]
filename=/cru1/chonk.bin
time_based
runtime=180
numjobs=1
direct=1

[randwrite-4K]
bs=4K
rw=randwrite

Screenshot 2024-01-18 at 2 04 41 PM

(Unlike the Madrid Downstairs sled, arc_c starts at 884 GiB here, because this was a completely fresh boot)


My naive impression is that this looks like a failure of the ARC size control loop: I would expect size to be driven down to <= arc_c, but the system remains in a state where size is > 50 GiB and arc_c is 1 GiB.

My expectations could be wrong, so I'm going to be looking at the ARC source code next.

mkeeter commented 9 months ago

One more small update from the coal face: the ARC is getting stuck at 50 GiB because there's apparently an eviction moratorium in effect.

BRM42220012 # kstat -m zfs -n arcstats | grep evictable
        anon_evictable_data             0
        anon_evictable_metadata         0
        mfu_evictable_data              0
        mfu_evictable_metadata          0
        mfu_ghost_evictable_data        0
        mfu_ghost_evictable_metadata    0
        mru_evictable_data              0
        mru_evictable_metadata          0
        mru_ghost_evictable_data        0
        mru_ghost_evictable_metadata    0

I'm not sure why the ARC is refusing to evict 50-some GiB of data; as far as I can tell, this means that someone is keeping a reference to the buffers.

Quoth arc.c:

At any given moment, some subset of the blocks in the cache are un-evictable because we have handed out a reference to them. Blocks are only evictable when there are no external references active. This makes eviction far more problematic: we choose to evict the evictable blocks that are the "lowest" in the list.

There are times when it is not possible to evict the requested space. In these circumstances we are unable to adjust the cache size. To prevent the cache growing unbounded at these times we implement a "cache throttle" that slows the flow of new data into the cache until we can make space available.

This matches what we're seeing: we've got unevictable blocks, our actual cache size is well above the target, and we're being throttled.

The question is why the blocks are unevictable, since I don't know who would be keeping a reference to them.

Suspicions about bad references are exacerbated by the fact that half of the ARC data is in overhead_size = 27193981440. I believe this represents uncompressed versions of ARC data (and it's about the same size as the compressed data because we're working with random, incompressible data).

In the source, overhead_size is described as:

This is classified as "overhead" since this data is typically short-lived and will be evicted from the arc when it becomes unreferenced

However, it appears to be far from short-lived: those 25 GiB of overhead are sticking around indefinitely!

mkeeter commented 9 months ago

Booting into a debug kernel and tweaking some knobs (see https://github.com/oxidecomputer/stlouis/issues/528 for why we have to disable prefetch):

mdb -kwe zfs_prefetch_disable/z1
mdb -kwe reference_tracking_enable/z1

We can now see reference count metadata in the Bad State (triggered by the same fio run as before):

> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t
{
    ref_link = {
        list_next = 0xfffffd087edae958
        list_prev = 0xfffffd087edae958
    }
    ref_holder = 0xfffffd087edadd18
    ref_number = 0x1
    ref_removed = 0xbaddcafebaddcafe
}
{
    ref_link = {
        list_next = 0xfffffd082d97e330
        list_prev = 0xfffffd082d97e330
    }
    ref_holder = 0xfffffd082d979d70
    ref_number = 0x1
    ref_removed = 0xbaddcafebaddcafe
}
{
    ref_link = {
        list_next = 0xfffffd081dbedbf0
        list_prev = 0xfffffd081dbedbf0
    }
    ref_holder = 0xfffffd081dbef3e8
    ref_number = 0x1
    ref_removed = 0xbaddcafebaddcafe
}
{
    ref_link = {
        list_next = 0xfffffd07bd516328
        list_prev = 0xfffffd07bd516328
>> More [<space>, <cr>, q, n, c, a] ?

It's not immediately obvious who the reference holders are, though: ref_holder is a void*, and various things can go in there depending on the caller.

> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t ref_holder ! sort
...lots of stuff
ref_holder = 0xfffffd1782034e48
ref_holder = 0xfffffd1782039170
ref_holder = 0xfffffd1782039400
ref_holder = 0xfffffd17820397d8
ref_holder = 0xfffffd1782039a68
ref_holder = 0xfffffd1782039cf8
ref_holder = 0xfffffd178203c020
ref_holder = 0xfffffd178203c3f8
ref_holder = 0xfffffd178203c688
ref_holder = 0xfffffd178203c918
ref_holder = 0xfffffd178203ce38
ref_holder = 0xfffffd1782042160
ref_holder = 0xfffffd17820423f0
ref_holder = 0xfffffd17820427c8
ref_holder = 0xfffffd1782042a58
ref_holder = 0xfffffd1782042ce8
ref_holder = 0xfffffd1782049010
ref_holder = 0xfffffd17820492a0
ref_holder = 0xfffffd1782049530
ref_holder = 0xfffffd1782049678
ref_holder = 0xfffffd1782f242d8
ref_holder = 0xfffffd1784fb1418
ref_holder = 0xfffffd17856b9ac8
ref_holder = 0xfffffd17856bd310
ref_holder = 0xfffffd17856bd5a0
ref_holder = 0xfffffd17856bdc08

Interestingly, the most common gap between consecutive ref_holder values is 328 bytes. I tried to find a struct that matched, but didn't have any luck.
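
(One way to hunt for a matching structure size is mdb's ::sizeof, checking candidate types one at a time; a sketch, assuming CTF data is available for the types being checked:)

```
echo '::sizeof arc_buf_t' | mdb -k
echo '::sizeof dmu_buf_impl_t' | mdb -k
```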

bcantrill commented 9 months ago

Piping the ref_holder to ::whatis allows us to determine the cache that it's in -- and because the reference tracking necessitates a DEBUG kernel anyway, we also get full stack backtraces. Running this against a bench Gimlet that has run the prescribed fio run is revealing:

> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t ref_holder | ::whatis
...
fffffd0577ffe668 is allocated from dmu_buf_impl_t:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffd0571290a60 fffffd0577ffe668      3821c979ab3 fffffd04678b20e0
                 fffffd03157ba008 fffffcfc7d099b00                0
                 kmem_cache_alloc_debug+0x37c
                 kmem_cache_alloc+0x253
                 dbuf_create+0x6b
                 dbuf_hold_impl+0x461
                 dbuf_hold_level+0x3e
                 dmu_tx_check_ioerr+0x56
                 dmu_tx_count_write+0x19c
                 dmu_tx_hold_write+0x98
                 zfs_write+0x4d0
                 fop_write+0x5d
                 pwrite+0x172

fffffd0577ffe7b0 is allocated from dmu_buf_impl_t:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffd0571290988 fffffd0577ffe7b0      3821c8dffc4 fffffd04678b20e0
                 fffffd03157ba008 fffffcfc7d091b80                0
                 kmem_cache_alloc_debug+0x37c
                 kmem_cache_alloc+0x253
                 dbuf_create+0x6b
                 dbuf_hold_impl+0x461
                 dbuf_hold_level+0x3e
                 dmu_tx_check_ioerr+0x56
                 dmu_tx_count_write+0x19c
                 dmu_tx_hold_write+0x98
                 zfs_write+0x4d0
                 fop_write+0x5d
                 pwrite+0x172

fffffd0577ffe8f8 is allocated from dmu_buf_impl_t:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffd057128b530 fffffd0577ffe8f8      3821c838065 fffffd04678b20e0
                 fffffd03157ba008 fffffcfc7d089c00                0
                 kmem_cache_alloc_debug+0x37c
                 kmem_cache_alloc+0x253
                 dbuf_create+0x6b
                 dbuf_hold_impl+0x461
                 dbuf_hold_level+0x3e
                 dmu_tx_check_ioerr+0x56
                 dmu_tx_count_write+0x19c
                 dmu_tx_hold_write+0x98
                 zfs_write+0x4d0
                 fop_write+0x5d
                 pwrite+0x172

fffffd0577ffea40 is allocated from dmu_buf_impl_t:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
fffffd057128b458 fffffd0577ffea40      3821c79c7fd fffffd04678b20e0
                 fffffd03157ba008 fffffcfc7d081c80                0
                 kmem_cache_alloc_debug+0x37c
                 kmem_cache_alloc+0x253
                 dbuf_create+0x6b
                 dbuf_hold_impl+0x461
                 dbuf_hold_level+0x3e
                 dmu_tx_check_ioerr+0x56
                 dmu_tx_count_write+0x19c
                 dmu_tx_hold_write+0x98
                 zfs_write+0x4d0
                 fop_write+0x5d
                 pwrite+0x172

So why are these dbufs still held? The answer is that there is a cache that effectively fronts the ARC -- the dbuf cache -- and no dbuf that is in the (LRU) dbuf cache can be evicted from the ARC. From dbuf.c:

/*  
 * The LRU dbuf cache uses a three-stage eviction policy:
 *      - A low water marker designates when the dbuf eviction thread
 *      should stop evicting from the dbuf cache.
 *      - When we reach the maximum size (aka mid water mark), we
 *      signal the eviction thread to run.
 *      - The high water mark indicates when the eviction thread
 *      is unable to keep up with the incoming load and eviction must
 *      happen in the context of the calling thread.
 *
 * The dbuf cache:
 *                                                 (max size)
 *                                      low water   mid water   hi water
 * +----------------------------------------+----------+----------+
 * |                                        |          |          |
 * |                                        |          |          |
 * |                                        |          |          |
 * |                                        |          |          |
 * +----------------------------------------+----------+----------+
 *                                        stop        signal     evict
 *                                      evicting     eviction   directly
 *                                                    thread
 *
 * The high and low water marks indicate the operating range for the eviction
 * thread. The low water mark is, by default, 90% of the total size of the
 * cache and the high water mark is at 110% (both of these percentages can be
 * changed by setting dbuf_cache_lowater_pct and dbuf_cache_hiwater_pct,
 * respectively). The eviction thread will try to ensure that the cache remains
 * within this range by waking up every second and checking if the cache is
 * above the low water mark. The thread can also be woken up by callers adding
 * elements into the cache if the cache is larger than the mid water (i.e max
 * cache size). Once the eviction thread is woken up and eviction is required,
 * it will continue evicting buffers until it's able to reduce the cache size
 * to the low water mark. If the cache size continues to grow and hits the high
 * water mark, then callers adding elements to the cache will begin to evict
 * directly from the cache until the cache is no longer above the high water
 * mark.
 */

There are several problems here, but one clear one: the dbuf cache sizes itself based (effectively) on physmem -- and doesn't adjust its high- and low-water marks to memory or ARC pressure. This is particularly acute with the reservoir, which (in our case) takes 80% of physical memory. Our 1 TB of physical memory leaves us with an arc_c_max of 884 GB; the dbuf cache sizes itself at 1/32nd of this -- or ~27 GB. An interesting experiment (and one that is safe) would be to take a sled that is degraded and lower its dbuf_cache_max_bytes to (say) 5 GB to see if that results in some ARC release and expansion. More to come!
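
(For concreteness, that experiment amounts to writing 5 GiB (0x140000000) into dbuf_cache_max_bytes, the same adjustment that ends up being applied below:)

```
mdb -kwe 'dbuf_cache_max_bytes/z0x140000000'
```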

bcantrill commented 9 months ago

Further investigation reveals that while the dbuf cache sizing is our most immediate issue (and the cause of the unevictable data), another problem -- and the reason for the low c in the first place -- is that swapfs_minfree is absolutely loco (128 GiB!), resulting in the system perceiving memory pressure that isn't, in fact, there. This is a bug; I have filed https://www.illumos.org/issues/16202. It still feels like the dbuf clogging needs to be addressed (along with one other grievous misuse of memory that I discovered along the way); I'll get illumos bugs filed on both.

mkeeter commented 8 months ago

Closing the loop here, I adjusted

mdb -kwe 'swapfs_minfree/z0x20000' # 512 MiB
mdb -kwe 'dbuf_cache_max_bytes/z0x140000000' # 5 GiB

and re-ran the fio job on the large (64 GiB) file, twice.

arc_size grows as you'd expect, and arc_c remains up at 884 GiB (way off the top of the graph):

Screenshot 2024-01-22 at 2 45 27 PM

The ARC shows plenty of evictable data:

BRM42220012 # kstat -m zfs -n arcstats | grep evicta
        anon_evictable_data             0
        anon_evictable_metadata         0
        mfu_evictable_data              235405312
        mfu_evictable_metadata          44544
        mfu_ghost_evictable_data        20354696192
        mfu_ghost_evictable_metadata    5705728
        mru_evictable_data              89509593088
        mru_evictable_metadata          57336320
        mru_ghost_evictable_data        2082471936
        mru_ghost_evictable_metadata    288724480

Hopping over and running the original fio job (on a 1 GiB file), I see good performance of 633 MiB/s.

I think this resolves the filesystem-level performance mystery, although there are still questions remaining about networked performance; we can defer those to a separate ticket.

mkeeter commented 8 months ago

The illumos fixes are now shipped, so the mystery is resolved!