Just for the record, the actual IO sizes sent to the disk during the fast (and slow) FIO runs are not always powers of 2. Here is a histogram of the specific sizes sent to the disk during the test and their count. Note that no control plane software was running at the time this data was gathered so most (all?) of the IOs should be generated as a result of the FIO test.
For the slow run:
SIZE 0xSIZE COUNT
512 200 4081 ###
1024 400 2261 ##
1536 600 702 #
2048 800 1054 #
4096 1000 342 #
4608 1200 168 #
12288 3000 436 #
13312 3400 362 #
16384 4000 657 #
62464 f400 741 #
124928 1e800 195 #
131072 20000 34392 ##################
262144 40000 11655 ######
393216 60000 6482 ####
524288 80000 6408 ####
655360 a0000 4167 ###
786432 c0000 3435 ##
917504 e0000 3267 ##
999424 f4000 276 #
1048576 100000 117143 ############################################################
For the fast run:
SIZE 0xSIZE COUNT
512 200 2996 ##
1024 400 2241 ##
1536 600 1171 #
2048 800 2458 ##
2560 a00 73 #
3072 c00 95 #
4096 1000 108 #
11776 2e00 480 #
12288 3000 222 #
12800 3200 339 #
13312 3400 319 #
13824 3600 100 #
14336 3800 409 #
131072 20000 347 #
262144 40000 100 #
327680 50000 177 #
393216 60000 90 #
655360 a0000 107 #
786432 c0000 73 #
1048576 100000 115110 ############################################################
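For reference, a histogram like the ones above can be gathered with a small DTrace aggregation over the io provider (a hypothetical sketch, not necessarily how this data was collected):

io:::start
{
    /* Key the aggregation on the exact IO size handed to the device. */
    @sizes[args[0]->b_bcount] = count();
}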
I think it's interesting that C12 is the only machine so far that's apparently always fast, and it's also the only machine where the control plane isn't running at all.
Is the VMM reservoir configured on C12, as it would be on the other systems?
Also, in our testing on Alan's Gimlet, concurrently running both the fast and slow workloads made both workloads slow, because we were hitting an apparent limit of 10 concurrent I/O requests issued to the underlying drive; it was saturated from a ZFS perspective (though perhaps not from an NVMe perspective).
Have we tested all of the SSDs in each of the three Gimlets to make sure they behave the same way?
C12 didn't have the VMM reservoir configured, according to rsrvrctl:
C12:
$ /usr/lib/rsrvrctl -q
Free KiB: 0
Allocated KiB: 0
Transient Allocated KiB: 0
Size limit KiB: 1011881204
Madrid:
Free MiB: 829016
Allocated MiB: 0
Transient Allocated MiB: 0
Size limit MiB: 988165
I configured the reservoir on C12 with /usr/lib/rsrvrctl -s 829016 (which took about 3 minutes to complete!). This didn't have any impact on performance.
I've tested several of the U.2 drives on C12, and they're all performing similarly well.
I ran this dtrace script on my bench gimlet (sn05) for both a fast and a slow FIO run.
#pragma D option quiet
/*
* Notes:
* args[1]->dev_statname, blkdev8
* args[0]->b_edev, 73014444544
* args[1]->dev_pathname, /devices/pci@0,0/pci1de,fff9@3,3/pci1b96,0@0/blkdev@w0014EE81000D2E55,0
*/
io:::start
/args[0]->b_edev == 73014444544/
{
start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}
/*
* First column is read or write
* Second column is IO size
* Third column is start time in ns for the IO
* Fourth column is end time in ns for the IO
* Fifth column is how long the IO took in ns.
*/
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
done = timestamp;
this->elapsed = done - start[args[0]->b_edev, args[0]->b_blkno];
printf("%2s %9d %9u %9u %9u\n",
args[0]->b_flags & B_READ ? "R" : "W",
args[0]->b_bcount,
start[args[0]->b_edev, args[0]->b_blkno],
done,
this->elapsed
);
start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
I took the resulting data files and made a graph of the first second of IOs. The X axis is the time the IO started (with time 0 as the first IO), Y axis is the time the IO took to finish.
1 second of IOs from the fast run:
1 second of IOs from the slow run:
I like this idea, but am skeptical about getting slow / fast data from SN05 alone: because the "slow" condition has encryption enabled, I think it may be comparing apples and oranges.
Using your DTrace script on C12 versus Madrid (which are both unencrypted), I get some interesting results:
This gives credence to @jmpesp's theory that the slow condition is flushing more often – or doing some other form of periodic work!
Modifying the DTrace script to capture when spa_sync is called, it turns out that it lines up perfectly with the periodicity of the write IOs!
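A minimal sketch of such a clause (assuming the fbt provider's spa_sync:entry probe; add it to the IO-tracing script above):

fbt::spa_sync:entry
{
    /* arg1 is the txg being synced; print its start time for correlation */
    printf("S %9d %9u\n", arg1, timestamp);
}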
I seem to recall, from when Alan and I were looking at the write throttle code that was causing the slow path to go to sleep, that before sleeping it would pretty aggressively try to start syncing the txg out. So it may not necessarily be the cause so much as a symptom.
Also, it would help to fix the I/O time axis on the left so that the two graphs are the same height.
I tried adjusting the write throttle by bringing zfs_dirty_data_max from 4 GiB to 16 GiB:
mdb -kwe 'zfs_dirty_data_max/z0x400000000'
On C12 (the fast machine), this boosted performance from 650 MiB/s up to about 939 MiB/s. In the IO trace, I see that spa_sync is called much less frequently (notice the X axis is different in this graph!)
Running dirty.d from this blog post, I see that we're using about 1 GiB during the fio load condition:
CPU ID FUNCTION:NAME
76 19129 txg_sync_thread:txg-syncing 1025MB of 16384MB used
103 19129 txg_sync_thread:txg-syncing 1025MB of 16384MB used
120 19129 txg_sync_thread:txg-syncing 1025MB of 16384MB used
86 19129 txg_sync_thread:txg-syncing 1025MB of 16384MB used
110 19129 txg_sync_thread:txg-syncing 1025MB of 16384MB used
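For reference, a minimal sketch of what dirty.d reports (the full script is in the blog post linked above; this assumes the txg-syncing SDT probe passes a dsl_pool_t * as arg0):

txg-syncing
{
    /* Print dirty bytes vs. zfs_dirty_data_max as each txg starts syncing */
    this->dp = (dsl_pool_t *)arg0;
    printf("%4lluMB of %4lluMB used\n",
        this->dp->dp_dirty_total / 1024 / 1024,
        `zfs_dirty_data_max / 1024 / 1024);
}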
On Madrid, after making the same adjustment (setting zfs_dirty_data_max to 16 GiB), I see no change in performance – it's still about 50 MiB/s running the fio load in an unencrypted dataset. The spa_sync interval is also the same as before, other than what looks like a longer delay before the first spa_sync:
Running dirty.d, I see that it levels out at a smaller value (which isn't surprising, given that we're finishing transaction groups more frequently):
CPU ID FUNCTION:NAME
86 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
1 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
43 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
18 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
19 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
18 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
18 14915 txg_sync_thread:txg-syncing 256MB of 16384MB used
It's not clear to me why Madrid is choosing to sync transaction groups more quickly. I don't see anyone calling fsync every 400-ish milliseconds (checked with dtrace -n 'fbt::zfs_sync:entry'), and we're well below 60% of the dirty data limit.
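A hypothetical sketch of that kind of check (not the exact one-liner used above): print a timestamped line per zfs_sync() call and look for a ~400 ms period.

fbt::zfs_sync:entry
{
    printf("%Y zfs_sync called\n", walltimestamp);
}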
We've arrived at a quasi-root cause: the ARC's target size on slow machines is tiny, causing cache evictions, frequent txg syncs, and general badness.
How we got there: Madrid is syncing transaction groups because dmu_tx_try_assign is failing with ERESTART. Tracking this through the ZFS source, the ERESTART is coming from dsl_dir_tempreserve_space -> arc_tempreserve_space -> this chunk of code:
/*
 * Throttle writes when the amount of dirty data in the cache
 * gets too large. We try to keep the cache less than half full
 * of dirty blocks so that our sync times don't grow too large.
 *
 * In the case of one pool being built on another pool, we want
 * to make sure we don't end up throttling the lower (backing)
 * pool when the upper pool is the majority contributor to dirty
 * data. To insure we make forward progress during throttling, we
 * also check the current pool's net dirty data and only throttle
 * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
 * data in the cache.
 *
 * Note: if two requests come in concurrently, we might let them
 * both succeed, when one of them should fail. Not a huge deal.
 */
uint64_t total_dirty = reserve + arc_tempreserve + anon_size;
uint64_t spa_dirty_anon = spa_dirty_data(spa);
if (total_dirty > arc_c * zfs_arc_dirty_limit_percent / 100 &&
    anon_size > arc_c * zfs_arc_anon_limit_percent / 100 &&
    spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {
        uint64_t meta_esize =
            zfs_refcount_count(
            &arc_anon->arcs_esize[ARC_BUFC_METADATA]);
        uint64_t data_esize =
            zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);
        dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
            "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
            arc_tempreserve >> 10, meta_esize >> 10,
            data_esize >> 10, reserve >> 10, arc_c >> 10);
        return (SET_ERROR(ERESTART));
}
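A hypothetical sketch (not from the original investigation) of how this could be confirmed at runtime: aggregate the nonzero return values of arc_tempreserve_space() and the stacks that reach it.

fbt::arc_tempreserve_space:return
/arg1 != 0/
{
    /* arg1 is the return value; ERESTART shows up here when throttled */
    @rvals[arg1] = count();
    @callers[stack()] = count();
}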
Looking at the ARC stats on London (kstat -m zfs -n arcstats), the issue is that the ARC has a target size (c) of 1 GiB, and an actual size (size) of 50 GiB:
BRM42220034 # kstat -m zfs -n arcstats
module: zfs instance: 0
name: arcstats class: misc
access_skip 47
anon_evictable_data 0
anon_evictable_metadata 0
anon_size 3166208
arc_meta_limit 237425426432
arc_meta_max 2249081056
arc_meta_min 536870912
arc_meta_used 1795827616
async_upgrade_sync 39
c 1073741824
c_max 949701705728
c_min 1073741824
compressed_size 26597786624
crtime 70.394662000
data_size 52631310848
deleted 967716
demand_data_hits 125562715
demand_data_misses 12808504
demand_hit_predictive_prefetch 17427
demand_hit_prescient_prefetch 0
demand_metadata_hits 1138481485
demand_metadata_misses 42599983
evict_l2_cached 0
evict_l2_eligible 482089895424
evict_l2_eligible_mfu 296411142144
evict_l2_eligible_mru 185678753280
evict_l2_ineligible 766464
evict_l2_skip 0
evict_not_enough 107171
evict_skip 9692
hash_chain_max 3
hash_chains 230
hash_collisions 139875
hash_elements 254849
hash_elements_max 414225
hdr_size 96791312
hits 1264046167
... l2 details elided, since they're all zero
memory_throttle_count 0
metadata_size 1160457216
mfu_evictable_data 0
mfu_evictable_metadata 0
mfu_ghost_evictable_data 816718848
mfu_ghost_evictable_metadata 74317312
mfu_ghost_hits 141
mfu_ghost_size 891036160
mfu_hits 1167362146
mfu_size 2177381376
misses 55450820
mru_evictable_data 0
mru_evictable_metadata 0
mru_ghost_evictable_data 0
mru_ghost_evictable_metadata 0
mru_ghost_hits 5
mru_ghost_size 0
mru_hits 96682060
mru_size 51611220480
mutex_miss 11799
other_size 419793808
overhead_size 27193981440
p 617493314
prefetch_data_hits 0
prefetch_data_misses 25675
prefetch_metadata_hits 1967
prefetch_metadata_misses 16658
size 54427138464
snaptime 694642.204181386
uncompressed_size 27195558912
On C12, meanwhile, the ARC has a target size of 884 GiB (lol) and an actual size of 2 GiB (makes sense, because we're not doing much with it).
::memstat on London reveals that we've still got 100 GiB of RAM free:
BRM42220034 # mdb -ke ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 11583280 45247 4%
Boot pages 13 0 0%
ZFS File Data 12972040 50672 5%
VMM Memory 212228096 829016 80%
Anon 1515661 5920 1%
Exec and libs 127864 499 0%
Page cache 95109 371 0%
Free (cachelist) 439759 1717 0%
Free (freelist) 26321505 102818 10%
Total 265283327 1036262
Physical 265283325 1036262
It's not clear whether the ARC is behaving as intended here, but it seems to be the root cause of our slowdown compared to C12.
Meanwhile, on C12, after we set the VMM reservoir limits, we still see an ARC target size of 884 GiB, which may also be a bug (or the same bug, with a different manifestation?):
BRM42220012 # mdb -ke ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 10352847 40440 4%
Boot pages 13 0 0%
ZFS File Data 1008656 3940 0%
Anon 32328 126 0%
Exec and libs 7818 30 0%
Page cache 4306 16 0%
Free (cachelist) 7545 29 0%
Free (freelist) 253868278 991672 96%
Total 265281791 1036256
Physical 265281789 1036256
BRM42220012 # /usr/lib/rsrvrctl -q
Free KiB: 0
Allocated KiB: 0
Transient Allocated KiB: 0
Size limit KiB: 1011881204
BRM42220012 # /usr/lib/rsrvrctl -s 829016
BRM42220012 # mdb -ke ::memstat
Page Summary Pages MB %Tot
------------ ---------------- ---------------- ----
Kernel 10352964 40441 4%
Boot pages 13 0 0%
ZFS File Data 1008666 3940 0%
VMM Memory 212228096 829016 80%
Anon 32341 126 0%
Exec and libs 7818 30 0%
Page cache 4306 16 0%
Free (cachelist) 7555 29 0%
Free (freelist) 41640032 162656 16%
Total 265281791 1036256
Physical 265281789 1036256
BRM42220012 # kstat -m zfs -n arcstats
module: zfs instance: 0
name: arcstats class: misc
access_skip 8
anon_evictable_data 0
anon_evictable_metadata 0
anon_size 171008
arc_meta_limit 237424050176
arc_meta_max 327241928
arc_meta_min 59356012544
arc_meta_used 139695016
async_upgrade_sync 0
c 949696200704
c_max 949696200704
c_min 118712025088
compressed_size 1088078336
crtime 409.405263066
data_size 2151949312
deleted 86
demand_data_hits 1734215
demand_data_misses 160287
demand_hit_predictive_prefetch 4
demand_hit_prescient_prefetch 0
demand_metadata_hits 126546483
demand_metadata_misses 1033399
evict_l2_cached 0
evict_l2_eligible 1458688
evict_l2_eligible_mfu 0
evict_l2_eligible_mru 1458688
evict_l2_ineligible 12288
evict_l2_skip 0
evict_not_enough 0
evict_skip 27
hash_chain_max 1
hash_chains 0
hash_collisions 242
hash_elements 9692
hash_elements_max 9697
hdr_size 5906744
hits 128280698
... l2 details elided, since they're all zero
memory_throttle_count 0
metadata_size 124881408
mfu_evictable_data 0
mfu_evictable_metadata 37888
mfu_ghost_evictable_data 0
mfu_ghost_evictable_metadata 0
mfu_ghost_hits 0
mfu_ghost_size 0
mfu_hits 123513723
mfu_size 13667328
misses 1194132
mru_evictable_data 0
mru_evictable_metadata 41984
mru_ghost_evictable_data 0
mru_ghost_evictable_metadata 0
mru_ghost_hits 0
mru_ghost_size 0
mru_hits 4766975
mru_size 2262992384
mutex_miss 0
other_size 7654064
overhead_size 1188752384
p 474848100352
prefetch_data_hits 0
prefetch_data_misses 0
prefetch_metadata_hits 0
prefetch_metadata_misses 446
size 2291644328
snaptime 29515.764225355
uncompressed_size 1190047744
https://github.com/oxidecomputer/stlouis/issues/466 may also be relevant!
To close the loop on the issue I was seeing on my bench gimlet (sn05), which we determined was slightly different: in my case, I was comparing IO to a file on an encrypted dataset against IO to a file on an unencrypted one.
This issue does deserve more analysis, but because we also discovered that crucible volumes were not on encrypted datasets, I feel the urgency is somewhat lowered as far as performance and crucible. The actual region where crucible data is stored does not exhibit the same 10x slowdown I measured on sn05.
The Madrid mini-rack was conveniently power-cycled last night. When it came back up, I observed:

- Per kstat -m zfs -n arcstats, c is 949701705728 (884 GiB) and size is 14585591576 (13 GiB)
- Per mdb -ke ::memstat, we have the 80% VMM reservoir reserved
- The fio job that was slow is now fast (661 MiB/s)

This lines up with the theory that ARC pressure is causing the bad fio performance: on a "fast" machine, the whole 1 GiB file gets loaded into the ARC, while on slow machines the ARC is under high pressure.
After doing some baseline Crucible tests, I noticed that the ARC was back to clamping arc_c at 1 GiB (on both the Upstairs and Downstairs sleds). This was surprising to me, so I decided to watch the relevant kstats on the Downstairs sled:
kstat -m zfs -n arcstats -s'/^c$|^size$/' 1
The resulting graph is interesting:
arc_c grows along with size to a point, then the two diverge and arc_c falls to its minimum value (1 GiB) while size continues growing until about 54 GiB.
I can also reproduce this on a single sled (C12), after configuring the VMM reservoir, dump, and swap. To reproduce on a single sled, I created a 64 GiB file with dd if=/dev/urandom of=/cru1/chonk.bin bs=1G count=64 (this took a long time).
Then, doing a long-running (> 120 second) fio job on that file eventually fills up the ARC and shows the same divergence:
[global]
filename=/cru1/chonk.bin
time_based
runtime=180
numjobs=1
direct=1
[randwrite-4K]
bs=4K
rw=randwrite
(Unlike the Madrid Downstairs sled, arc_c starts at 884 GiB here, because this was a completely fresh boot.)
My naive impression is that this looks like a failure of the ARC size control loop: I would expect size to be driven down to <= arc_c, but the system remains in a state where size is > 50 GiB and arc_c is 1 GiB.
My expectations could be wrong, so I'm going to be looking at the ARC source code next.
One more small update from the coal face: the ARC is getting stuck at 50 GiB because there's apparently an eviction moratorium in effect.
BRM42220012 # kstat -m zfs -n arcstats | grep evictable
anon_evictable_data 0
anon_evictable_metadata 0
mfu_evictable_data 0
mfu_evictable_metadata 0
mfu_ghost_evictable_data 0
mfu_ghost_evictable_metadata 0
mru_evictable_data 0
mru_evictable_metadata 0
mru_ghost_evictable_data 0
mru_ghost_evictable_metadata 0
I'm not sure why the ARC is refusing to evict 50-some GiB of data; as far as I can tell, this means that someone is keeping a reference to the buffers.
Quoth arc.c:
At any given moment, some subset of the blocks in the cache are un-evictable because we have handed out a reference to them. Blocks are only evictable when there are no external references active. This makes eviction far more problematic: we choose to evict the evictable blocks that are the "lowest" in the list.
There are times when it is not possible to evict the requested space. In these circumstances we are unable to adjust the cache size. To prevent the cache growing unbounded at these times we implement a "cache throttle" that slows the flow of new data into the cache until we can make space available.
This matches what we're seeing: we've got unevictable blocks, our actual cache size is well above the target, and we're being throttled.
The question is why the blocks are unevictable, since I don't know who would be keeping a reference to them.
Suspicions about bad references are exacerbated by the fact that half of the ARC data is in overhead_size = 27193981440. I believe this represents uncompressed versions of ARC data (and it's about the same size as the compressed data, because random data doesn't compress).
In the source, overhead_size is described as:
This is classified as "overhead" since this data is typically short-lived and will be evicted from the arc when it becomes unreferenced
However, it appears to be far from short-lived: those 25 GiB of overhead are sticking around indefinitely!
Booting into a debug kernel and tweaking some knobs (see https://github.com/oxidecomputer/stlouis/issues/528 for why we have to disable prefetch)
mdb -kwe zfs_prefetch_disable/z1
mdb -kwe reference_tracking_enable/z1
We can now see reference count metadata in the Bad State (triggered by the same fio run as before):
> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t
{
ref_link = {
list_next = 0xfffffd087edae958
list_prev = 0xfffffd087edae958
}
ref_holder = 0xfffffd087edadd18
ref_number = 0x1
ref_removed = 0xbaddcafebaddcafe
}
{
ref_link = {
list_next = 0xfffffd082d97e330
list_prev = 0xfffffd082d97e330
}
ref_holder = 0xfffffd082d979d70
ref_number = 0x1
ref_removed = 0xbaddcafebaddcafe
}
{
ref_link = {
list_next = 0xfffffd081dbedbf0
list_prev = 0xfffffd081dbedbf0
}
ref_holder = 0xfffffd081dbef3e8
ref_number = 0x1
ref_removed = 0xbaddcafebaddcafe
}
{
ref_link = {
list_next = 0xfffffd07bd516328
list_prev = 0xfffffd07bd516328
>> More [<space>, <cr>, q, n, c, a] ?
It's not immediately obvious who the reference holders are, though: it's a void*, and various things can go in there depending on the caller.
> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t ref_holder ! sort
...lots of stuff
ref_holder = 0xfffffd1782034e48
ref_holder = 0xfffffd1782039170
ref_holder = 0xfffffd1782039400
ref_holder = 0xfffffd17820397d8
ref_holder = 0xfffffd1782039a68
ref_holder = 0xfffffd1782039cf8
ref_holder = 0xfffffd178203c020
ref_holder = 0xfffffd178203c3f8
ref_holder = 0xfffffd178203c688
ref_holder = 0xfffffd178203c918
ref_holder = 0xfffffd178203ce38
ref_holder = 0xfffffd1782042160
ref_holder = 0xfffffd17820423f0
ref_holder = 0xfffffd17820427c8
ref_holder = 0xfffffd1782042a58
ref_holder = 0xfffffd1782042ce8
ref_holder = 0xfffffd1782049010
ref_holder = 0xfffffd17820492a0
ref_holder = 0xfffffd1782049530
ref_holder = 0xfffffd1782049678
ref_holder = 0xfffffd1782f242d8
ref_holder = 0xfffffd1784fb1418
ref_holder = 0xfffffd17856b9ac8
ref_holder = 0xfffffd17856bd310
ref_holder = 0xfffffd17856bd5a0
ref_holder = 0xfffffd17856bdc08
Interestingly, the most common gap size between ref_holder addresses is 328 bytes. I tried to find a struct that matched, but didn't have any luck.
Piping the ref_holder to ::whatis allows us to determine the cache that it's in -- and because the reference tracking necessitates a DEBUG kernel anyway, we also get full stack backtraces. Running this against a bench Gimlet that has run the prescribed fio run is revealing:
> ::walk arc_buf_t | ::print arc_buf_t b_hdr | ::print -a arc_buf_hdr_t b_l1hdr.b_refcnt.rc_list | ::walk list | ::print reference_t ref_holder | ::whatis
...
fffffd0577ffe668 is allocated from dmu_buf_impl_t:
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
fffffd0571290a60 fffffd0577ffe668 3821c979ab3 fffffd04678b20e0
fffffd03157ba008 fffffcfc7d099b00 0
kmem_cache_alloc_debug+0x37c
kmem_cache_alloc+0x253
dbuf_create+0x6b
dbuf_hold_impl+0x461
dbuf_hold_level+0x3e
dmu_tx_check_ioerr+0x56
dmu_tx_count_write+0x19c
dmu_tx_hold_write+0x98
zfs_write+0x4d0
fop_write+0x5d
pwrite+0x172
fffffd0577ffe7b0 is allocated from dmu_buf_impl_t:
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
fffffd0571290988 fffffd0577ffe7b0 3821c8dffc4 fffffd04678b20e0
fffffd03157ba008 fffffcfc7d091b80 0
kmem_cache_alloc_debug+0x37c
kmem_cache_alloc+0x253
dbuf_create+0x6b
dbuf_hold_impl+0x461
dbuf_hold_level+0x3e
dmu_tx_check_ioerr+0x56
dmu_tx_count_write+0x19c
dmu_tx_hold_write+0x98
zfs_write+0x4d0
fop_write+0x5d
pwrite+0x172
fffffd0577ffe8f8 is allocated from dmu_buf_impl_t:
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
fffffd057128b530 fffffd0577ffe8f8 3821c838065 fffffd04678b20e0
fffffd03157ba008 fffffcfc7d089c00 0
kmem_cache_alloc_debug+0x37c
kmem_cache_alloc+0x253
dbuf_create+0x6b
dbuf_hold_impl+0x461
dbuf_hold_level+0x3e
dmu_tx_check_ioerr+0x56
dmu_tx_count_write+0x19c
dmu_tx_hold_write+0x98
zfs_write+0x4d0
fop_write+0x5d
pwrite+0x172
fffffd0577ffea40 is allocated from dmu_buf_impl_t:
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
fffffd057128b458 fffffd0577ffea40 3821c79c7fd fffffd04678b20e0
fffffd03157ba008 fffffcfc7d081c80 0
kmem_cache_alloc_debug+0x37c
kmem_cache_alloc+0x253
dbuf_create+0x6b
dbuf_hold_impl+0x461
dbuf_hold_level+0x3e
dmu_tx_check_ioerr+0x56
dmu_tx_count_write+0x19c
dmu_tx_hold_write+0x98
zfs_write+0x4d0
fop_write+0x5d
pwrite+0x172
So why are these dbufs still held? The answer is that there is a cache that effectively fronts the ARC -- the dbuf cache -- and no dbuf that is in the (LRU) dbuf cache can be evicted from the ARC. From dbuf.c:
/*
* The LRU dbuf cache uses a three-stage eviction policy:
* - A low water marker designates when the dbuf eviction thread
* should stop evicting from the dbuf cache.
* - When we reach the maximum size (aka mid water mark), we
* signal the eviction thread to run.
* - The high water mark indicates when the eviction thread
* is unable to keep up with the incoming load and eviction must
* happen in the context of the calling thread.
*
* The dbuf cache:
*                                                (max size)
*                                     low water   mid water    hi water
* +----------------------------------------+----------+----------+
* |                                        |          |          |
* |                                        |          |          |
* |                                        |          |          |
* |                                        |          |          |
* +----------------------------------------+----------+----------+
*                                       stop        signal     evict
*                                       evicting    eviction   directly
*                                                   thread
*
* The high and low water marks indicate the operating range for the eviction
* thread. The low water mark is, by default, 90% of the total size of the
* cache and the high water mark is at 110% (both of these percentages can be
* changed by setting dbuf_cache_lowater_pct and dbuf_cache_hiwater_pct,
* respectively). The eviction thread will try to ensure that the cache remains
* within this range by waking up every second and checking if the cache is
* above the low water mark. The thread can also be woken up by callers adding
* elements into the cache if the cache is larger than the mid water (i.e max
* cache size). Once the eviction thread is woken up and eviction is required,
* it will continue evicting buffers until it's able to reduce the cache size
* to the low water mark. If the cache size continues to grow and hits the high
* water mark, then callers adding elements to the cache will begin to evict
* directly from the cache until the cache is no longer above the high water
* mark.
*/
There are several problems here, but one clear one: the dbuf cache sizes itself based (effectively) on physmem -- and doesn't adjust its high- and low-water marks to memory or ARC pressure. This is particularly acute with the reservoir, which (in our case) takes 80% of physical memory. Our 1 TB of physical memory leaves us with an arc_c_max of 884 GB; the dbuf cache sizes itself at 1/32nd of this -- or ~27 GB. An interesting experiment (and one that is safe) would be to take a sled that is degraded and lower its dbuf_cache_max_bytes to (say) 5 GB to see if that results in some ARC release and expansion. More to come!
Further investigation reveals that while the dbuf cache sizing is our most immediate issue (and the cause of the unevictable data), another problem -- and the reason for the low c in the first place -- is that swapfs_minfree is absolutely loco (128 GiB!), resulting in the system perceiving pressure that isn't, in fact, there. This is a bug; I have filed https://www.illumos.org/issues/16202. It still feels like the dbuf clogging needs to be addressed (along with one other grievous misuse of memory that I discovered along the way); I'll get illumos bugs filed on both.
Closing the loop here, I adjusted
mdb -kwe 'swapfs_minfree/z0x20000' # 512 MiB
mdb -kwe 'dbuf_cache_max_bytes/z0x140000000' # 5 GiB
and re-ran the fio job on the large (64 GiB) file, twice.

arc_size grows as you'd expect, and arc_c remains up at 884 GiB (way off the top of the graph):
The ARC shows plenty of evictable data:
BRM42220012 # kstat -m zfs -n arcstats | grep evicta
anon_evictable_data 0
anon_evictable_metadata 0
mfu_evictable_data 235405312
mfu_evictable_metadata 44544
mfu_ghost_evictable_data 20354696192
mfu_ghost_evictable_metadata 5705728
mru_evictable_data 89509593088
mru_evictable_metadata 57336320
mru_ghost_evictable_data 2082471936
mru_ghost_evictable_metadata 288724480
Hopping over and running the original fio job (on a 1 GiB file), I see good performance of 633 MiB/s.
I think this resolves the filesystem-level performance mystery, although there are still questions remaining about networked performance; we can defer those to a separate ticket.
The illumos fixes are now shipped, so the mystery is resolved!
Last week, James commented that he wasn't seeing the same performance improvements (e.g. for #1089) on his home cluster (the "oxide-ca region").
I tested on the Madrid minirack, and confirmed that performance wasn't nearly as good (10-ish MiB/s for 4K random writes, 400-500 MiB/s for 1M and 4M random writes). Some of this is inevitable due to network latency, but I discovered that even basic fio (on a single machine) shows 10x performance differences between fast and slow machines.

(As such, this issue isn't strictly Crucible's fault, but this repo seems like a fine place for now; we can always relocate the issue later.)
For these benchmarks, we write a 1 GiB file and run the following job using fio (version 3.34):

We've since done experiments on a range of machines and datasets. Machines are listed below:
Machines fall into "fast" or "slow" buckets, where fast machines see > 600 MiB/s, and slow machines see 50-ish MiB/s performance.
Here's a quick summary of performance:
Details on testing
To test unencrypted performance, we write the 1 GiB file to the root directory of a zpool associated with one of the U.2 drives. On Madrid, this would be (for example) /oxp_1310e803-393e-4183-a075-8cad94db0a54/raw.bin.

The machines running the control plane stack also include encrypted datasets within that zpool. For example, on Madrid:
To test encrypted performance on Madrid, we'd write raw.bin to somewhere in the encrypted dataset, e.g. /pool/ext/1310e803-393e-4183-a075-8cad94db0a54/crypt/raw.bin.

It's not (just) encryption
On Alan's bench Gimlet (sn05), we see fast performance for files in the root dataset and slow performance for files in the encrypted dataset. This suggests that encryption is at fault. However, it can't be the root cause: Madrid is slow for both unencrypted and encrypted datasets, and C12 is fast for both.
Testing on C12 requires making our own encrypted dataset, since it's not running the control plane stack:
At this point, running the fio script on /cru1/crypt/raw.bin shows fast performance, despite the dataset being encrypted.

Still, the behavior on Alan's Gimlet is certainly suspicious...
It's not ashift

Testing with ashift=9 versus ashift=12 doesn't seem to make a difference.

It's not host OS version
C12 was running an older build (helios-2.0.22117), but updating to the same build as Madrid (helios-2.0.22356) didn't change anything; it was still fast. James did an audit of changes between those revisions, and didn't see anything suspicious.
It's not raw disk IO speed
We investigated raw disk IO speeds using DTrace (the original script isn't reproduced here; a sketch of that kind of measurement follows):
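A hypothetical sketch: quantize per-IO latency by device using the io provider.

io:::start
{
    ts[arg0] = timestamp;
}

io:::done
/ts[arg0]/
{
    @lat[args[1]->dev_statname] = quantize(timestamp - ts[arg0]);
    ts[arg0] = 0;
}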
Here's a comparison between C12 and Madrid:
There's nothing here that would explain a 10x performance difference.
Disk IO size varies
Alan captured histograms of raw disk IO size, comparing slow versus fast on his bench Gimlet:
Notice that the slow system is doing many more small disk IOs
ZIO size varies
We also see size variations at the zio_write level of the stack. The fast system is doing more record-size writes (128 KiB, 2**17) than the slow system. This was captured with a DTrace script that aggregates zio_write sizes (the original script isn't reproduced here; see the sketch below).

Of course, the slow system is doing fewer writes overall. However, we also see that there are more small calls to zio_write in the slow system, despite fewer writes overall:

(this is the same data with a log-scale Y axis)
Combining this with Alan's graphs above, it seems like the fast system is doing a better job of aggregating record-size 128 KiB writes into 1 MiB physical writes to the disk, and can therefore do more of them(?).
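A sketch of the kind of script used to capture zio_write sizes (treating arg5 as the logical write size assumes the illumos zio_write() signature):

fbt::zio_write:entry
{
    /* arg5 is lsize, the logical size of the write being issued */
    @sizes = quantize(arg5);
}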