Open lundman opened 7 years ago
It should be observed that a straight copy of the file is very fast:
```
# time cp /tmp/xcode.tar /Volumes/zfs
real 2m
```
and, interestingly, so is using rsync:
```
# time rsync -ar /Applications/Xcode.app/ /Volumes/zfs/Xcode.app/
real 4m
```
Creating a ZVOL formatted with hfs is also fast:
```
# zfs create -V 16G BOOM/vol
# newfs_hfs /dev/disk4
# diskutil mount /dev/disk4
# time tar xf /tmp/xcode.tar
real 5m
```
So the IO layers of ZFS (dmu, zio, vdev_disk) all appear to be quite good.
We have played with many sysctl kstat tunables:
```
kstat.zfs.darwin.tunable.zio_dva_throttle_enabled
kstat.zfs.darwin.tunable.zfs_arc_min
kstat.zfs.darwin.tunable.sync_write_max_active
kstat.zfs.darwin.tunable.async_write_max_active
kstat.zfs.darwin.tunable.sync_write_min_active
kstat.zfs.darwin.tunable.async_write_min_active
kstat.zfs.darwin.tunable.zfs_delay_min_dirty_percent
kstat.zfs.darwin.tunable.zfs_delay_scale
kstat.zfs.darwin.tunable.zfs_delay_max_ns
kstat.zfs.darwin.tunable.zfs_vdev_cache_max
kstat.zfs.darwin.tunable.async_write_min_dirty_pct
kstat.zfs.darwin.tunable.async_write_max_dirty_pct
kstat.zfs.darwin.tunable.write_gap_limit
kstat.zfs.darwin.tunable.zfs_immediate_write_sz
kstat.zfs.darwin.tunable.max_active
kstat.zfs.misc.arcstats.arc_meta_max
kstat.zfs.darwin.tunable.zfs_arc_meta_min
kstat.zfs.darwin.tunable.zfs_arc_meta_limit
```
Running a flamegraph during a slow period, using:
```
dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); }
tick-60s { exit(0); }' -o out.stacks
```
The iostat output during this period looks like:
```
BOOM  4.81G  10.1G   34   113  17.9K   505K
BOOM  4.81G  10.1G    0    31    509   135K
BOOM  4.81G  10.1G    7    90  12.4K   414K
BOOM  4.81G  10.1G    0   174      0   673K
BOOM  4.81G  10.1G   31   482   154K  5.11M
BOOM  4.81G  10.1G   30    29  79.7K   119K
BOOM  4.81G  10.1G   47   204  29.3K   871K
BOOM  4.81G  10.1G   15    64  20.9K   310K
BOOM  4.81G  10.1G   24   106  22.9K   664K
BOOM  4.81G  10.1G   25    72  38.8K   318K
BOOM  4.81G  10.1G    0   261      0   502K
BOOM  4.81G  10.1G   62   397  61.1K  1.09M
BOOM  4.81G  10.1G   43   115  26.8K   465K
BOOM  4.81G  10.1G   72   170  42.3K   776K
BOOM  4.81G  10.1G   43   123  24.4K   494K
BOOM  4.81G  10.1G   55    95  28.3K   441K
BOOM  4.81G  10.1G   19    95  16.4K   437K
BOOM  4.81G  10.1G   31   560  18.4K  1.48M
BOOM  4.81G  10.1G   46   272  26.3K   795K
BOOM  4.81G  10.1G   99   197  50.2K   851K
BOOM  4.81G  10.1G   96   145  48.7K   632K
BOOM  4.81G  10.1G   38    86  47.0K   386K
BOOM  4.81G  10.1G    8    15  19.4K  63.7K
BOOM  4.81G  10.1G   46   160  23.4K   684K
BOOM  4.81G  10.1G   20   478  12.8K  3.22M
BOOM  4.81G  10.1G  163   704  87.0K  3.06M
BOOM  4.81G  10.1G  249 1.53K   184K  6.99M
BOOM  4.81G  10.1G   37   109  21.4K   489K
BOOM  4.81G  10.1G   21    96  17.4K   441K
BOOM  4.81G  10.1G   12    92  6.29K   405K
BOOM  4.81G  10.1G    7   528  4.97K  1.63M
BOOM  4.81G  10.1G    0   600      0  2.20M
BOOM  4.84G  10.0G   10   796  13.9K  5.65M
BOOM  4.84G  10.0G   37   276  23.9K  1.30M
BOOM  4.84G  10.0G   32   169  22.4K  1.52M
BOOM  4.84G  10.0G   32    34  20.9K   163K
BOOM  4.84G  10.0G   41   294  26.4K  2.85M
BOOM  4.85G  10.0G   35   703  30.8K  12.5M
BOOM  4.85G  10.0G   57   178  46.8K  1.07M
BOOM  4.85G  10.0G   47    78  31.3K   374K
BOOM  4.85G  10.0G   55   251  42.3K  1.12M
BOOM  4.85G  10.0G   60   182  48.7K  1.48M
BOOM  4.85G  10.0G   40   130  31.8K  1.07M
BOOM  4.85G  10.0G    4   267  3.98K  2.53M
BOOM  4.86G  10.0G   18   534  10.9K  5.37M
BOOM  4.86G  10.0G   26   235  23.4K  1.27M
BOOM  4.86G  10.0G   49   258  56.7K  1.33M
BOOM  4.86G  10.0G   55   312  59.1K  1.25M
BOOM  4.86G  10.0G   60   343  65.1K  1.34M
BOOM  4.86G  10.0G   59   444  59.6K  1.39M
BOOM  4.86G  10.0G    4   808  2.98K  4.04M
BOOM  4.87G  10.0G   86   261   116K  2.97M
BOOM  4.87G  10.0G   14   152  8.95K  1002K
BOOM  4.87G  10.0G    5    34  3.48K  2.17M
BOOM  4.86G  10.0G   22   444  18.4K  22.5M
BOOM  4.84G  10.0G    9   199  9.45K  2.63M
BOOM  4.84G  10.0G   95     0  48.8K  31.8K
BOOM  4.84G  10.0G    6     5  3.98K  23.9K
BOOM  4.84G  10.0G   16     8  16.9K   103K
```
Zooming in on the right-most mountain range.
Crude hacks tried to see if we can influence the behaviour include things like:
rottegift discovered that by simply setting:
```
# sysctl -w kern.maxvnodes=10000
# zpool create ....
# time tar ...
real 9m14.098s
```
But it's not that hfs keeps a low number of vnodes; it balloons up to max in the first 30s, just like zfs. So perhaps we have some lock contention issues with the number of active znodes?
Using z_all_nodes and forcing ZFS to keep to a low number (5000) by calling vnode_recycle() made no definitive difference. Nor are we waiting in zget in the vp == NULL case unique to OS X.
It is somewhat interesting to note that if you take out the arc_is_overflowing() test in arc.c, you can finish considerably faster as well, even though the ARC is nowhere near full.
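For reference, a minimal user-space sketch of what that test checks, an approximation of the upstream OpenZFS logic with names and constants simplified:

```c
#include <assert.h>
#include <stdint.h>

#define SPA_MAXBLOCKSIZE (16ULL << 20)	/* 16 MB */

/*
 * Sketch of the arc_is_overflowing() test: the ARC is "overflowing"
 * when its current size exceeds the target size arc_c by more than an
 * overflow margin (at least one max-sized block). Writers that see
 * this condition cv_wait until the reclaim thread catches up, which
 * is the wait being discussed here.
 */
static int
arc_is_overflowing_sketch(uint64_t arc_size, uint64_t arc_c,
    unsigned overflow_shift)
{
	/* Always allow at least one block of overflow. */
	uint64_t overflow = arc_c >> overflow_shift;
	if (overflow < SPA_MAXBLOCKSIZE)
		overflow = SPA_MAXBLOCKSIZE;
	return (arc_size >= arc_c + overflow);
}
```

Short-circuiting this to false (as in the hack above) means writers never block on the reclaim thread, which explains the speedup but also removes the ARC's back-pressure.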
So @ahrens shared his dtrace trove https://github.com/ahrens/dtrace which could help with working out this issue, though it would seem illumos dtrace knows about ZFS internal structures, so we'll have to figure out how to add that to our dtrace.
A snippet of `zpool -l -q` running:
```
spl_vmem_malloc_if_no_pressure(size_t size)
{
	// The mutex serializes concurrent callers, providing time for
	// the variables in spl_vmem_xnu_useful_bytes_free() to be updated.
	mutex_enter(&vmem_xnu_alloc_lock);
	// if (spl_vmem_xnu_useful_bytes_free() > (MAX(size,16ULL*1024ULL*1024ULL))) {
	if (1) {
```
```
@@ -4638,6 +4637,7 @@ arc_adapt(int bytes, arc_state_t *state)
 boolean_t
 arc_is_overflowing(void)
 {
+	return 1==0;
 	/* Always allow at least one block of overflow */
```
```
real 9m17.638s
```
On the first one (in spl, changing the spl_vmem_xnu_useful_bytes_free() test in spl_vmem_malloc_if_no_pressure), could you please grab before and after
```
sysctl -h kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny_bytes
```
(after your hack, they should be zero, but are they big before?)
FWIW, I get slowdown where
```
kstat.spl.misc.spl_misc.vmem_conditional_allocs: 33,755
kstat.spl.misc.spl_misc.vmem_conditional_alloc_bytes: 10,894,442,496
kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny: 8
kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny_bytes: 8,388,608
```
so at least in my own testing, half your hack above would do practically nothing.
The second one is interesting.
More on IRC. :-)
Try spl @ https://github.com/rottegift/spl/commit/1c03d57ea1d6684cfc07ef301b09ca5f9c9ced51
( and I believe you're already doing zfs @ https://github.com/rottegift/zfs/commit/cf6e8de90e156536fae7863e6beed9f6bc6fd198 and zfs @ https://github.com/rottegift/zfs/commit/8daee7a8c26ce04d9523fa4ae1b225002509edb5 )
This tries to reap much less often.
In your "hacks" try instead of :
```
spl_vmem_malloc_if_no_pressure(size_t size)
{
	// The mutex serializes concurrent callers, providing time for
	// the variables in spl_vmem_xnu_useful_bytes_free() to be updated.
	mutex_enter(&vmem_xnu_alloc_lock);
	// if (spl_vmem_xnu_useful_bytes_free() > (MAX(size,16ULL*1024ULL*1024ULL))) {
	if (1) {
```
use the original if but reduce the constant there from 16ULL... to 1ULL or so, and see if that accomplishes a similar thing.
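That suggestion amounts to shrinking the free-memory headroom the allocator demands before proceeding. A hypothetical user-space sketch of the check (function and parameter names invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the headroom test in
 * spl_vmem_malloc_if_no_pressure(): only allocate when xnu reports
 * more free memory than max(size, headroom). Reducing headroom from
 * 16 MB to ~1 MB makes the allocator much less likely to refuse.
 */
static int
would_allocate(uint64_t useful_bytes_free, uint64_t size,
    uint64_t headroom)
{
	uint64_t floor = (size > headroom) ? size : headroom;
	return (useful_bytes_free > floor);
}
```

With 8 MB reported free and a 128 KB request, a 16 MB headroom denies the allocation while a 1 MB headroom permits it, which is the behavioural difference being proposed.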
For arc_is_overflowing hack try this instead:
zfs @ https://github.com/rottegift/zfs/commit/d4afba3d808c12a88a94ab64116a9aa9b04c045d
I think we're (also) getting hit by the msleep() [in include/sys/zfs_delay.h] kicking in at the end of dmu_tx_delay().
We can push up delay_min_bytes by increasing kstat.zfs.darwin.tunable.zfs_delay_min_dirty_percent (e.g. to 85 from default of 60).
We can push down the duration of the msleep by decreasing kstat.zfs.darwin.tunable.zfs_delay_scale (e.g. to 100000 from 500000).
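To see how those two tunables interact, here is a simplified sketch of the delay curve, modeled on the OpenZFS write-throttle formula (the real dmu_tx_delay() is more involved):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch of the dmu_tx_delay() throttle curve. Below
 * min_dirty_percent of dirty_max there is no delay; above it, the
 * delay grows hyperbolically as dirty data approaches dirty_max.
 * Raising zfs_delay_min_dirty_percent moves the knee of the curve to
 * the right; lowering zfs_delay_scale flattens the curve; the result
 * is capped by zfs_delay_max_ns.
 */
static uint64_t
tx_delay_ns(uint64_t dirty, uint64_t dirty_max,
    uint64_t min_dirty_percent, uint64_t delay_scale, uint64_t max_ns)
{
	uint64_t min_bytes = dirty_max * min_dirty_percent / 100;

	if (dirty <= min_bytes)
		return (0);		/* below the knee: no sleep */
	if (dirty >= dirty_max)
		return (max_ns);	/* saturated: maximum delay */

	uint64_t ns = delay_scale * (dirty - min_bytes) / (dirty_max - dirty);
	return (ns > max_ns ? max_ns : ns);
}
```

For example, at 80% dirty with the defaults (knee at 60%, scale 500000), the sketch yields a 500 us sleep per delayed transaction; moving the knee to 85% makes that same workload sleep not at all.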
Additionally, some further work on preventing arc collapse:
fc6b47a76a11eb343d18ae3e268ffa9024112275
0d49c2afcc3d713689e4e3f4afdd1755ed94cc8d
Finally,
```
time star -no-fsync -fifostats -fifo 'fs=256m' -x -f .../xcode.tar
```
vs
```
time tar -b 512 -x -f .../xcode.tar
```
is still interesting.
(And boosting kstat.zfs.darwin.tunable.zfs_dirty_data_sync also helped enormously (e.g., double it).)
We should DMU_TX_STAT_BUMP(dmu_tx_dirty_delay); in dmu_tx_delay if we end up ultimately calling the msleep(); the stat dmu_tx_delay is already used in txg_delay (odd, but oh well).
@lundman untested, but try 07be727c2662e1b35ee7876f0bcdcaaed3db3252
that should defang the cv_wait that we do when arc_is_overflowing().
Ok, we've really narrowed it down to that cv_wait.
This is hopefully a general fix for the throughput killing it can do.
(I accidentally force pushed a fixup commit that should have been a standalone commit, so there may be redundancy in this commit).
96e507501b97fcd75382be0d67cc167a237c346b
@lundman saw you checked for xnu vnode throttle, but have you also checked if it's hitting the xnu throttle for memory allocations?
Because this sounds very similar to what I observed when trying out zstd and lz4hc. Both needed large buffers (>128k) for compression, and would hit the xnu throttle within a few seconds of sustained writes due to repeated allocations, especially when working with many files >64k or a single large file with many 128k blocks. Same outcome: fast individual writes, then stalls/slowdowns after around 30 seconds of sustained writing.
The workaround was to make kmem_caches for different buffer sizes and use kmem_cache_alloc instead of kmem_alloc.
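A hypothetical sketch of that size-class routing in user space (class boundaries invented for illustration; in the kernel each class would be backed by its own kmem_cache and served by kmem_cache_alloc):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical size-class picker: instead of sending every
 * compression-buffer request to the general-purpose allocator, keep
 * one cache per power-of-two class (4 KB .. 1 MB) and route each
 * request to the smallest class that fits. Repeated large allocations
 * then recycle cached slabs rather than repeatedly hitting the
 * (throttled) general allocator.
 */
#define MIN_CLASS_SHIFT	12	/* 4 KB  */
#define MAX_CLASS_SHIFT	20	/* 1 MB  */
#define NUM_CLASSES	(MAX_CLASS_SHIFT - MIN_CLASS_SHIFT + 1)

static int
size_class_index(size_t size)
{
	size_t class_size = (size_t)1 << MIN_CLASS_SHIFT;
	for (int i = 0; i < NUM_CLASSES; i++, class_size <<= 1) {
		if (size <= class_size)
			return (i);	/* smallest class that fits */
	}
	return (-1);	/* too big: fall back to the general allocator */
}
```

The finer the classes, the less memory is wasted per buffer, at the cost of more caches to maintain; that is the granularity trade-off mentioned below for the FreeBSD approach.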
If so, a spindump for 60-120 seconds at the beginning of the tar should show it: lots of stacks with kmem_alloc -> vmem_alloc -> xnu_throttle_alloc.
So depending on which compression mode is set, it might be waiting for the compression thread. Or, with the arc changes, arc may be trying to grow and throttling on 128k or larger allocations...
P.S. Another test: try wrapping
```
# time cp /tmp/xcode.tar /Volumes/zfs
```
in a for loop and run it 20 times; it may slow down after the first few loops.
FreeBSD now has zstd and used a similar method to mine: kmem_caches for large allocations of a few sizes, though not as fine-grained as I had attempted (smaller block sizes can use much smaller buffers).
With one-size-fits-all allocations, even 4k and smaller files would quickly hit the xnu throttle. But anyway...
Btw, FreeBSD has not implemented zstd in ZFS, just the zstd command-line tools and library. But it's written so it can be used in-kernel, ready to be used by ZFS.
There is that chance, but in this case
```
*470 vmem_alloc + 173 (spl-vmem.c:1729,11 in spl + 68311) [0xffffff7f82295ad7]
*470 vmem_xalloc + 2162 (spl-vmem.c:1524,13 in spl + 64326) [0xffffff7f82294b46]
*249 xnu_alloc_throttled + 1190 (spl-vmem.c:2385,4 in spl + 78380) [0xffffff7f8229822c]
*246 delay_for_interval + 107 (kernel + 944987) [0xffffff80002e6b5b]
*237
```
it hit our own throttle, which kicks in before xnu's, so we are the boss. What was interesting about the tar problem is that the arc was depleted (so empty there was an echo), even on a tiny 2G VM, as it was freaking out and trying to constantly reap. It is better now, though not as good as hfs, mind. But if you disable ZFS sync, you get very close to hfs speeds.
@evansus I don't suppose you've had a chance to check out issue116_lite? It is very close to functional, but partitions do not show in diskutil.
We may also need to check on git speeds.
Attempting to collect all information about this problem here.

As a test, we will use a tar file of Xcode. Then, as a test, we untar that on hfs: this is our target, and we'd like to be at least this good.

Untar the same file on a ZFS dataset (default options, nothing specified on the cmdline): this is the problem. Ouch. It is interesting to note that it starts very fast, but eventually loses it, and the rest trickles in at around ~1MB/s.