openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

poor write performance from bsdtar / git #584

Open · lundman opened this issue 7 years ago

lundman commented 7 years ago

Attempting to collect all information about this problem here.

As a test, we will use a tar file of Xcode:

# time tar cf /tmp/xcode.tar /Applications/Xcode.app/
real    2m

Then, as a baseline, we untar that on hfs:

# cd /Volume/hfs
# time tar xf /tmp/xcode.tar
real    5m

This is our target. We'd like to be at least this good.

Untar the same file on a ZFS dataset (default options, nothing specified on the command line).

This is the problem:

# cd /Volume/zfs
# time tar xf /tmp/xcode.tar
real    50m

Ouch. It is interesting to note that it starts very fast, but eventually falls off:

bash-3.2# ./cmd.sh zpool iostat 1 300
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
BOOM        8.32M  14.9G      0     32  47.8K   455K
BOOM        8.32M  14.9G      0      0      0      0
BOOM        8.32M  14.9G      0    747      0  54.6M
BOOM         132M  14.7G      0  1.15K      0   111M
BOOM         211M  14.7G      0    563      0  49.4M
BOOM         274M  14.6G      0    641      0  60.0M
BOOM         360M  14.5G      2    960  84.9K  96.0M
BOOM         468M  14.4G      2    983  21.3K   101M
BOOM         551M  14.3G      0  1.03K  20.4K  99.5M
BOOM         615M  14.3G      1    614  21.0K  60.1M
BOOM         655M  14.2G      0    440    511  42.2M
BOOM         702M  14.2G      0    415      0  35.4M
BOOM         726M  14.2G      0    362    509  32.9M
BOOM         813M  14.1G      7    871  4.50K  80.0M
BOOM         874M  14.0G      0    711      0  72.3M
BOOM         898M  14.0G      0    435    510  37.6M
BOOM         921M  14.0G      0     43      0   550K
BOOM         944M  14.0G      2    375  1.50K  22.9M
BOOM         944M  14.0G    148      0   437K      0
BOOM         944M  14.0G      0      0      0      0
BOOM         944M  14.0G      0      0    511      0
BOOM         944M  14.0G      0      0      0      0
BOOM         944M  14.0G      0      0      0      0
BOOM         944M  14.0G      0     24      0  3.07M
BOOM         944M  14.0G      0     75      0  3.40M
BOOM         944M  14.0G      0    255      0  17.8M
BOOM         944M  14.0G      0    112      0  13.9M
BOOM         982M  13.9G      7    142  4.98K  1.87M
BOOM         982M  13.9G     27      0  33.5K      0
BOOM         982M  13.9G      4      0  2.49K      0
BOOM         982M  13.9G      0      0      0      0

and the rest trickles in at ~1MB/s.

lundman commented 7 years ago

It should be observed that a straight copy of the file is very fast:

# time cp /tmp/xcode.tar /Volume/zfs
real    2m

and, interestingly, so is rsync:

# time rsync -ar /Applications/Xcode.app/ /Volumes/zfs/Xcode.app/
real    4m

Creating a ZVOL formatted with hfs is also fast:

# zfs create -V 16G BOOM/vol
# newfs_hfs /dev/disk4
# diskutil mount /dev/disk4
# time tar xf /tmp/xcode.tar
real    5m

So the IO layers of ZFS (dmu, zio, vdev_disk) all appear to be quite good.

lundman commented 7 years ago

We have played with many sysctl kstat tunables:

kstat.zfs.darwin.tunable.zio_dva_throttle_enabled
kstat.zfs.darwin.tunable.zfs_arc_min 
kstat.zfs.darwin.tunable.sync_write_max_active
kstat.zfs.darwin.tunable.async_write_max_active
kstat.zfs.darwin.tunable.sync_write_min_active
kstat.zfs.darwin.tunable.async_write_min_active
kstat.zfs.darwin.tunable.zfs_delay_min_dirty_percent
kstat.zfs.darwin.tunable.zfs_delay_scale
kstat.zfs.darwin.tunable.zfs_delay_max_ns
kstat.zfs.darwin.tunable.zfs_vdev_cache_max
kstat.zfs.darwin.tunable.async_write_min_dirty_pct
kstat.zfs.darwin.tunable.async_write_max_dirty_pct
kstat.zfs.darwin.tunable.write_gap_limit
kstat.zfs.darwin.tunable.zfs_immediate_write_sz
kstat.zfs.darwin.tunable.max_active
kstat.zfs.misc.arcstats.arc_meta_max
kstat.zfs.darwin.tunable.zfs_arc_meta_min
kstat.zfs.darwin.tunable.zfs_arc_meta_limit

lundman commented 7 years ago

Running a flamegraph during a slow period, using

dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } 
  tick-60s { exit(0); }'  -o out.stacks

The iostat output during this period looks like:

zpool iostat

```
BOOM        4.81G  10.1G     34    113  17.9K   505K
BOOM        4.81G  10.1G      0     31    509   135K
BOOM        4.81G  10.1G      7     90  12.4K   414K
BOOM        4.81G  10.1G      0    174      0   673K
BOOM        4.81G  10.1G     31    482   154K  5.11M
BOOM        4.81G  10.1G     30     29  79.7K   119K
BOOM        4.81G  10.1G     47    204  29.3K   871K
BOOM        4.81G  10.1G     15     64  20.9K   310K
BOOM        4.81G  10.1G     24    106  22.9K   664K
BOOM        4.81G  10.1G     25     72  38.8K   318K
BOOM        4.81G  10.1G      0    261      0   502K
BOOM        4.81G  10.1G     62    397  61.1K  1.09M
BOOM        4.81G  10.1G     43    115  26.8K   465K
BOOM        4.81G  10.1G     72    170  42.3K   776K
BOOM        4.81G  10.1G     43    123  24.4K   494K
BOOM        4.81G  10.1G     55     95  28.3K   441K
BOOM        4.81G  10.1G     19     95  16.4K   437K
BOOM        4.81G  10.1G     31    560  18.4K  1.48M
BOOM        4.81G  10.1G     46    272  26.3K   795K
BOOM        4.81G  10.1G     99    197  50.2K   851K
BOOM        4.81G  10.1G     96    145  48.7K   632K
BOOM        4.81G  10.1G     38     86  47.0K   386K
BOOM        4.81G  10.1G      8     15  19.4K  63.7K
BOOM        4.81G  10.1G     46    160  23.4K   684K
BOOM        4.81G  10.1G     20    478  12.8K  3.22M
BOOM        4.81G  10.1G    163    704  87.0K  3.06M
BOOM        4.81G  10.1G    249  1.53K   184K  6.99M
BOOM        4.81G  10.1G     37    109  21.4K   489K
BOOM        4.81G  10.1G     21     96  17.4K   441K
BOOM        4.81G  10.1G     12     92  6.29K   405K
BOOM        4.81G  10.1G      7    528  4.97K  1.63M
BOOM        4.81G  10.1G      0    600      0  2.20M
BOOM        4.84G  10.0G     10    796  13.9K  5.65M
BOOM        4.84G  10.0G     37    276  23.9K  1.30M
BOOM        4.84G  10.0G     32    169  22.4K  1.52M
BOOM        4.84G  10.0G     32     34  20.9K   163K
BOOM        4.84G  10.0G     41    294  26.4K  2.85M
BOOM        4.85G  10.0G     35    703  30.8K  12.5M
BOOM        4.85G  10.0G     57    178  46.8K  1.07M
BOOM        4.85G  10.0G     47     78  31.3K   374K
BOOM        4.85G  10.0G     55    251  42.3K  1.12M
BOOM        4.85G  10.0G     60    182  48.7K  1.48M
BOOM        4.85G  10.0G     40    130  31.8K  1.07M
BOOM        4.85G  10.0G      4    267  3.98K  2.53M
BOOM        4.86G  10.0G     18    534  10.9K  5.37M
BOOM        4.86G  10.0G     26    235  23.4K  1.27M
BOOM        4.86G  10.0G     49    258  56.7K  1.33M
BOOM        4.86G  10.0G     55    312  59.1K  1.25M
BOOM        4.86G  10.0G     60    343  65.1K  1.34M
BOOM        4.86G  10.0G     59    444  59.6K  1.39M
BOOM        4.86G  10.0G      4    808  2.98K  4.04M
BOOM        4.87G  10.0G     86    261   116K  2.97M
BOOM        4.87G  10.0G     14    152  8.95K  1002K
BOOM        4.87G  10.0G      5     34  3.48K  2.17M
BOOM        4.86G  10.0G     22    444  18.4K  22.5M
BOOM        4.84G  10.0G      9    199  9.45K  2.63M
BOOM        4.84G  10.0G     95      0  48.8K  31.8K
BOOM        4.84G  10.0G      6      5  3.98K  23.9K
BOOM        4.84G  10.0G     16      8  16.9K   103K
```

lundman commented 7 years ago

Flamegraph slow

lundman commented 7 years ago

Zooming in on the right-most mountain range. Flamegraph slow zoomed

lundman commented 7 years ago

Crude hacks tried to see if we can influence the behaviour include things like the diffs shown a few comments below.

lundman commented 7 years ago

rottegift discovered that things improve a lot by simply setting:

# sysctl -w kern.maxvnodes=10000
# zpool create ....

# time tar ...
real    9m14.098s

But it's not that hfs keeps a low number of vnodes; it balloons up to the max in the first 30s, just like zfs. So perhaps we have some lock-contention issue with the number of active znodes?

Using z_all_znodes and forcing ZFS to keep to a low number (5000) by calling vnode_recycle() made no definitive difference. Nor are we waiting in zget in the vp==NULL case unique to OS X.
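
For reference, that experiment amounts to something like the sketch below. The zfsvfs fields (z_all_znodes, z_znodes_lock) and ZTOV()/vnode_recycle() follow the usual names, but the helper itself (zfs_trim_znodes) is hypothetical and locking is simplified; this is not the actual patch.

```
/*
 * Rough sketch of the znode-capping experiment: walk the filesystem's
 * znode list and ask XNU to recycle the vnodes beyond a small budget.
 */
static void
zfs_trim_znodes(zfsvfs_t *zfsvfs, uint64_t keep)
{
	znode_t *zp;
	uint64_t seen = 0;

	mutex_enter(&zfsvfs->z_znodes_lock);
	for (zp = list_head(&zfsvfs->z_all_znodes); zp != NULL;
	    zp = list_next(&zfsvfs->z_all_znodes, zp)) {
		if (++seen <= keep)
			continue;
		if (ZTOV(zp) != NULL)
			(void) vnode_recycle(ZTOV(zp));	/* hint: reclaim this vnode */
	}
	mutex_exit(&zfsvfs->z_znodes_lock);
}
```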

It is somewhat interesting to note that if you take out the arc_is_overflowing() test in arc.c, you finish considerably faster as well, even though the arc is nowhere near full.

lundman commented 7 years ago

So @ahrens shared his dtrace trove (https://github.com/ahrens/dtrace), which could help with working out this issue, though it would seem illumos dtrace knows about ZFS internal structures, so we'll have to figure out how to add that to our dtrace.

lundman commented 7 years ago

A snippet of zpool -l -q running:

[attached output: "sweet chili"]

[attached output: "toasty nuts"]

lundman commented 7 years ago

spl_vmem_malloc_if_no_pressure(size_t size)
{
    // The mutex serializes concurrent callers, providing time for
    // the variables in spl_vmem_xnu_useful_bytes_free() to be updated.
    mutex_enter(&vmem_xnu_alloc_lock);
//  if (spl_vmem_xnu_useful_bytes_free() > (MAX(size,16ULL*1024ULL*1024ULL))) {
    if (1) {
@@ -4638,6 +4637,7 @@ arc_adapt(int bytes, arc_state_t *state)
 boolean_t
 arc_is_overflowing(void)
 {
+       return 1==0;
        /* Always allow at least one block of overflow */

real 9m17.638s

rottegift commented 7 years ago

On the first one (in spl, changing the spl_vmem_xnu_useful_bytes_free() test in spl_vmem_malloc_if_no_pressure), could you please grab before and after:

sysctl -h kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny_bytes

(after your hack, they should be zero, but are they big before?)

fwiw I get slowdown where:

kstat.spl.misc.spl_misc.vmem_conditional_allocs: 33,755
kstat.spl.misc.spl_misc.vmem_conditional_alloc_bytes: 10,894,442,496
kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny: 8
kstat.spl.misc.spl_misc.vmem_conditional_alloc_deny_bytes: 8,388,608

so at least in my own testing, half your hack above would do practically nothing.

The second one is interesting.

More on IRC. :-)

rottegift commented 7 years ago

Try spl @ https://github.com/rottegift/spl/commit/1c03d57ea1d6684cfc07ef301b09ca5f9c9ced51

(and I believe you're already doing zfs @ https://github.com/rottegift/zfs/commit/cf6e8de90e156536fae7863e6beed9f6bc6fd198 and zfs @ https://github.com/rottegift/zfs/commit/8daee7a8c26ce04d9523fa4ae1b225002509edb5)

This tries to reap much less often.

In your "hacks" try instead of :

spl_vmem_malloc_if_no_pressure(size_t size)
{
    // The mutex serializes concurrent callers, providing time for
    // the variables in spl_vmem_xnu_useful_bytes_free() to be updated.
    mutex_enter(&vmem_xnu_alloc_lock);
//  if (spl_vmem_xnu_useful_bytes_free() > (MAX(size,16ULL*1024ULL*1024ULL))) {
    if (1) {

use the original if but reduce the constant there from 16ULL... to 1ULL or so, and see if that accomplishes a similar thing.
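
i.e., roughly this (a sketch of the suggested tweak, keeping the original free-memory test but lowering the floor to ~1 MiB):

```
	// Original test, with the 16 MiB floor reduced to 1 MiB:
	if (spl_vmem_xnu_useful_bytes_free() > (MAX(size, 1ULL*1024ULL*1024ULL))) {
```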

rottegift commented 7 years ago

For the arc_is_overflowing hack, try this instead:

zfs @ https://github.com/rottegift/zfs/commit/d4afba3d808c12a88a94ab64116a9aa9b04c045d

rottegift commented 7 years ago

I think we're (also) getting hit by the msleep() [in include/sys/zfs_delay.h] kicking in at the end of dmu_tx_delay().

We can push up delay_min_bytes by increasing kstat.zfs.darwin.tunable.zfs_delay_min_dirty_percent (e.g. to 85 from default of 60).

We can push down the duration of the msleep by decreasing kstat.zfs.darwin.tunable.zfs_delay_scale (e.g. to 100000 from 500000).
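
For illustration, the same two changes applied programmatically (equivalent to sysctl -w on the command line). This is only a sketch; set_tunable() is a hypothetical helper that probes each tunable's width at runtime, since the kstat tunables are not all the same size.

```
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/sysctl.h>

/* Hypothetical helper: write a numeric kstat tunable, preserving its width. */
static int
set_tunable(const char *name, uint64_t value)
{
	size_t len = 0;

	if (sysctlbyname(name, NULL, &len, NULL, 0) == -1)
		return (-1);
	if (len == sizeof (uint64_t))
		return (sysctlbyname(name, NULL, NULL, &value, sizeof (value)));
	if (len == sizeof (uint32_t)) {
		uint32_t v32 = (uint32_t)value;
		return (sysctlbyname(name, NULL, NULL, &v32, sizeof (v32)));
	}
	return (-1);
}

int
main(void)
{
	/* Values from the comment above (run as root). */
	if (set_tunable("kstat.zfs.darwin.tunable.zfs_delay_min_dirty_percent", 85) != 0)
		perror("zfs_delay_min_dirty_percent");
	if (set_tunable("kstat.zfs.darwin.tunable.zfs_delay_scale", 100000) != 0)
		perror("zfs_delay_scale");
	return (0);
}
```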

Additionally, some further work on preventing arc collapse:

fc6b47a76a11eb343d18ae3e268ffa9024112275 0d49c2afcc3d713689e4e3f4afdd1755ed94cc8d

Finally,

time star -no-fsync -fifostats -fifo 'fs=256m' -x -f .../xcode.tar
vs
time tar -b 512 -x -f .../xcode.tar

is still interesting.

rottegift commented 7 years ago

(And boosting kstat.zfs.darwin.tunable.zfs_dirty_data_sync also helped enormously (e.g., double it).)

rottegift commented 7 years ago

Should we add

DMU_TX_STAT_BUMP(dmu_tx_dirty_delay);

in dmu_tx_delay() if we end up ultimately calling the msleep()? The stat dmu_tx_delay is already used in txg_delay (odd, but oh well).
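
To picture where that would land, here is a rough sketch against the upstream shape of dmu_tx_delay(), not the actual O3X code; compute_wakeup_time() is a stand-in for the real dirty-data delay calculation.

```
/*
 * Sketch only: count the cases where dmu_tx_delay() actually ends up
 * sleeping, so the write stalls seen above show up in the dmu_tx kstats.
 */
static void
dmu_tx_delay(dmu_tx_t *tx, uint64_t dirty)
{
	hrtime_t wakeup = compute_wakeup_time(tx, dirty);

	if (wakeup > gethrtime()) {
		DMU_TX_STAT_BUMP(dmu_tx_dirty_delay);	/* the proposed bump */
		zfs_sleep_until(wakeup);		/* the msleep() noted above */
	}
}
```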

rottegift commented 7 years ago

@lundman untested, but try 07be727c2662e1b35ee7876f0bcdcaaed3db3252

That should defang the cv_wait that we do when arc_is_overflowing().
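
For context, the cv_wait in question is the one in the ARC allocation path; paraphrased from the arc.c of that era (simplified, not the exact code):

```
/*
 * Paraphrase of the ARC allocation path: any thread that sees the ARC
 * "overflowing" kicks the reclaim thread and then blocks until an
 * eviction pass completes.  Under the untar's sustained stream of small
 * writes, this wait is what throttles throughput.
 */
if (arc_is_overflowing()) {
	mutex_enter(&arc_reclaim_lock);
	if (arc_is_overflowing()) {
		cv_signal(&arc_reclaim_thread_cv);	/* wake the reclaim thread */
		cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);	/* block here */
	}
	mutex_exit(&arc_reclaim_lock);
}
```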

rottegift commented 7 years ago

Ok, we've really narrowed it down to that cv_wait.

This is hopefully a general fix for the throughput killing it can do.

(I accidentally force pushed a fixup commit that should have been a standalone commit, so there may be redundancy in this commit).

96e507501b97fcd75382be0d67cc167a237c346b

evansus commented 7 years ago

@lundman I saw you checked for the xnu vnode throttle, but have you also checked whether it's hitting the xnu throttle for memory allocations?

Because this sounds very similar to what I observed when trying out zstd and lz4hc. Both needed large buffers (>128k) for compression, and would hit the xnu throttle within a few seconds of sustained writes due to repeated allocations, especially when working with many files >64k or a single large file with many 128k blocks. Same outcome: fast individual writes, then stalls/slowdowns after around 30 seconds of sustained writing.

The workaround was to make kmem_caches for different buffer sizes and use kmem_cache_alloc instead of kmem_alloc.
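
Sketched below (names like scratch_cache are hypothetical, not from any actual patch): a per-size kmem cache is created once, and the compression path takes its scratch buffers from it with kmem_cache_alloc() instead of calling kmem_alloc() for every record, so repeated large allocations stop going through vmem and the xnu throttle.

```
/*
 * Hedged sketch of the described workaround, not an actual patch: one
 * kmem cache per buffer size, allocated from instead of kmem_alloc().
 */
#include <sys/kmem.h>

#define	SCRATCH_BUF_SIZE	(256 * 1024)		/* assumed worst case */

static kmem_cache_t *scratch_cache;			/* hypothetical name */

void
scratch_init(void)
{
	scratch_cache = kmem_cache_create("zfs_scratch_256k", SCRATCH_BUF_SIZE,
	    0, NULL, NULL, NULL, NULL, NULL, 0);
}

void
scratch_fini(void)
{
	kmem_cache_destroy(scratch_cache);
}

void *
scratch_get(void)
{
	/* Repeat allocations are served from the cache, not vmem_alloc(). */
	return (kmem_cache_alloc(scratch_cache, KM_SLEEP));
}

void
scratch_put(void *buf)
{
	kmem_cache_free(scratch_cache, buf);
}
```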

If so, a spindump for 60-120 seconds at the beginning of the tar should show it: lots of stacks with kmem_alloc -> vmem_alloc -> xnu_throttle_alloc.

So depending on which compression mode is set, it might be waiting for the compression thread. Or, with the arc changes, the arc may be trying to grow and throttling on 128k-or-larger allocations...

P.S. Another test: try wrapping # time cp /tmp/xcode.tar /Volume/zfs in a for loop and running it 20 times; it may slow down after the first few iterations.

evansus commented 7 years ago

FreeBSD now has zstd and used a similar method to mine: kmem_caches for large allocations of a few sizes, though not as fine-grained as I had attempted (smaller block sizes can use much smaller buffers).

evansus commented 7 years ago

https://reviews.freebsd.org/D11124

evansus commented 7 years ago

With one-size-fits-all allocations, even 4k and smaller files would quickly hit the xnu throttle. But anyway...

evansus commented 7 years ago

btw, FreeBSD has not implemented zstd in ZFS, just the zstd command-line tools and library. But it's written so it can be used in the kernel, ready to be adopted by ZFS.

lundman commented 7 years ago

There is that chance, but in this case:

                                                                   *470  vmem_alloc + 173 (spl-vmem.c:1729,11 in spl + 68311) [0xffffff7f82295ad7]
                                                                     *470  vmem_xalloc + 2162 (spl-vmem.c:1524,13 in spl + 64326) [0xffffff7f82294b46]
                                                                       *249  xnu_alloc_throttled + 1190 (spl-vmem.c:2385,4 in spl + 78380) [0xffffff7f8229822c]
                                                                         *246  delay_for_interval + 107 (kernel + 944987) [0xffffff80002e6b5b]
                                                                           *237

it hit our own throttle, which kicks in before xnu's, so we are the boss. What was interesting about the tar problem is that the arc was depleted (so empty there was an echo), even on a tiny 2G VM, as it was freaking out and trying to constantly reap. It is better now, though not as good as hfs, mind. But if you disable ZFS sync, you get very close to hfs speeds.

@evansus I don't suppose you've had a chance to check out issue116_lite? It is very close to functional, but partitions do not show up in diskutil.

We may also need to check on git speeds.