adamdmoss opened this issue 4 years ago
It might help to have a bit more information, like the contents of
grep . /sys/module/zfs/parameters/*
or at least arc_max and the output from arcstat.py.
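For completeness, something along these lines would collect that information (the arcstat script may be installed as arcstat or arcstat.py depending on how OpenZFS was packaged; that detail is an assumption on my part):
# All ZFS module parameters, including zfs_arc_max:
grep . /sys/module/zfs/parameters/*
# ARC size and hit/miss counters straight from the kernel stats:
awk '/^(size|c_max|c_min|hits|misses)[[:space:]]/' /proc/spl/kstat/zfs/arcstats
# Live ARC stats, 1-second interval, 5 samples:
arcstat 1 5   # or: arcstat.py 1 5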
Isn't this a known issue?
https://github.com/zfsonlinux/zfs/issues/7896#issuecomment-495275715
The 5.3 Linux kernel adds a new feature which allows pages to be zeroed when allocating or freeing them: init_on_alloc and init_on_free. init_on_alloc is enabled by default on the Ubuntu 18.04 HWE kernel. ZFS allocates and frees pages frequently (via the ABD structure), e.g. for every disk access. The additional overhead of zeroing these pages is significant. I measured a ~40% regression in performance of an uncached "zfs send ... >/dev/null".
This new "feature" can be disabled by setting init_on_alloc=0 in the GRUB kernel boot parameters, which undoes the performance regression.
Linux kernel commit: https://github.com/torvalds/linux/commit/6471384af2a6530696fc0203bafe4de41a23c9ef
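A minimal sketch of that workaround on a Debian/Ubuntu-style GRUB setup (the config file and update command vary by distribution; adjust accordingly):
# Edit /etc/default/grub and append init_on_alloc=0 to GRUB_CMDLINE_LINUX_DEFAULT, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash init_on_alloc=0"
sudo update-grub        # other distros: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot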
"ZFS allocates and frees pages frequently"
Would it be insane to recycle these pages rather than continually alloc-and-freeing them, at least while ABD requests are 'hot'?
I could look into this if the idea isn't struck down immediately.
We could. The challenge would be to not reintroduce the problems that ABD solved (locking down excess memory). I think it would be possible, but obviously much easier to just turn off this new kernel feature - which I will suggest to the folks at Canonical.
Sincerely sorry for stating the obvious, but there are quite substantial security benefits to init_on_alloc=1, as outlined here.
IMHO, it would be helpful to devise a workaround (or perhaps recommended ARC configuration options, pending a code fix?) that would mitigate the performance impact for those who value both ZFS and security. Can anyone chip in?
Also, I don't think the issue is Canonical-specific; recommendations (such as this) exist that suggest turning this feature on by default, for good reasons IMHO. I happen to use Ubuntu, but it would be good to hear from people using other distros about the performance impact on ZFS.
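For anyone wondering whether their distro kernel already enables this by default, a quick check might look like the following (assuming the usual /boot/config-* layout, which is distro-dependent):
# Built-in default for the running kernel:
grep CONFIG_INIT_ON_ALLOC_DEFAULT_ON /boot/config-$(uname -r)
# Explicit override on the boot command line, if any:
grep -o 'init_on_alloc=[01]' /proc/cmdline || echo 'not set; kernel default applies'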
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Is there still no workaround other than disabling it?
@behlendorf @amotin Replicating the OP's test on Linux 6.5 with current OpenZFS master branch, it appears that the difference for init_on_alloc=[0|1] is negligible. Suggest this be closed, provided there are no objections.
script:
#!/bin/bash
export TESTFILE="testfile"
export TESTSIZE="64G"
time head --bytes=${TESTSIZE} < /dev/urandom > ${TESTFILE}
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
time cp ${TESTFILE} /dev/null
time cp ${TESTFILE} /dev/null
time cp ${TESTFILE} /dev/null
init_on_alloc=0:
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=0
[ 0.197222] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=0
root@walong-test2:/mypool2/test_init_on_alloc# ./test0
real 4m10.609s
user 0m2.549s
sys 4m7.485s
real 0m13.298s
user 0m0.128s
sys 0m13.167s
real 0m10.984s
user 0m0.112s
sys 0m10.872s
real 0m10.993s
user 0m0.136s
sys 0m10.857s
init_on_alloc=1:
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=1
[ 0.197075] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=1
root@walong-test2:/mypool2/test_init_on_alloc# ./test0
real 4m12.338s
user 0m2.520s
sys 4m8.554s
real 0m13.899s
user 0m0.148s
sys 0m13.740s
real 0m10.783s
user 0m0.120s
sys 0m10.662s
real 0m11.333s
user 0m0.176s
sys 0m11.157s
@ednadolski-ix It becomes ANYTHING BUT negligible as soon as memory bandwidth reaches saturation, i.e. once application traffic reaches 10-20% of it. The test you are using is simply inadequate: a single write stream from /dev/urandom, or even an ARC read into /dev/null, cannot saturate anything; both are limited by single-core speed at best. We've disabled it in TrueNAS SCALE (https://github.com/truenas/linux/commit/d165d39524f57bef1356dc0ca8c8911209093328), and we did notice performance improvements. IMO it is a paranoid patch against brain-dead programmers unable to properly initialize memory. The only question is whether ZFS can control it somehow for its own allocations, or whether it has to be done by sane distributions.
System information
Describe the problem you're observing
Bandwidth for reads from the ARC is approximately half of the bandwidth of reads from the native Linux cache when ABD scatter is enabled. ARC read speed is fairly comparable to the native Linux cache when ABD scatter is disabled (i.e. 'legacy' linear mode).
Describe how to reproduce the problem
On a 16GB system with ~12GB free:
$ head --bytes=8G < /dev/urandom > testfile
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time cp testfile /dev/null
$ time cp testfile /dev/null
$ time cp testfile /dev/null
...
Repeat for the scatter-on, scatter-off, and non-ZFS cases for comparison. On my system, 8GB from a linear-ABD warm ARC takes 1.1 seconds, scatter-ABD takes 2.1 seconds, and 'native' takes 0.9 seconds.
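A sketch of how the scatter-on/scatter-off cases can be switched, assuming the zfs_abd_scatter_enabled module parameter that current OpenZFS exposes (only newly cached data is affected, so drop caches and re-read after changing it):
# Current setting: 1 = scatter ABDs, 0 = linear ('legacy') ABDs
cat /sys/module/zfs/parameters/zfs_abd_scatter_enabled
# Switch to linear ABDs for the scatter-off run:
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_abd_scatter_enabled
# Or persist it across module reloads (the modprobe.d path is an assumption):
echo 'options zfs zfs_abd_scatter_enabled=0' | sudo tee /etc/modprobe.d/zfs-abd.conf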
(ARC also sometimes takes an arbitrarily high number of 'cp' repeats to get warm again in the presence of a lot of [even rather cold] native cache memory in use, hence the drop_caches above. I can file that as a separate issue if desirable.)
Apologies in advance if I cause blood to boil with the slipshod nature of this benchmark; I know it's only measuring one dimension of ARC performance (no writes, no latency measurements, no ARC<->L2ARC interactions) but it's a metric I was interested in at the time.