openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ARC throughput is halved by init_on_alloc #9910

adamdmoss opened this issue 4 years ago (status: Open)

adamdmoss commented 4 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 18.04.3 LTS |
| Linux Kernel | 5.3.0-26-generic (Ubuntu HWE) |
| Architecture | x86_64 |
| ZFS Version | ZoL git master 25df8fb42ffa62e4ad3855a8cd4660eeedc80e1f |
| SPL Version | ZoL git master 25df8fb42ffa62e4ad3855a8cd4660eeedc80e1f |

Describe the problem you're observing

With ABD scatter enabled, read bandwidth from the ARC is approximately half that of reads from the native Linux page cache. With ABD scatter disabled (i.e. 'legacy' linear mode), ARC read speed is fairly comparable to the native Linux cache.

Describe how to reproduce the problem

On a 16GB system with ~12GB free:

$ head --bytes=8G < /dev/urandom > testfile
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time cp testfile /dev/null
$ time cp testfile /dev/null
$ time cp testfile /dev/null
...

Repeat for the scatter-on, scatter-off, and non-ZFS cases for comparison. On my system, reading the 8GB file from a warm linear-ABD ARC takes 1.1 seconds, from a scatter-ABD ARC 2.1 seconds, and from the native page cache 0.9 seconds.

(The ARC also sometimes takes an arbitrarily large number of 'cp' repeats to warm up again when a lot of [even rather cold] native page cache is in use, hence the drop_caches above. I can file that as a separate issue if desirable.)

Apologies in advance if I cause blood to boil with the slipshod nature of this benchmark; I know it's only measuring one dimension of ARC performance (no writes, no latency measurements, no ARC<->L2ARC interactions) but it's a metric I was interested in at the time.

h1z1 commented 4 years ago

Might help to have a bit more information like the contents of

grep . /sys/module/zfs/parameters/*

Or at least arc_max and the output from arcstat.py
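
For reference, a minimal way to pull those numbers on a ZoL box (a sketch; paths assume the standard /proc/spl kstat interface, and arcstat may still be named arcstat.py on older releases):

$ cat /sys/module/zfs/parameters/zfs_arc_max                           # 0 means the built-in default is in effect
$ awk '$1 ~ /^(size|c|c_min|c_max)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats
$ arcstat 1 5                                                          # five one-second samples of ARC hit/miss stats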

JulietDeltaGolf commented 4 years ago

Isn't this a known issue?

https://github.com/zfsonlinux/zfs/issues/7896#issuecomment-495275715

ahrens commented 4 years ago

The 5.3 Linux kernel adds a new feature that allows pages to be zeroed when they are allocated or freed: init_on_alloc and init_on_free. init_on_alloc is enabled by default on the Ubuntu 18.04 HWE kernel. ZFS allocates and frees pages frequently (via the ABD structure), e.g. for every disk access, and the additional overhead of zeroing these pages is significant. I measured a ~40% regression in the performance of an uncached "zfs send ... >/dev/null".

This new "feature" can be disabled by setting init_on_alloc=0 in the GRUB kernel boot parameters, which undoes the performance regression.
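
On an Ubuntu/GRUB system this amounts to roughly the following (a sketch; the config file and the update command differ on other distributions):

# Add init_on_alloc=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
$ sudo update-grub
$ sudo reboot
# After the reboot, confirm the override took effect:
$ grep -o 'init_on_alloc=[01]' /proc/cmdline
$ dmesg | grep 'mem auto-init'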

Linux kernel commit: https://github.com/torvalds/linux/commit/6471384af2a6530696fc0203bafe4de41a23c9ef

adamdmoss commented 4 years ago

ZFS allocates and frees pages frequently

Would it be insane to recycle these pages rather than continually allocating and freeing them, at least while ABD requests are 'hot'?

I could look into this if the idea isn't struck down immediately.

ahrens commented 4 years ago

We could. The challenge would be to not reintroduce the problems that ABD solved (locking down excess memory). I think it would be possible, but obviously much easier to just turn off this new kernel feature - which I will suggest to the folks at Canonical.

sxc731 commented 3 years ago

Sincerely sorry for stating the obvious, but there are quite substantial security benefits to init_on_alloc=1, as outlined here.

IMHO, it would be helpful to devise a workaround (or perhaps guideline ARC configuration options pending a code fix?) that would mitigate the perf impact on those who value both ZFS and security. Can anyone chip in?

Also, I don't think the issue is Canonical-specific; recommendations (such as this) exist that suggest turning this feature on by default - for good reasons IMHO. I happen to use Ubuntu, but it would be good to hear from people using other distros about the performance impact on ZFS.
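
For anyone on another distribution who wants to check before benchmarking, a quick sketch of how to see whether heap auto-init is active (the kernel config file location varies by distro):

$ grep CONFIG_INIT_ON_ALLOC /boot/config-$(uname -r)                   # DEFAULT_ON=y means enabled at build time
$ grep -o 'init_on_alloc=[01]' /proc/cmdline || echo "no override; compiled-in default applies"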

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Ramalama2 commented 1 year ago

Is there still no workaround other than disabling it?

ednadolski-ix commented 12 months ago

@behlendorf @amotin Replicating the OP's test on Linux 6.5 with the current OpenZFS master branch, the difference between init_on_alloc=0 and init_on_alloc=1 appears to be negligible. I suggest this be closed, provided there are no objections.

script:

#!/bin/bash
export TESTFILE="testfile"
export TESTSIZE="64G"
# Generate the test file, then drop the Linux page cache so the first read is cold
time head --bytes=${TESTSIZE} < /dev/urandom > ${TESTFILE}
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# The first cp pulls the data back in; the repeats should be served from the ARC
time cp ${TESTFILE} /dev/null
time cp ${TESTFILE} /dev/null
time cp ${TESTFILE} /dev/null

init_on_alloc=0:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=0
[    0.197222] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=0

root@walong-test2:/mypool2/test_init_on_alloc# ./test0

real    4m10.609s
user    0m2.549s
sys     4m7.485s

real    0m13.298s
user    0m0.128s
sys     0m13.167s

real    0m10.984s
user    0m0.112s
sys     0m10.872s

real    0m10.993s
user    0m0.136s
sys     0m10.857s

init_on_alloc=1:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=1
[    0.197075] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.5.0 root=UUID=18a1c019-31e3-433b-92bf-a5809af9cdc1 ro init_on_alloc=1

root@walong-test2:/mypool2/test_init_on_alloc# ./test0

real    4m12.338s
user    0m2.520s
sys     4m8.554s

real    0m13.899s
user    0m0.148s
sys     0m13.740s

real    0m10.783s
user    0m0.120s
sys     0m10.662s

real    0m11.333s
user    0m0.176s
sys     0m11.157s

amotin commented 12 months ago

@ednadolski-ix It becomes ANYTHING BUT negligible as soon as memory bandwidth approaches saturation, i.e. once application traffic reaches 10-20% of it. The test you are using is simply inadequate: a single write stream from /dev/urandom, or even an ARC read into /dev/null, cannot saturate anything; both are limited by single-core speed at best. We've disabled it in TrueNAS SCALE (https://github.com/truenas/linux/commit/d165d39524f57bef1356dc0ca8c8911209093328), and we did notice performance improvements. IMO it is a paranoid patch against brain-dead programmers unable to properly initialize memory. The only question is whether ZFS can control it somehow for its own allocations, or whether it has to be done by sane distributions.
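
As a rough illustration of the kind of load described above (a sketch using fio, not part of the original reports; job parameters are illustrative), several parallel sequential readers over an ARC-cached file get much closer to memory-bandwidth saturation than a single cp:

$ fio --name=arc-read --filename=testfile --size=8G --rw=read \
      --bs=1M --ioengine=psync --numjobs=8 --loops=4 --group_reporting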