admnd opened this issue 1 year ago
I tried to reproduce this on a quite different system, and could not, so probably need more information. Here's what I did.
Installed auditd, tried to match your config like so:
# auditctl -l
-a never,exit -F arch=b64 -S read,write,open,close,stat,fstat,mmap,munmap,brk,rt_sigaction,dup2,nanosleep,fcntl
# auditctl -e 1
enabled 1
failure 1
pid 223878
rate_limit 0
backlog_limit 8192
lost 0
backlog 0
backlog_wait_time 60000
backlog_wait_time_actual 0
loginuid_immutable 0 unlocked
Pool is a single vdev on a SATA SSD on the end of a USB-SATA bridge:
# lsscsi -l 1:0:0:0
[1:0:0:0] disk SanDisk SDSSDX480GG25 /dev/sdb
# zpool create -f -O atime=off sandisk /dev/sdb
Building Samba:
# cd /sandisk
# tar xfz ~/samba-4.18.1.tar.gz
# cd samba-4.18.1
# ./bootstrap/generated-dists/debian11/bootstrap.sh
# ./configure
# make -j5
So it's possible this hardware just isn't generating enough load (only four cores, slow disk) and so the lock is never contended. But also, I don't know auditd well; maybe I didn't even turn it on right. Might be a difference in the kernel. Lots of other variables.
So, looking at the traces, we can see that we're spending a lot of time trying to get extended attributes (__vfs_getxattr), and from there, directory entries (zfs_get_xattrdir and zfs_dirent_lock). And we're coming through zpl_prune_sb a lot, which seems to be a helper to reclaim unused inodes and directory entries when the ARC is under pressure for metadata space.
Given all that, the "smell" to me seems like xattr=on is set.

I don't know Gentoo, so I'm not sure if this stage3/stage4 stuff is some sort of cgroup or otherwise constrained environment. If so, it might just need more memory.

Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.
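For example, against the pool from the reproduction attempt above (the dataset name is just the one used there), the change would be:
# zfs get xattr sandisk
# zfs set xattr=sa sandisk
Note that this only affects newly created xattrs; existing ones keep their current on-disk layout until they are rewritten.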
If I recall correctly, Linux auditing will do an fsync on every record write, which is impractically slow. I used to have a patch that mitigated this by batching, but I do not know where it is (although drobbins likely has it). A simple workaround, perhaps not acceptable in security circles since it does not persist the audit logs through kernel panics and power cycles, would be to have the Linux auditing daemon write to a ZFS dataset that has sync=disabled set.
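A rough sketch of that workaround (the pool and dataset names here are placeholders):
# zfs create -o sync=disabled tank/auditlog
# zfs get sync tank/auditlog
with log_file in /etc/audit/auditd.conf then pointed at something like /tank/auditlog/audit.log before restarting auditd.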
The high time spent in osq_lock is due to lock contention; the actual problem is whatever is causing the lock contention. I would have been inclined to say that the auditing is the cause, but reading that auditing incurs a larger performance hit on ZFS than on XFS makes me think otherwise. Nothing stands out to me as the cause for that. The only possibility that occurs to me is that some part of the build process is using a less efficient set of syscalls on ZFS than on XFS. Using a tracing utility to get the top 10 syscalls used during builds on ZFS and on XFS for comparison might lead to some insight that could be used to find the cause.
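As a sketch of that comparison, using the build tree from the reproduction attempt above (strace adds heavy overhead of its own, so treat the counts as relative rather than as timings):
# cd /sandisk/samba-4.18.1
# strace -f -c -o /tmp/syscalls-zfs.txt make -j5
Repeating the same run with the tree on XFS and diffing the two summaries would show whether the syscall mix really differs.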
> I don't know Gentoo, so I'm not sure if this stage3/stage4 stuff is some sort of cgroup or otherwise constrained environment. If so, it might just need more memory.
Stage 3 is a minimal tarball used for installing Gentoo. Think of it as Gentoo's equivalent of Debian's debootstrap: it puts a minimal set of files in place needed to have a working OS, although some machines need additional things (e.g. ZFS) and, unlike debootstrap, the kernel is installed separately from it. Stage 4 is a stage 3 with some additional things added.
> Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.
I would also suggest atime=off like you did in your example.
That said, I would not think this would make a difference. The issue is from contention caused by auditing.
Actually, I have an idea for a possible cause. fsync operations are slower on ZFS than on XFS by default. If the audit log is being stored on ZFS, the difference in fsync speed could explain the gap, since auditing will block every audited syscall on an fsync operation.
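A quick way to compare raw fsync latency on the two filesystems, assuming fio is installed (the directory is a placeholder; run once on the ZFS dataset and once on the XFS mount):
# fio --name=fsync-test --directory=/sandisk --rw=write --bs=4k --size=64m --fsync=1
The sync latency percentiles fio reports should make the gap obvious if this theory is right.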
I imagine it might be possible for us to make fsync faster, but it would be a non-trivial change, so I would not expect speed improvements there any time soon. The main possibilities that occur to me are exploratory in nature:
The zfs_commit_timeout_pct module parameter might also be slowing down fsync.
The idea behind it is to wait a little bit to enable aggregation that allows multithreaded synchronous workloads to have higher throughput, but it comes at the expense of latency and is likely harmful to auditing performance given that Linux’s single threaded auditing daemon will never have two fsync operations in flight at the same time.
Unfortunately, we enforce a minimum setting of 1 when we really should support setting 0 to allow people to disable the feature in situations where it is harmful:
https://github.com/openzfs/zfs/blob/65d10bd87c408bfa13fa27bb6ad3ecc0e2e3775b/module/zfs/zil.c#L3036
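Assuming the tunable in question is zfs_commit_timeout_pct (the one clamped to a minimum of 1 at the zil.c line above), it can be inspected and lowered at runtime through the module parameter interface:
# cat /sys/module/zfs/parameters/zfs_commit_timeout_pct
# echo 1 > /sys/module/zfs/parameters/zfs_commit_timeout_pct
Anything below 1 is clamped back up, which is exactly the limitation being discussed.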
I am away at a conference, but if anyone particularly wants a patch to allow setting 0 to disable this, I probably could find time to write it tomorrow.
Thinking about it some more, maybe we should have a logbias=low-latency setting that would allow this to be disabled at the dataset level. An even better option might be to devise a way to auto-detect when a workload is entirely single-threaded and turn the delay off for a while in that case.
> Actually, I have an idea for a possible cause. fsync operations are slower on ZFS than on XFS by default. If the audit log is being stored on ZFS, the difference in fsync speed could explain the gap, since auditing will block every audited syscall on an fsync operation.
TL;DR: Your theory seems to be verified.
A bit more detail:
The contention remains high if I put the auditd logfile on a different ZFS dataset within the same pool, on a different pool, or if I add a SLOG device to the pool (for the sake of the test, I used a ramdisk rather than a real SSD).
However, putting the auditd logfile on something that is not ZFS related (e.g. a ramdisk with an ext4 filesystem), as you suggested earlier, does not trigger this lock contention issue and compilation times are back to normal, just as if auditd had no rules to process or were not launched.
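For reference, a minimal sketch of that kind of test setup, using the brd ramdisk driver (device, size and mount point are arbitrary):
# modprobe brd rd_nr=1 rd_size=1048576
# mkfs.ext4 /dev/ram0
# mount /dev/ram0 /mnt/auditlog
with log_file in /etc/audit/auditd.conf then pointed at /mnt/auditlog/audit.log.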
What is very interesting is that, in my case, the auditd logfile has only 5-6 lines of events logged. So even if auditd discards the events (and thus logs nearly nothing in its logfile), it seems to force an fsync() or do something else that brings in a high level of lock contention.
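One way to check what auditd itself is doing while almost nothing ends up in the logfile (assuming strace is available) would be:
# strace -f -tt -e trace=fsync,fdatasync,write -p $(pidof auditd)
If the fsync theory is right, sync calls should show up even though only a handful of records get written.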
Why does this particular package (Samba) trigger the issue while some others do not? Good question; this should be dug into more deeply. The traces mentioned in my original message only count occurrences at the kernel level; they do not show what is going on within auditd or GCC per se.
EDIT: Typos.
> So it's possible this hardware just isn't generating enough load (only four cores, slow disk) and so the lock is never contended. But also, I don't know auditd well; maybe I didn't even turn it on right. Might be a difference in the kernel. Lots of other variables.
Indeed: behind the pool I have 3x 2 TB NVMe modules and I use -j32 (the 7950X has 32 logical cores). I was also able to reproduce the issue with 3 iSCSI LUNs in RAIDZ over a 10 GbE network link between my workstation and the NAS. I am not sure whether the issue is visible over 2.5 GbE or 1 GbE links.
> So, looking at the traces, we can see that we're spending a lot of time trying to get extended attributes (__vfs_getxattr), and from there, directory entries (zfs_get_xattrdir and zfs_dirent_lock). And we're coming through zpl_prune_sb a lot, which seems to be a helper to reclaim unused inodes and directory entries when the ARC is under pressure for metadata space.
This was my first guess (the maximum ARC size here is unlimited and it can grow). However, revisiting this with a pre-allocated 16 GB ARC rather than the default (8 GB) changes absolutely nothing; the contention is still there :(
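For context, an ARC cap like that is typically applied via the zfs_arc_max and zfs_arc_min module parameters (values in bytes); a sketch for 16 GB would be:
# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_min
The 16G target shows up in the c column of the arcstat output below: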
arcstat -a 1
time hits miss read hit% miss% dhit dmis dh% dm% phit pmis ph% pm% mhit mmis mread mh% mm% arcsz size c mfu mru mfug mrug eskip el2skip el2cach el2el el2mfu el2mru el2inel mtxmis dread pread grow need free avail waste
12:53:43 25 0 25 100 0 25 0 100 0 0 0 0 0 0 0 0 0 0 2.3G 2.3G 16G 15 10 0 0 0 0 0 0 0 0 0 0 25 0 1 0 118G 114G 24M
12:53:44 138 0 138 100 0 138 0 100 0 0 0 0 0 136 0 136 100 0 2.3G 2.3G 16G 138 0 0 0 0 0 0 0 0 0 0 0 138 0 1 0 118G 114G 24M
12:53:45 16K 7 16K 99 0 16K 7 99 0 0 0 0 0 5.7K 7 5.7K 99 0 2.3G 2.3G 16G 16K 681 0 0 0 0 0 0 0 0 0 0 16K 0 1 0 118G 114G 24M
12:53:46 19K 3 19K 99 0 19K 3 99 0 0 0 0 0 4.4K 3 4.4K 99 0 2.3G 2.3G 16G 18K 84 0 0 0 0 0 0 0 0 0 0 19K 0 1 0 118G 114G 24M
12:53:47 20K 2 20K 99 0 20K 2 99 0 0 0 0 0 5.5K 2 5.6K 99 0 2.3G 2.3G 16G 19K 561 0 0 0 0 0 0 0 0 0 0 20K 0 1 0 118G 114G 24M
12:53:48 18K 2 18K 99 0 18K 2 99 0 0 0 0 0 4.1K 2 4.1K 99 0 2.3G 2.3G 16G 18K 31 0 0 0 0 0 0 0 0 0 0 18K 0 1 0 118G 114G 24M
12:53:49 16K 3 16K 99 0 16K 3 99 0 0 0 0 0 5.2K 3 5.2K 99 0 2.3G 2.3G 16G 16K 323 0 0 0 0 0 0 0 0 0 0 16K 0 1 0 118G 113G 24M
12:53:50 721 1 722 99 0 721 1 99 0 0 0 0 0 581 1 582 99 0 2.3G 2.3G 16G 399 322 0 0 0 0 0 0 0 0 0 0 722 0 1 0 118G 113G 24M
12:53:51 14K 9 14K 99 0 14K 9 99 0 0 0 0 0 5.2K 9 5.2K 99 0 2.3G 2.3G 16G 14K 833 0 0 0 0 0 0 0 0 0 0 14K 0 1 0 117G 113G 24M
12:53:52 9.7K 4 9.8K 99 0 9.7K 4 99 0 0 0 0 0 4.6K 4 4.6K 99 0 2.4G 2.4G 16G 9.3K 421 0 0 0 0 0 0 0 0 0 0 9.8K 0 1 0 118G 113G 24M
12:53:53 6.1K 5 6.1K 99 0 6.1K 5 99 0 0 0 0 0 2.8K 5 2.8K 99 0 2.3G 2.3G 16G 5.6K 552 0 0 0 0 0 0 0 0 0 0 6.1K 0 1 0 118G 113G 24M
12:53:54 3.2K 6 3.2K 99 0 3.2K 6 99 0 0 0 0 0 2.5K 6 2.5K 99 0 2.3G 2.3G 16G 2.6K 587 0 0 0 0 0 0 0 0 0 0 3.2K 0 1 0 117G 113G 24M
12:53:55 806 0 806 100 0 806 0 100 0 0 0 0 0 21 0 21 100 0 2.3G 2.3G 16G 806 0 0 0 0 0 0 0 0 0 0 0 806 0 1 0 117G 113G 24M
12:53:56 858 0 858 100 0 858 0 100 0 0 0 0 0 126 0 126 100 0 2.3G 2.3G 16G 853 5 0 0 0 0 0 0 0 0 0 0 858 0 1 0 117G 113G 24M
12:53:57 1.0K 0 1.0K 100 0 1.0K 0 100 0 0 0 0 0 289 0 289 100 0 2.3G 2.3G 16G 953 58 0 0 0 0 0 0 0 0 0 0 1.0K 0 1 0 117G 113G 24M
12:53:58 1.0K 0 1.0K 100 0 1.0K 0 100 0 0 0 0 0 120 0 120 100 0 2.3G 2.3G 16G 988 17 0 0 0 0 0 0 0 0 0 0 1.0K 0 1 0 117G 113G 24M
12:53:59 734 0 734 100 0 734 0 100 0 0 0 0 0 0 0 0 0 0 2.3G 2.3G 16G 734 0 0 0 0 0 0 0 0 0 0 0 734 0 1 0 117G 113G 24M
12:54:00 805 0 805 100 0 805 0 100 0 0 0 0 0 24 0 24 100 0 2.3G 2.3G 16G 805 0 0 0 0 0 0 0 0 0 0 0 805 0 1 0 117G 113G 24M
12:54:01 940 0 940 100 0 940 0 100 0 0 0 0 0 161 0 161 100 0 2.3G 2.3G 16G 916 24 0 0 0 0 0 0 0 0 0 0 940 0 1 0 117G 113G 24M
12:54:02 4.5K 2 4.5K 99 0 4.5K 2 99 0 0 0 0 0 1.4K 2 1.4K 99 0 2.3G 2.3G 16G 4.0K 529 0 0 0 0 0 0 0 0 0 0 4.5K 0 1 0 117G 113G 24M
12:54:03 3.8K 0 3.8K 100 0 3.8K 0 100 0 0 0 0 0 874 0 874 100 0 2.3G 2.3G 16G 3.8K 28 0 0 0 0 0 0 0 0 0 0 3.8K 0 1 0 117G 113G 24M
12:54:04 2.6K 0 2.6K 100 0 2.6K 0 100 0 0 0 0 0 754 0 754 100 0 2.3G 2.3G 16G 2.6K 16 0 0 0 0 0 0 0 0 0 0 2.6K 0 1 0 117G 113G 24M
12:54:05 813 0 813 100 0 813 0 100 0 0 0 0 0 128 0 128 100 0 2.3G 2.3G 16G 798 15 0 0 0 0 0 0 0 0 0 0 813 0 1 0 117G 113G 24M
12:54:06 618 0 618 100 0 618 0 100 0 0 0 0 0 25 0 25 100 0 2.3G 2.3G 16G 617 1 0 0 0 0 0 0 0 0 0 0 618 0 1 0 117G 113G 24M
12:54:07 894 0 894 100 0 894 0 100 0 0 0 0 0 240 0 240 100 0 2.3G 2.3G 16G 817 77 0 0 0 0 0 0 0 0 0 0 894 0 1 0 117G 113G 24M
12:54:08 710 0 710 100 0 710 0 100 0 0 0 0 0 32 0 32 100 0 2.3G 2.3G 16G 704 6 0 0 0 0 0 0 0 0 0 0 710 0 1 0 117G 113G 24M
> Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.
I tested that also, no joy, issue still present :)
The baseline configuration for auditd is this one:
local_events = yes
write_logs = yes
log_file = /var/log/audit/audit.log
log_group = root
log_format = ENRICHED
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
(...)
Now, if I change write_logs = yes to write_logs = no and have rsyslog running, the contention appears. As soon as rsyslog is brought down, the contention is gone. I also tried flush = NONE, but with no improvement.
System information
Gentoo Linux AMD64 using Linux kernel 6.2.8 and ZFS 2.1.9 (GCC 12.2.1). The CPU is an AMD 7950X (32 logical cores).
Describe the problem you're observing
When "emerging" (i.e. compiling) some packages like net-fs/samba (4.18.0 but any 4.x version should trigger the issue) AND auditd is running, osq_lock() goes into the ceiling as shown by a perf top and gcc becomes as slow as a snail:
With this issue happening, recompiling Samba 4 takes more than 90 minutes (normal build time is around 10 minutes). If I flush the auditd ruleset or if I stop auditd, this behavior does not happen; the same goes if I put my stage 4 Gentoo environment on something that is not ZFS (e.g. an XFS filesystem).
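"Flushing the ruleset" here means deleting all loaded audit rules, e.g.:
# auditctl -D
# auditctl -l
No rules
(auditctl -l printing "No rules" confirms the ruleset is empty.)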
Some preliminary random thoughts:
The common point in the most frequent kernel call stack frames: some ZFS system calls. For those who are unfamiliar, the last one is the one that happens the most (over 2.6 million times in a couple of seconds).
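For anyone who wants to collect a similar picture, kernel stack frequencies can be sampled while the build runs with, for example, bpftrace (this is just one possible method, not necessarily the one used for the counts above):
# bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
Stopping it with Ctrl-C prints each kernel stack together with how often it was seen.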
Describe how to reproduce the problem
I did not try it (yet), but installing a common Linux distribution on a ZFS pool and recompiling Samba 4.x should trigger the issue.
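A hypothetical minimal reproducer distilled from this thread (pool name, source location and -j value are placeholders to adapt to the hardware):
# auditctl -a never,exit -F arch=b64 -S read,write,open,close,stat,fstat,mmap,munmap,brk,rt_sigaction,dup2,nanosleep,fcntl
# cd /tank && tar xf ~/samba-4.18.1.tar.gz && cd samba-4.18.1
# ./configure && make -j32
then watch perf top in another terminal for osq_lock climbing while the build runs.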