openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS "drowns" Linux kernel audit #14697

Open admnd opened 1 year ago

admnd commented 1 year ago

System information

Gentoo Linux AMD64 using Linux kernel 6.2.8 and ZFS 2.1.9 (GCC 12.2.1). The CPU is an AMD 7950X (32 logical cores).

Describe the problem you're observing

When "emerging" (i.e. compiling) some packages like net-fs/samba (4.18.0 but any 4.x version should trigger the issue) AND auditd is running, osq_lock() goes into the ceiling as shown by a perf top and gcc becomes as slow as a snail:

    76.76%  [kernel]          [k] osq_lock
     2.10%  [kernel]          [k] mutex_spin_on_owner
     1.41%  [kernel]          [k] native_sched_clock
     1.30%  [kernel]          [k] find_busiest_group
     0.74%  [kernel]          [k] psi_group_change
     0.54%  [kernel]          [k] _raw_spin_lock_irqsave
     0.52%  [kernel]          [k] __schedule
     0.51%  [kernel]          [k] update_rq_clock.part.0
     0.48%  [kernel]          [k] __update_load_avg_cfs_rq
     0.43%  [kernel]          [k] update_load_avg

With this issue happening, recompiling Samba 4 takes more than 90 minutes (the normal build time is around 10 minutes). If I flush the auditd ruleset or stop auditd, this behavior does not happen; the same is true if I put my stage 4 Gentoo environment on non-ZFS storage (e.g. an XFS filesystem).
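To rule auditd in or out between builds, I simply flush the ruleset or stop the daemon entirely (OpenRC commands here; on systemd-based distributions stopping auditd usually goes through service auditd stop):

# auditctl -D               # delete every loaded audit rule
# /etc/init.d/auditd stop   # starting it again brings the problem right back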

Some preliminary random thoughts:

# auditctl -l  (by default as provided by sys-process/audit)
-a never,exit -F arch=b32 -S read,write,open,close,brk,fcntl,dup2,mmap,munmap,stat,fstat,nanosleep,rt_sigaction
-a never,exit -F arch=b64 -S read,write,open,close,stat,fstat,mmap,munmap,brk,rt_sigaction,dup2,nanosleep,fcntl

# bpftrace -e 'kprobe:osq_lock { @[arg0] = count(); }'
Attaching 1 probe...
^C

@[-113557655720192]: 1
@[-113461265342996]: 1
@[-113473551484556]: 1
@[-113557664708716]: 1
(...)
@[-113557216933532]: 6423
@[-113557216934140]: 8925
@[-113557216933836]: 11901
@[-113557216935052]: 14951
@[-113457054216836]: 90877916  <=== OUCH!

# bpftrace -e 'kprobe:osq_lock /arg0 == -113457054216836/ { @[kstack(10)] = count() }'
Attaching 1 probe...
^C
(...)
@[
    osq_lock+1
    __mutex_lock.constprop.0+369
    zfs_dirent_lock+189
    zfs_get_xattrdir+222
    zfs_lookup+226
    zpl_prune_sb+475
    zpl_prune_sb+3105
    zpl_prune_sb+3518
    __vfs_getxattr+126
    get_vfs_caps_from_disk+96
]: 847095
@[
    osq_lock+1
    __mutex_lock.constprop.0+369
    zfs_dirent_unlock+32
    zfs_get_xattrdir+291
    zfs_lookup+226
    zpl_prune_sb+475
    zpl_prune_sb+3105
    zpl_prune_sb+3518
    __vfs_getxattr+126
    get_vfs_caps_from_disk+96
]: 982700
@[
    osq_lock+1
    __mutex_lock.constprop.0+369
    zfs_getattr_fast+104
    zio_do_crypt_abd+10778
    vfs_statx+160
    vfs_fstatat+81
    __do_sys_newfstatat+19
    do_syscall_64+58
    entry_SYSCALL_64_after_hwframe+99
]: 1005841
@[
    osq_lock+1
    __mutex_lock.constprop.0+369
    zpl_prune_sb+153
    zpl_prune_sb+3243
    zpl_prune_sb+3518
    __vfs_getxattr+126
    get_vfs_caps_from_disk+96
    audit_copy_inode+123
    filename_lookup+339
    vfs_statx+124
]: 1047785
@[
    osq_lock+1
    __mutex_lock.constprop.0+369
    __cv_broadcast+231
    zfs_dirent_lock+348
    zfs_get_xattrdir+222
    zfs_lookup+226
    zpl_prune_sb+475
    zpl_prune_sb+3105
    zpl_prune_sb+3518
    __vfs_getxattr+126
]: 26117572

The common point in the most frequent kernel call stacks: ZFS code paths. For those who are unfamiliar with bpftrace output, the last stack is the one hit most often (over 26 million times in a couple of seconds).
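A follow-up I have not run yet, but which should measure how long each osq_lock() spin lasts instead of merely counting entries (untested sketch):

# bpftrace -e 'kprobe:osq_lock { @start[tid] = nsecs; }
  kretprobe:osq_lock /@start[tid]/ { @spin_ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'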

Describe how to reproduce the problem

  1. Put a Gentoo stage 3 (or 4) on a ZFS dataset (e.g. zpool1/mystage4)
  2. Mount everything required (mount -o bind /proc /zpool1/mystage4/proc, mount -o bind /sys /zpool1/mystage4/sys and so on)
  3. chroot there (chroot /zpool1/mystage4 /bin/bash)
  4. env-update
  5. emerge net-fs/samba

I did not try it (yet), but installing a common Linux distribution on a ZFS pool and recompiling Samba 4.x should also trigger the issue.
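For completeness, the whole sequence boils down to roughly the following (the dataset name and stage tarball path are only examples):

# zfs create zpool1/mystage4
# tar xpf /path/to/stage3-amd64-*.tar.xz -C /zpool1/mystage4
# mount -o bind /proc /zpool1/mystage4/proc
# mount -o bind /sys  /zpool1/mystage4/sys
# mount -o bind /dev  /zpool1/mystage4/dev
# chroot /zpool1/mystage4 /bin/bash
# env-update && source /etc/profile
# emerge net-fs/samba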

robn commented 1 year ago

I tried to reproduce this on a quite different system, and could not, so probably need more information. Here's what I did.

Installed auditd, tried to match your config like so:

# auditctl -l
-a never,exit -F arch=b64 -S read,write,open,close,stat,fstat,mmap,munmap,brk,rt_sigaction,dup2,nanosleep,fcntl
# auditctl -e 1
enabled 1
failure 1
pid 223878
rate_limit 0
backlog_limit 8192
lost 0
backlog 0
backlog_wait_time 60000
backlog_wait_time_actual 0
loginuid_immutable 0 unlocked

Pool is a single vdev on a SATA SSD on the end of a USB-SATA bridge:

# lsscsi -l 1:0:0:0
[1:0:0:0]    disk    SanDisk  SDSSDX480GG25          /dev/sdb
# zpool create -f -O atime=off sandisk /dev/sdb

Building Samba:

# cd /sandisk
# tar xfz ~/samba-4.18.1.tar.gz
# cd samba-4.18.1
# ./bootstrap/generated-dists/debian11/bootstrap.sh
# ./configure
# make -j5

So it's possible this hardware just isn't generating enough load (only four cores, slow disk) and the lock is never contended. But I also don't know auditd well; maybe I didn't even turn it on right. It might be a difference in the kernel. There are lots of other variables.
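If someone knows auditd better than I do, I assume something like this is enough to confirm the rules are actually loaded and the subsystem is enabled (sketch):

# auditctl -s                  # kernel audit status: enabled flag, backlog, lost counters
# auditctl -l                  # currently loaded ruleset
# ausearch -ts recent | tail   # recent audit records, if any are being produced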


So, looking at the traces, we can see that we're spending a lot of time trying to get extended attributes (__vfs_getxattr), and from there, directory entries (zfs_get_xattrdir and zfs_dirent_lock). And, we're coming through zpl_prune_sb a lot, which seems to be a helper to reclaim unused inodes and directory entries when the ARC is under pressure for metadata space.

Given all that, the "smell" to me seems like:

  - I don't know Gentoo, so I'm not sure if this stage3/stage4 stuff is some sort of cgroup or otherwise constrained environment. If so, it might just need more memory.

  - Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.
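Something like this should show the current setting and switch it over; if I remember right, xattrs already written stay in the old directory-based format until the files are rewritten:

# zfs get xattr zpool1/mystage4
# zfs set xattr=sa zpool1/mystage4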

ryao commented 1 year ago

If I recall correctly, Linux auditing will do an fsync on every record write, which is impractically slow. I used to have a patch that mitigated this by batching, but I do not know where it is (although drobbins likely has it). A simple workaround (although perhaps not acceptable in security circles, since it does not persist the audit logs through kernel panics and power cycles) would be to have the Linux auditing daemon write to a ZFS dataset that has sync=disabled set.
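A rough sketch of that workaround (pool and dataset names are made up, and as said above this weakens the durability of the audit log):

# zfs create -o sync=disabled -o xattr=sa -o mountpoint=/var/log/audit rpool/audit
# service auditd restart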

The high time spent in osq_lock is due to lock contention. The actual problem is whatever is causing the lock contention. I would have been inclined to say that auditing is the cause, but reading that we incur a larger performance hit from auditing on ZFS than on XFS makes me think otherwise. Nothing stands out to me as the cause for that. The only possibility that occurs to me is that some part of the build process uses a less efficient set of syscalls on ZFS than on XFS. Using a tracing utility to get the top 10 syscalls used during builds on ZFS and on XFS for comparison might lead to some insight that could be used to find the cause.
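Something along these lines (a sketch, not something I have run on this workload) should give comparable per-syscall counts on both filesystems:

# strace -f -c -o /tmp/syscalls-zfs.txt emerge net-fs/samba   # heavy overhead, but gives per-syscall totals
# perf trace -s -- emerge net-fs/samba                        # lighter-weight alternative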

I don't know Gentoo, so I'm not sure if this stage3/stage4 stuff is some sort of cgroup or otherwise constrained environment. If so, it might just need more memory.

Stage 3 is a minimal tarball used for installing Gentoo. Think of it as Gentoo's equivalent of Debian's debootstrap: it puts a minimal set of files in place needed to have a working OS, although some machines need additional things (e.g. ZFS) and, unlike debootstrap, the kernel is installed separately. Stage 4 is a stage 3 with some additional things added.

Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.

I would also suggest atime=off like you did in your example.

That said, I would not expect this to make a difference. The issue is contention caused by auditing.

ryao commented 1 year ago

Actually, I have an idea for a possible cause: fsync operations are slower on ZFS than on XFS by default. If the audit log is being stored on ZFS, the difference in fsync speed could explain the slowdown, since auditing will block every audited syscall on an fsync operation.
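A quick way to compare raw fsync latency on the two filesystems would be something like this (paths assumed; run once against a ZFS directory and once against an XFS one):

# fio --name=fsync-lat --rw=write --bs=4k --size=64m --fsync=1 --directory=/zpool1/test
# fio --name=fsync-lat --rw=write --bs=4k --size=64m --fsync=1 --directory=/mnt/xfs/test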

I imagine it might be possible for us to make fsync faster, but it would be a non-trivial change, so I would not expect speed improvements there any time soon. The main possibilities that occur to me are exploratory in nature:

ryao commented 1 year ago

This module parameter might also be slowing down fsync:

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-commit-timeout-pct

The idea behind it is to wait a little to enable aggregation, which gives multithreaded synchronous workloads higher throughput. But it comes at the expense of latency, and it is likely harmful to auditing performance given that Linux's single-threaded auditing daemon will never have two fsync operations in flight at the same time.
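For reference, the parameter can be inspected and lowered at runtime through the usual module parameter interface (5 is the default), although, as noted below, the code clamps it to a minimum of 1:

# cat /sys/module/zfs/parameters/zfs_commit_timeout_pct
5
# echo 1 > /sys/module/zfs/parameters/zfs_commit_timeout_pct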

Unfortunately, we enforce a minimum setting of 1 when we really should support setting 0 to allow people to disable the feature in situations where it is harmful:

https://github.com/openzfs/zfs/blob/65d10bd87c408bfa13fa27bb6ad3ecc0e2e3775b/module/zfs/zil.c#L3036

I am away at a conference, but if anyone particularly wants a patch to allow setting 0 to disable this, I probably could find time to write it tomorrow.

Thinking about it some more, maybe we should have a logbias=low-latency setting that would allow this to be disabled at the dataset level. An even better option would be to devise a way to auto-detect single-threaded workloads and turn the wait off for a while when the workload is entirely single threaded.
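For context, logbias currently only accepts latency (the default) and throughput, so low-latency would be a new value; checking and changing it per dataset already works like this (dataset name is just an example):

# zfs get logbias zpool1/mystage4
# zfs set logbias=throughput zpool1/mystage4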

admnd commented 1 year ago

Actually, I have an idea for a possible cause: fsync operations are slower on ZFS than on XFS by default. If the audit log is being stored on ZFS, the difference in fsync speed could explain the slowdown, since auditing will block every audited syscall on an fsync operation.

TL;DR: Your theory seems to be verified.

A bit more detail:

High contention also occurs if I put the auditd logfile on a different ZFS dataset within the same pool, or on a different pool, or if I add a SLOG device to the pool (for the sake of the test I used a ramdisk, not a real SSD).

However, putting the auditd logfile on something that is not ZFS-backed (e.g. a ramdisk with an ext4 filesystem), as you suggested above, does not trigger this lock contention issue and compilation times are back to normal, just as if auditd has no rules to process or is not running.
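For the record, the non-ZFS setup was roughly the following (size and device are from memory, so treat it as a sketch):

# modprobe brd rd_nr=1 rd_size=1048576   # 1 GiB ram block device
# mkfs.ext4 /dev/ram0
# mount /dev/ram0 /var/log/audit
# /etc/init.d/auditd restart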

What is very interesting is that, in my case, the auditd logfile has only 5-6 lines of events logged. So even if auditd discards the events (and thus logs nearly nothing in its logfile), it seems to force an fsync() or do something else that brings in a high level of lock contention.

Why does this particular package (Samba) trigger the issue while some others do not? Good question; this should be dug into more deeply. The traces in my original message only count occurrences at the kernel level; they do not show what is going on within auditd or GCC per se.
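One angle for digging deeper (untested sketch) would be to count the sync-style calls auditd actually issues during a build:

# bpftrace -e 'tracepoint:syscalls:sys_enter_fsync,
  tracepoint:syscalls:sys_enter_fdatasync
  /comm == "auditd"/ { @[probe] = count(); }'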

EDIT: Typos.

admnd commented 1 year ago

So it's possible this hardware just isn't generating enough load (only four cores, slow disk) and the lock is never contended. But I also don't know auditd well; maybe I didn't even turn it on right. It might be a difference in the kernel. There are lots of other variables.

Indeed: behind the pool I have 3x 2TB NVMe modules and I use "-j32" (the 7950X has 32 logical cores). I was also able to reproduce the issue with 3 iSCSI LUNs in RAIDZ over a 10GbE network link between my workstation and the NAS. I am not sure whether the issue is visible over 2.5GbE or 1GbE links.

So, looking at the traces, we can see that we're spending a lot of time trying to get extended attributes (__vfs_getxattr), and from there, directory entries (zfs_get_xattrdir and zfs_dirent_lock). And, we're coming through zpl_prune_sb a lot, which seems to be a helper to reclaim unused inodes and directory entries when the ARC is under pressure for metadata space.

This was my first guess (the maximum ARC size here is unlimited, so it can grow). Revisiting this with a pre-allocated 16GB ARC rather than the default (8 GB) changes absolutely nothing; the contention is still there :(

arcstat -a 1
    time  hits  miss  read  hit%  miss%  dhit  dmis  dh%  dm%  phit  pmis  ph%  pm%  mhit  mmis  mread  mh%  mm%  arcsz  size     c   mfu   mru  mfug  mrug  eskip  el2skip  el2cach  el2el  el2mfu  el2mru  el2inel  mtxmis  dread  pread  grow  need  free  avail  waste
12:53:43    25     0    25   100      0    25     0  100    0     0     0    0    0     0     0      0    0    0   2.3G  2.3G   16G    15    10     0     0      0        0        0      0       0       0        0       0     25      0     1     0  118G   114G    24M
12:53:44   138     0   138   100      0   138     0  100    0     0     0    0    0   136     0    136  100    0   2.3G  2.3G   16G   138     0     0     0      0        0        0      0       0       0        0       0    138      0     1     0  118G   114G    24M
12:53:45   16K     7   16K    99      0   16K     7   99    0     0     0    0    0  5.7K     7   5.7K   99    0   2.3G  2.3G   16G   16K   681     0     0      0        0        0      0       0       0        0       0    16K      0     1     0  118G   114G    24M
12:53:46   19K     3   19K    99      0   19K     3   99    0     0     0    0    0  4.4K     3   4.4K   99    0   2.3G  2.3G   16G   18K    84     0     0      0        0        0      0       0       0        0       0    19K      0     1     0  118G   114G    24M
12:53:47   20K     2   20K    99      0   20K     2   99    0     0     0    0    0  5.5K     2   5.6K   99    0   2.3G  2.3G   16G   19K   561     0     0      0        0        0      0       0       0        0       0    20K      0     1     0  118G   114G    24M
12:53:48   18K     2   18K    99      0   18K     2   99    0     0     0    0    0  4.1K     2   4.1K   99    0   2.3G  2.3G   16G   18K    31     0     0      0        0        0      0       0       0        0       0    18K      0     1     0  118G   114G    24M
12:53:49   16K     3   16K    99      0   16K     3   99    0     0     0    0    0  5.2K     3   5.2K   99    0   2.3G  2.3G   16G   16K   323     0     0      0        0        0      0       0       0        0       0    16K      0     1     0  118G   113G    24M
12:53:50   721     1   722    99      0   721     1   99    0     0     0    0    0   581     1    582   99    0   2.3G  2.3G   16G   399   322     0     0      0        0        0      0       0       0        0       0    722      0     1     0  118G   113G    24M
12:53:51   14K     9   14K    99      0   14K     9   99    0     0     0    0    0  5.2K     9   5.2K   99    0   2.3G  2.3G   16G   14K   833     0     0      0        0        0      0       0       0        0       0    14K      0     1     0  117G   113G    24M
12:53:52  9.7K     4  9.8K    99      0  9.7K     4   99    0     0     0    0    0  4.6K     4   4.6K   99    0   2.4G  2.4G   16G  9.3K   421     0     0      0        0        0      0       0       0        0       0   9.8K      0     1     0  118G   113G    24M
12:53:53  6.1K     5  6.1K    99      0  6.1K     5   99    0     0     0    0    0  2.8K     5   2.8K   99    0   2.3G  2.3G   16G  5.6K   552     0     0      0        0        0      0       0       0        0       0   6.1K      0     1     0  118G   113G    24M
12:53:54  3.2K     6  3.2K    99      0  3.2K     6   99    0     0     0    0    0  2.5K     6   2.5K   99    0   2.3G  2.3G   16G  2.6K   587     0     0      0        0        0      0       0       0        0       0   3.2K      0     1     0  117G   113G    24M
12:53:55   806     0   806   100      0   806     0  100    0     0     0    0    0    21     0     21  100    0   2.3G  2.3G   16G   806     0     0     0      0        0        0      0       0       0        0       0    806      0     1     0  117G   113G    24M
12:53:56   858     0   858   100      0   858     0  100    0     0     0    0    0   126     0    126  100    0   2.3G  2.3G   16G   853     5     0     0      0        0        0      0       0       0        0       0    858      0     1     0  117G   113G    24M
12:53:57  1.0K     0  1.0K   100      0  1.0K     0  100    0     0     0    0    0   289     0    289  100    0   2.3G  2.3G   16G   953    58     0     0      0        0        0      0       0       0        0       0   1.0K      0     1     0  117G   113G    24M
12:53:58  1.0K     0  1.0K   100      0  1.0K     0  100    0     0     0    0    0   120     0    120  100    0   2.3G  2.3G   16G   988    17     0     0      0        0        0      0       0       0        0       0   1.0K      0     1     0  117G   113G    24M
12:53:59   734     0   734   100      0   734     0  100    0     0     0    0    0     0     0      0    0    0   2.3G  2.3G   16G   734     0     0     0      0        0        0      0       0       0        0       0    734      0     1     0  117G   113G    24M
12:54:00   805     0   805   100      0   805     0  100    0     0     0    0    0    24     0     24  100    0   2.3G  2.3G   16G   805     0     0     0      0        0        0      0       0       0        0       0    805      0     1     0  117G   113G    24M
12:54:01   940     0   940   100      0   940     0  100    0     0     0    0    0   161     0    161  100    0   2.3G  2.3G   16G   916    24     0     0      0        0        0      0       0       0        0       0    940      0     1     0  117G   113G    24M
12:54:02  4.5K     2  4.5K    99      0  4.5K     2   99    0     0     0    0    0  1.4K     2   1.4K   99    0   2.3G  2.3G   16G  4.0K   529     0     0      0        0        0      0       0       0        0       0   4.5K      0     1     0  117G   113G    24M
12:54:03  3.8K     0  3.8K   100      0  3.8K     0  100    0     0     0    0    0   874     0    874  100    0   2.3G  2.3G   16G  3.8K    28     0     0      0        0        0      0       0       0        0       0   3.8K      0     1     0  117G   113G    24M
12:54:04  2.6K     0  2.6K   100      0  2.6K     0  100    0     0     0    0    0   754     0    754  100    0   2.3G  2.3G   16G  2.6K    16     0     0      0        0        0      0       0       0        0       0   2.6K      0     1     0  117G   113G    24M
12:54:05   813     0   813   100      0   813     0  100    0     0     0    0    0   128     0    128  100    0   2.3G  2.3G   16G   798    15     0     0      0        0        0      0       0       0        0       0    813      0     1     0  117G   113G    24M
12:54:06   618     0   618   100      0   618     0  100    0     0     0    0    0    25     0     25  100    0   2.3G  2.3G   16G   617     1     0     0      0        0        0      0       0       0        0       0    618      0     1     0  117G   113G    24M
12:54:07   894     0   894   100      0   894     0  100    0     0     0    0    0   240     0    240  100    0   2.3G  2.3G   16G   817    77     0     0      0        0        0      0       0       0        0       0    894      0     1     0  117G   113G    24M
12:54:08   710     0   710   100      0   710     0  100    0     0     0    0    0    32     0     32  100    0   2.3G  2.3G   16G   704     6     0     0      0        0        0      0       0       0        0       0    710      0     1     0  117G   113G    24M
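(For the record, pre-allocating the ARC here amounts to pinning the ARC module parameters before importing the pool; a sketch for 16 GB, values in bytes:)

# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_min
# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max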

Either way, I'm betting you have xattr=on on the dataset, and setting it to xattr=sa will help (it defaults to on). zfsprops(7) has more detail.

I tested that also, no joy, issue still present :)

admnd commented 1 year ago

The baseline configuration for auditd is this one:

local_events = yes
write_logs = yes
log_file = /var/log/audit/audit.log
log_group = root
log_format = ENRICHED
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
(...)

Now, if I change write_logs = yes to write_logs = no and have rsyslog running => contention. As soon as rsyslog is brought down, the contention is gone.

I also tried flush = NONE but saw no improvement.
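For anyone retracing these tests, the variations were plain edits to /etc/audit/auditd.conf followed by a daemon restart; a sketch:

# sed -i 's/^write_logs = yes/write_logs = no/' /etc/audit/auditd.conf
# sed -i 's/^flush = INCREMENTAL_ASYNC/flush = NONE/' /etc/audit/auditd.conf
# /etc/init.d/auditd restart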