Open njuettner opened 1 week ago
Additionally, we are using `-w` watch rules, which are known for bad performance:
-w /usr/bin/docker -p rwxa -k docker
-w /var/lib/docker -p rwxa -k docker
-w /etc/docker -p rwxa -k docker
-w /etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -p rwxa -k docker
-w /etc/systemd/system/docker.service.d/01-wait-docker.conf -p rwxa -k docker
-w /usr/lib/systemd/system/docker.service -p rwxa -k docker
-w /usr/lib/systemd/system/docker.socket -p rwxa -k docker
-a always,exit -F arch=b64 -S execve -F key=auditing
-a always,exit -F arch=b32 -S execve -F key=auditing
From the auditctl(8) man page:

-k key
    Set a filter key on an audit rule. This is deprecated when used with watches. Convert any watches to the syscall form of rules. It is still valid for use with deleting or listing rules.

-w path
    Place a watch on path. If the path is a file, it's almost the same as using the -F path option on a syscall rule. If the watch is on a directory, it's almost the same as using the -F dir option on a syscall rule. The -w form of writing watches is for backwards compatibility and is deprecated due to poor system performance. Convert watches of this form to the syscall based form. The only valid options when using a watch are the -p and -k.
We need to migrate those using:
-a always,exit -F path=/usr/bin/docker -S all -F key=docker
-a always,exit -F path=/etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -S all -F key=docker
...
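The mechanical part of that migration can be scripted; a minimal sketch, assuming each watch line has the exact shape `-w <path> -p <perms> -k <key>`:

```shell
# Convert a deprecated -w watch rule to the -a syscall form.
# Assumes the input has the exact shape: -w <path> -p <perms> -k <key>
convert_watch() {
  set -- $1  # word-split the rule into positional parameters
  printf '%s\n' "-a always,exit -F path=$2 -F perm=$4 -F key=$6"
}

convert_watch "-w /etc/docker -p rwxa -k docker"
# → -a always,exit -F path=/etc/docker -F perm=rwxa -F key=docker
```

Note that directory watches would need `-F dir=` instead of `-F path=`, per the man page excerpt above.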
# Audit execution of Docker binary
-a always,exit -F path=/usr/bin/docker -F perm=x -k docker_exec
# Audit writes to Docker data directory
-a always,exit -F dir=/var/lib/docker -F perm=wa -k docker_data
# Audit changes to Docker configuration
-a always,exit -F dir=/etc/docker -F perm=wa -k docker_config
# Audit changes to Docker systemd configuration files
-a always,exit -F path=/etc/systemd/system/docker.service.d/10-giantswarm-extra-args.conf -F perm=wa -k docker_systemd_config
-a always,exit -F path=/etc/systemd/system/docker.service.d/01-wait-docker.conf -F perm=wa -k docker_systemd_config
# Audit changes to Docker service and socket files
-a always,exit -F path=/usr/lib/systemd/system/docker.service -F perm=wa -k docker_service
-a always,exit -F path=/usr/lib/systemd/system/docker.socket -F perm=wa -k docker_socket
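Before loading a migrated rules file, it is cheap to verify that no deprecated `-w` watches remain; a minimal sketch (the rules-file path would be whatever lives under /etc/audit/rules.d/):

```shell
# check_no_watches FILE: fail if any deprecated -w watch rules remain.
check_no_watches() {
  if grep -q '^[[:space:]]*-w[[:space:]]' "$1"; then
    echo "deprecated -w rules found in $1"
    return 1
  fi
  echo "ok: syscall-form rules only"
}
```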
Another idea would be changing the auditd.conf:
q_depth = 8192 # Increase buffer size
max_log_file = 50 # Increase max log file size
num_logs = 10 # Increase number of logs
freq = 100 # Flush records to disk only every 100 records (less frequent writes)
overflow_action = SUSPEND # Suspend event processing when the queue overflows
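As a sanity check on the numbers above (max_log_file is in megabytes), the proposed rotation settings reserve roughly:

```shell
# Rough disk budget implied by the proposed settings (max_log_file is in MB).
max_log_file=50
num_logs=10
echo "$((max_log_file * num_logs)) MB for rotated audit logs"
# → 500 MB for rotated audit logs
```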
Currently q_depth is not set, which might also be why we see these errors:
Error receiving audit netlink packet (No buffer space available)
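A quick way to confirm whether q_depth is set on an affected node; a minimal sketch (assumes the stock /etc/audit/auditd.conf location — older audit versions kept q_depth in audispd's config instead):

```shell
# has_q_depth FILE: report whether q_depth is configured in the given file.
has_q_depth() {
  if grep -q '^[[:space:]]*q_depth' "$1" 2>/dev/null; then
    grep '^[[:space:]]*q_depth' "$1"
  else
    echo "q_depth not set (the event queue falls back to its default)"
  fi
}

has_q_depth /etc/audit/auditd.conf
```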
Auditd seems to have been integrated by @giantswarm/team-atlas, but I'm not sure what the reason behind it was: https://github.com/giantswarm/k8scloudconfig/releases/tag/v16.5.0
@QuentinBisson do you remember why?
After talking with Quentin, a quick solution for the Vintage clusters which are affected:
Applying k8s-initiator-app on the nodepools where Jenkins is running
Removing the default rules file:
rm /etc/audit/rules.d/99-default.rules
and reloading the rules without them:
augenrules --load
For CAPI we would need to integrate a toggle which enables auditd when needed but should be disabled by default, PR to disable it: https://github.com/giantswarm/cluster/pull/325
CAPI: fixed (auditd is disabled by default and can be enabled at any time), but we need new CAPA and CAPZ releases
Vintage: k8s-initiator-app is not working for removing audit rules. The main issue is auditctl and the dependency mess: the rules are kept in-memory, so a reload is required and removing just the files isn't enough. See Slack thread: https://gigantic.slack.com/archives/C062HB29BDG/p1725542379870169
For Vintage we exhausted our options getting around a new release, so we need a new v20 release.
Prepare a new aws-operator release which includes a new annotation to toggle auditd. It should be enabled by default because one customer already relies on it, so we don't mess around IMHO.
Auditd is included in the k8scloudconfig.
@T-Kukawka Could Phoenix start working on it next week please? I'm off the next days, otherwise I would have jumped in.
For tracking: Adidas issue
@T-Kukawka it looks like we don't need to do it. It was a final test and it seems we can get around doing a new vintage release: https://gigantic.slack.com/archives/C062HB29BDG/p1726086709037459?thread_ts=1725542379.870169&cid=C062HB29BDG
Daniel figured out that setting hostPid: "true" might solve it.
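For reference, that flag lives at the pod spec level (the Kubernetes field is spelled hostPID and takes a boolean); a minimal sketch with hypothetical names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent              # hypothetical name
spec:
  hostPID: true                    # share the host PID namespace (Daniel's suggestion)
  containers:
    - name: agent
      image: jenkins/inbound-agent # example image, not necessarily the customer's
```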
Slack Thread: https://gigantic.slack.com/archives/C6L8J93N0/p1724419948903269
TL;DR
Audit logging slows Jenkins operations; the slowed operations take longer and generate even more audit events over time.
Context
After upgrading from release v19.3.0 to v20.1.2, the customer noticed a heavy impact on node pools where Jenkins Agents are running. Those nodes were becoming ultra slow. We were able to identify that writing audit messages is the bottleneck.
We looked at the audit rules to track what the system is doing:
This rule audits all program executions (via the execve system call) on 64-bit systems. It’s a broad rule that captures when any program is run.
Similar to the previous rule, but for 32-bit systems. It also audits all program executions.
When running Jenkins, it happens that nodes become unresponsive for seconds.
Example:
When flushing all audit rules, the node becomes instantly responsive again.
We're still not sure why this happens now; it might be that Jenkins is executing more commands than before, or that auditd has changed since the last release.