linux-audit / audit-kernel

GitHub mirror of the Linux Kernel's audit repository
https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git

Q: investigate calling auditd_reset() from multiple threads #33

Closed by pcmoore 7 years ago

pcmoore commented 7 years ago

I'm concerned there may be a race issue with multiple threads calling into auditd_reset(), e.g. the kauditd_thread and various processes via audit_log_end() or similar. Investigate this and correct if necessary.

See https://github.com/linux-audit/audit-kernel/issues/30 for more information.

rgbriggs commented 7 years ago

See additional comment at the end of: https://www.redhat.com/archives/linux-audit/2016-December/msg00082.html

pcmoore commented 7 years ago

I'm guessing you are referring to this comment:

I'll post another tested patch, but I'm still not that happy that it does not proactively reset audit_pid, audit_nlk_portid and audit_sock when auditd's socket has a problem. I'll leave the test run overnight.

... ?

If so, that may or may not be an issue, but it is a separate concern, or at the very least a concern that should be addressed after the locking question has been answered. If you have a question about variable consistency across multiple threads, adding more code to manipulate the variable before ensuring the variable's integrity isn't a step in the right direction ;)

rgbriggs commented 7 years ago

I was really hoping to delete code by not schlepping around so many state variables, or put them all into one struct...

pcmoore commented 7 years ago

Regardless of how the state is represented, struct/scalar/etc., we need to focus on answering the locking question first.

pcmoore commented 7 years ago

Should be resolved in commit:

commit 5b52330bbfe63b3305765354d6046c9f7f89c011
Author: Paul Moore <paul@paul-moore.com>
Date:   Tue Mar 21 11:26:35 2017 -0400

audit: fix auditd/kernel connection state tracking

What started as a rather straightforward race condition reported by
Dmitry using the syzkaller fuzzer ended up revealing some major
problems with how the audit subsystem managed its netlink sockets and
its connection with the userspace audit daemon.  Fixing this properly
had quite the cascading effect and what we are left with is this rather
large and complicated patch.  My initial goal was to try and decompose
this patch into multiple smaller patches, but the way these changes
are intertwined makes it difficult to split these changes into
meaningful pieces that don't break or somehow make things worse for
the intermediate states.

The patch makes a number of changes, but the most significant are
highlighted below:

* The auditd tracking variables, e.g. audit_sock, are now gone and
replaced by a RCU/spin_lock protected variable auditd_conn which is
a structure containing all of the auditd tracking information.

* We no longer track the auditd sock directly, instead we track it
via the network namespace in which it resides and we use the audit
socket associated with that namespace.  In spirit, this is what the
code was trying to do prior to this patch (at least I think that is
what the original authors intended), but it was done rather poorly
and added a layer of obfuscation that only masked the underlying
problems.

* Big backlog queue cleanup, again.  In v4.10 we made some pretty big
changes to how the audit backlog queues work, here we haven't changed
the queue design so much as cleaned up the implementation.  Brought
about by the locking changes, we've simplified kauditd_thread() quite
a bit by consolidating the queue handling into a new helper function,
kauditd_send_queue(), which allows us to eliminate a lot of very
similar code and makes the looping logic in kauditd_thread() clearer.

* All netlink messages sent to auditd are now sent via
auditd_send_unicast_skb().  Other than just making sense, this makes
the lock handling easier.

* Change the audit_log_start() sleep behavior so that we never sleep
on auditd events (unchanged) or if the caller is holding the
audit_cmd_mutex (changed).  Previously we didn't sleep if the caller
was auditd or if the message type fell between a certain range; the
type check was a poor effort of doing what the cmd_mutex check now
does.  Richard Guy Briggs originally proposed not sleeping the
cmd_mutex owner several years ago but his patch wasn't acceptable
at the time.  At least the idea lives on here.

* A problem with the lost record counter has been resolved.  Steve
Grubb and I both happened to notice this problem and according to
some quick testing by Steve, this problem goes back quite some time.
It's largely a harmless problem, although it may have left some
careful sysadmins quite puzzled.

Cc: <stable@vger.kernel.org> # 4.10.x-
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
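
For readers who want a concrete picture of the first bullet in the commit message (several loose tracking variables replaced by one protected structure), here is a minimal userspace sketch of the general idea. It is an analogy only, not the kernel code: a pthread mutex stands in for the kernel's RCU/spin_lock combination, readers copy the state out under the lock instead of using an RCU read-side section, and the names conn_state, conn_set(), conn_get() and conn_reset() are hypothetical. The point it illustrates is the one raised in this issue: when every update swaps a single pointer under a lock, any number of threads can call the reset path concurrently and only one of them will tear down the old state.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the daemon connection state.
 * All related fields live in one object so they change together. */
struct conn_state {
    int pid;
    unsigned int portid;
};

/* NULL when no daemon is registered; only touched under conn_lock. */
static struct conn_state *conn;
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Install a new connection, replacing any previous one.
 * Returns 0 on success, -1 if allocation fails. */
int conn_set(int pid, unsigned int portid)
{
    struct conn_state *new_conn = malloc(sizeof(*new_conn));
    struct conn_state *old;

    if (!new_conn)
        return -1;
    new_conn->pid = pid;
    new_conn->portid = portid;

    pthread_mutex_lock(&conn_lock);
    old = conn;
    conn = new_conn;
    pthread_mutex_unlock(&conn_lock);

    /* Safe here because readers only copy while holding the lock;
     * the kernel would defer this until RCU readers have finished. */
    free(old);
    return 0;
}

/* Copy out a consistent snapshot of the connection state.
 * Returns 1 if a daemon is registered, 0 otherwise. */
int conn_get(struct conn_state *out)
{
    int connected = 0;

    pthread_mutex_lock(&conn_lock);
    if (conn) {
        *out = *conn;
        connected = 1;
    }
    pthread_mutex_unlock(&conn_lock);
    return connected;
}

/* Drop the current connection. Safe to call from many threads:
 * returns 1 for the one caller that removed it, 0 for the rest. */
int conn_reset(void)
{
    struct conn_state *old;

    pthread_mutex_lock(&conn_lock);
    old = conn;
    conn = NULL;
    pthread_mutex_unlock(&conn_lock);

    if (!old)
        return 0;
    free(old);
    return 1;
}
```

Because pid and portid are only ever read or replaced as a unit, a reader can never observe the pid of one daemon paired with the portid of another, which is the kind of inconsistency that separate scalar variables allow.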