Closed rgbriggs closed 7 years ago
A first stab at a test case, doesn't compile: https://github.com/rgbriggs/audit-testsuite/commit/cd8e0e04bf145196b73241cf84562daeff6e9a76
The "Post:" field was lost in the original report above, maybe the Markdown parser ate it?
It's funny this comes up, when I was playing with the multicast/queue code I wondered if there were any issues around updating audit_pid and audit_sock without some sort of locking protection. Ultimately I decided it probably wasn't a realistic issue since we hadn't encountered any problems in the several years (if ever), so I shelved the thought for another day. I haven't looked at this in much detail yet, but would protecting the audit_* connection tracking variables with a RCU/spinlock combo be a workable fix here?
As for the test, we definitely could use a reproducer to test/debug this, but I'm not sure this is something we would want to merge into the audit-testsuite.
@pcmoore Thanks for pointing out the missing URI. I found another that was eaten too. Post Reproducer
I now have the test binary compiling. Now need to update the test script. I have not yet tried to trigger the error.
Why do you think a reproducer is not a good idea to merge into the audit-testsuite?
Ate it again... I thought humans were distinct from other species in being able to learn from their mistakes... first proactive reset patch
Second version of a potential fix:
Why do you think a reproducer is not a good idea to merge into the audit-testsuite?
The biggest objection is that it could cause a kernel GPF/panic if run on a buggy kernel; while I recognize that is always a risk, I don't want to merge code that knowingly panics the system if the bug hasn't been patched.
My other concern gets more to the nature and purpose of the audit-testsuite tests. The idea behind this testsuite was to have a simple, easy to run regression test that could be used to validate upstream development and demonstrate that basic functionality remained intact across kernel RC releases, and to provide a simple reproducer when something broke. It turns out the testsuite has proven useful in other areas, but it is still not a stress test or a collection of tests/reproducers for every audit bug. We could add a test to verify proper behavior when auditd disconnects, that would be appropriate, but probably not what you were looking for in the context of this issue.
V3 potential fix: https://lkml.org/lkml/2016/12/13/329
@pcmoore Where would you put a destructive test like the socket fuzzer so that it does get run periodically in a stress test and isn't forgotten and potentially never run again?
Upstream pull request: https://www.redhat.com/archives/linux-audit/2016-December/msg00090.html
For the record, Linus merged the pull request, this particular issue was fixed with 533c7b69c764ad5febb3e716899f43a75564fcab.
Where would you put a destructive test like the socket fuzzer so that it does get
run periodically in a stress test and isn't forgotten and potentially never run again?
I don't believe we have a good answer for this at the moment. Let's close out this issue (it's solved now) and continue the test discussion in https://github.com/linux-audit/audit-testsuite/issues/29.
A shutdown race of audit_sock appears to be causing a GPF when audit_replace tries to get sock_sndtimeo after audit_sock has been reset.
Reported-by: Dmitry Vyukov dvyukov@google.com Post: Reproducer: The following program triggers GPF in sock_sndtimeo: Eric Dumazet: Looks a bug added in commit 32a1dbaece7e37cea415e03cd426172249aa859e ("audit: try harder to send to auditd upon netlink failure") or 133e1e5acd4a63c4a0dcc413e90d5decdbce9c4a ("audit: stop an old auditd being starved out by a new auditd") Cong Wang: It is racy on audit_sock, especially on the netns exit path.