Closed stevenbrz closed 2 years ago
Part of the motivation for this change is that it can be difficult to deploy something like this in a large scale environment when one of the failure modes can completely hang the system. Having a permissive mode available that's lower risk makes this easier to adopt, debug, and deploy.
A long time ago, it operated like you are suggesting. The problem is that by the time it gets to assessing the rules, the application can be gone. That was causing errors and spurious failures. So, to get accurate results, it now unconditionally approves the request after the rules engine has run.
There are plans for more performance work to make it run faster. (The latest release should be a little better.) Also, if you are using sha256 integrity, switch to size - just for collecting logs. And, if you can use file-libs-5.42, it is significantly faster due to improvements in regex handling.
Makes sense - going to close this out in favor of a PR just containing the bugfix: https://github.com/linux-application-whitelisting/fapolicyd/pull/193.
I'm curious what kind of errors can happen in the non-blocking mode? fanotify returns an open file handle, and this PR doesn't close the file handle until after assessing the rules, so I'm curious what you mean by 'can be gone'.
Context: we run some large Mesos instances and have been struggling to get fapolicyd running without affecting system stability (even with integrity=none, a single allow-all rule, and tuned cache/queue sizes). This patch is running stably for us and we haven't encountered any errors yet.
Everything on the subject side of a rule, except the PID, comes from opening a couple of files in /proc/pid. That is where the bulk of the problems come from.
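To make the /proc race concrete, here is a minimal sketch. It is not fapolicyd code: `read_proc_comm` is a hypothetical helper, and /proc/<pid>/comm is just one example of a per-process file. It shows how a lookup fails once the process has exited and been reaped, even though the PID from the event is still in hand.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of fapolicyd): copy the command name of
 * `pid` from /proc/<pid>/comm into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read, which is
 * what happens once the process has exited and been reaped.
 */
int read_proc_comm(pid_t pid, char *buf, size_t len)
{
    char path[64];
    FILE *f;

    if (buf == NULL || len == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/comm", (long)pid);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* /proc/<pid> is gone: subject data is lost */

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;              /* process vanished between open and read */
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

With a blocking permission event the opener is held until a response is written, so these files are still there when the rules run. With a non-blocking event the process may already be gone, and a call like this returns -1.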
fanotify supports FAN_OPEN and FAN_OPEN_EXEC, which do not block on receiving a decision. For our use case of having fapolicyd produce logs which we then can analyze asynchronously, this can boost performance significantly.

In addition, when testing, we noticed that if the internal event queue fills, we do not close the event's file descriptor, which results in the process accumulating them over time if the queue size is not properly configured.
(co-authored by @kenbreeman)