linux-application-whitelisting / fapolicyd

File Access Policy Daemon
GNU General Public License v3.0
192 stars 55 forks source link

Configurable metric reporting interval #284

Closed jw3 closed 4 months ago

jw3 commented 7 months ago

Provides recurring metric reports at a specified interval.

Uses a nonblocking timerfd alongside pthread_cond_timedwait to create timed reporting intervals. The interval is configured through the report_interval key in fapolicyd.conf. This value represents the length of interval in seconds between reports. If set to 0 reporting intervals are disabled, this is the default.

Reports are written to the fapolicyd.state file. The output format and contents of this file are unchanged by this PR. This is the same file that was written at shutdown and upon receiving SIGUSR1. The signal handling function remains unchanged.

Error handling is based around disabling interval reporting to preserve the decision thread. When an unrecoverable reporting error occurs the interval reports are disabled and the decision thread will use the non-interval report handler for the remainder of it's execution.

The structuring of the reporting loop gives priority to the fan handling loop. When a reporting interval expires, the fan handler runs prior to writing the report to prioritize the fan handling and also to ensure the cache stats are as recent as possible. Also the reporting loop is only entered when no fan events have been recorded.

This PR also restructure the initialization of the decision thread and reporting loop to ensure a fresh fapolicyd.state file when fapolicyd starts. The previous behavior was that the file from the last shutdown was still present until either a SIGUSR1 or shutdown.

stevegrubb commented 7 months ago

Hmm. If you send SIGUSR1 to fapolicyd, the results are in the file /run/fapolicyd/fapolicyd.state.

jw3 commented 7 months ago

Ah, :eyes:, thanks @stevegrubb.. thought that was only done at shutdown.

Is there value in having the decision log interleaved with cache stats? This PR would allow logging that type of consolidated timeline.

I'm also a little concerned with adding noise to the logs when using SIGUSR1, (1) when bumping the event loop to ensure stats are not stale (see fapolicyd-cli ~line 850) and (2) when reading the contents of the state file.

The noise from both of these can be filtered out, though that creates a tricky case where log entries are interesting but match the filter, eg. profiling a profiler.

Are there any technical issues with the implementation provided in this PR?


Ill take a look at revising the cli logic, adding the new command doesn't seem right.

stevegrubb commented 7 months ago

"fapolicyd-cli --check-status" does the SIGUSR1 thing and then displays the resulting file. It does not dump the LRU's. That only happens at shutdown and if fapolicyd.conf has do_stat_report = 1, which is the default. When set it to 0, you get just the metrics on shutdown. But no matter the value, SIGUSR1 is a concise report.

jw3 commented 7 months ago

Thanks for clarifying that @stevegrubb. I may have been using the term stats incorrectly. My stats term usage was based on lru.c function dump_queue_stats.

So with that said, I'm not looking to use values from the fapolicyd-access.log report. My only interest is in the metrics written to the fapolicyd.state file.

Those metrics are also logged at DEBUG level on shutdown. I was hoping they could also be logged upon request during runtime, that would allow application to continuously poll the cache state during runtime.

The fallback if this PR cannot merge, is to poll with the SIGUSR1 signal, which is feasible. The downside of the fallback is the file access operations of (1) a contrived file read to bump the event loop (see fapolicyd-cli ~line 850), and (2) reading the metrics from disk; each of which adds irrelevant file accesses, ie. noise, to the access log.

jw3 commented 7 months ago

Background in fapolicy-analyzer issue

stevegrubb commented 7 months ago

I like the idea of making periodic metrics available. There are systems like statsd/prometheus/collectd/fluentd/etc that could help gain visibility into how the daemon (and by extention the system) is performing. By sending this to syslog it is not trivial to dig out the last occurrence. The intention of the SIGUSR1 interface was to have an on-demand way of seeing the daemon's internal metrics. The output is short and always the latest status. This would be ideal for collecting metrics.

The fapolicyd-cli app looks at the pid file, checks what it points to, if it is fapolicyd, it sends a sigusr1. Fapolicyd receives the signal on the inbound/main thread. The report is run by the decision thread so that it is serialized with any updates to the metrics. If the queue is empty, it sleeps indefinitely on a conditional wait. In order to wake it up, the cli app accesses a file (the fapolicyd config file). This can skew the statistics for the access report. (This is only written when the daemon exits and has no effect on performance data.) I suppose we can change the file it accesses, but to which one?

It might be possible to trigger the conditional wait from the signal handler or the main loop without needing a file access. This would let the report run immediately without skewing access statistics. We might need to do something in the decision thread to be OK if the queue is still empty.

Another possibility would be to add a configuration item for periodic reports. What this would do is setup a periodic timerfd which could be polled. When the timer fires, run a new report. This would avoid the need to create an access which skews statistics. The serialization of the report still needs to be handled in the decision thread. This might still need the conditional wait work mentioned in the previous paragraph. Then plugins can setup a notification on the report changing to trigger a re-read/parsing.

jw3 commented 7 months ago

Another possibility would be to add a configuration item for periodic reports. What this would do is setup a periodic timerfd which could be polled. When the timer fires, run a new report. This would avoid the need to create an access which skews statistics. The serialization of the report still needs to be handled in the decision thread. This might still need the conditional wait work mentioned in the previous paragraph.

Yes, this sounds right.

I made an attempt at sketching that out in this PR, it has sharp edges but is a working example. It needs some more isolated testing. I have a lot of events flowing on my system, so hard to say for sure if the pthread_cond_timedwait is working as expected. I'll test in isolation some and see what it looks like.

fapolicyd-cli --check-status remains functional.

stevegrubb commented 7 months ago

Thanks. I'll look over the changes in the coming week.

jw3 commented 7 months ago

Ok thanks. It seems to be working as expected, both when testing in isolation on a single directory and also when running it on my full system.

radosroka commented 4 months ago

@jw3 do you have anything to add? If not we could probably merge it.

jw3 commented 4 months ago

@radosroka I think it's ready to merge.

radosroka commented 4 months ago

@radosroka I think it's ready to merge.

Would you mind squashing the commits into one?

jw3 commented 4 months ago

@radosroka it is squashed and rebased :+1:

jw3 commented 4 months ago

rebased on @stevegrubb's latest commits

Resolved a few minor conflicts in notify.c, nothing significant.