linux-audit / audit-kernel

GitHub mirror of the Linux Kernel's audit repository
https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git
Other
138 stars 36 forks source link

Q: audit queue backlog residual issues #31

Closed rgbriggs closed 7 years ago

rgbriggs commented 7 years ago

Catch-all for any remaining audit queue backlog issues. See: https://bugzilla.redhat.com/show_bug.cgi?id=1129013 https://github.com/linux-audit/audit-kernel/issues/23

pcmoore commented 7 years ago

FYI: I'm marking this as a "question" for right now since it is unclear how we should label this "meta issue" at the moment. As we discuss these issues we can relabel/reprioritize as needed.

The immediate need was to get @rgbriggs' concerns recorded somewhere so we wouldn't collectively lose them, and this issue tracker accomplishes that need, but I think our first order of business should be to break this issue apart into separate issues that can be tackled in reasonable chunks. I'm fine if multiple concerns get put into a single issue, perhaps they do all belong in the same issue tracker after all, but I think some thought/discussion should happen on this matter.

rgbriggs commented 7 years ago

This was the intent. Some may even be dismissed as not an issue or not planning to fix. It was already shorter than I was expecting. :)

pcmoore commented 7 years ago

Heh, I can't remember the last time any ToDo list ever came up shorter than I expected ;)

pcmoore commented 7 years ago

Quickly revisiting this issue to see where things stand in light of the big changes in v4.11:

RFE: audit default backlog_wait_time setting Add audit_pid to the while loop conditions in audit_log_start() https://bugzilla.redhat.com/show_bug.cgi?id=1296189

I'm really not sure this is an issue, especially with the mutex owner check in audit_log_start(). If we want to keep this concern alive, let's work to turn this into something a bit more measurable.

Ensure we don't have any special problems with systemd shutting down auditd, filling or blocking the queue. I don't like the idea of hard-coding a pid. Another method would be preferable. https://www.redhat.com/archives/linux-audit/2015-October/msg00071.html

On my test system, with the changes in v4.11, the system shuts down with little delay even under a massive audit load. I think we can consider this resolved.

Potentially allow audit_cmd_mutex holders to bypass, override or jump to the head of the queue. https://www.redhat.com/archives/linux-audit/2015-October/msg00073.html

The v4.11 changes allow the mutex owner to skip any backlog restrictions; I think we can consider this resolved.

Wake up waiting tasks when auditd goes away and put them on the hold queue audit: wake up audit_backlog_wait queue when auditd goes away https://www.redhat.com/archives/linux-audit/2015-October/msg00074.html https://www.redhat.com/archives/linux-audit/2015-November/msg00028.html

The auditd_reset() function, which is called to disconnect the kernel/auditd connection, flushes the main and retry queues to the hold queue, but we don't explicit wake anyone waiting on the main backlog.

(just noticed a bug: by flushing the main queue we bypass the multicast send)

audit: wake up kauditd_thread after auditd registers https://www.redhat.com/archives/linux-audit/2015-October/msg00075.html

Done. Although the kauditd_thread() kernel thread should be active whenever there is anything on the main backlog queue so this shouldn't be an issue anymore.

pcmoore commented 7 years ago

(just noticed a bug: by flushing the main queue we bypass the multicast send)

Created #41 to track this.

rgbriggs commented 7 years ago

[how do you automated quote and thread replies in the comment section?]

Quickly revisiting this issue to see where things stand in light of the big changes in v4.11:

RFE: audit default backlog_wait_time setting

I think this has been addressed by your code restructuring.

Add audit_pid to the while loop conditions in audit_log_start() https://bugzilla.redhat.com/show_bug.cgi?id=1296189

I'm really not sure this is an issue, especially with the mutex owner check in audit_log_start(). If we want to keep this concern alive, let's work to turn this into something a bit more measurable.

Again, I'm not sure this is still a problem, but would have to look at it again with a fresh eye.

Ensure we don't have any special problems with systemd shutting down auditd, filling or blocking the queue. I don't like the idea of hard-coding a pid. Another method would be preferable. https://www.redhat.com/archives/linux-audit/2015-October/msg00071.html

On my test system, with the changes in v4.11, the system shuts down with little delay even under a massive audit load. I think we can consider this resolved.

The last time you had looked at this, it sounded like the delay had doubled from 60s to 120s. What is the shutdown delay now? Is it under 5s?

Potentially allow audit_cmd_mutex holders to bypass, override or jump to the head of the queue. https://www.redhat.com/archives/linux-audit/2015-October/msg00073.html

The v4.11 changes allow the mutex owner to skip any backlog restrictions; I think we can consider this resolved.

I had tried that fix previously. I agree this should be resolved.

Wake up waiting tasks when auditd goes away and put them on the hold queue audit: wake up audit_backlog_wait queue when auditd goes away https://www.redhat.com/archives/linux-audit/2015-October/msg00074.html https://www.redhat.com/archives/linux-audit/2015-November/msg00028.html

The auditd_reset() function, which is called to disconnect the kernel/auditd connection, flushes the main and retry queues to the hold queue, but we don't explicit wake anyone waiting on the main backlog.

So this issue is still outstanding. Is there any reason to keep these tasks waiting when there is no longer an audit daemon, rather than immediately add these messages to the hold queue if there is any room left in it?

(just noticed a bug: by flushing the main queue we bypass the multicast send)

audit: wake up kauditd_thread after auditd registers https://www.redhat.com/archives/linux-audit/2015-October/msg00075.html

Done. Although the kauditd_thread() kernel thread should be active whenever there is anything on the main backlog queue so this shouldn't be an issue anymore.

Ok.

pcmoore commented 7 years ago

The last time you had looked at this, it sounded like the delay had doubled from 60s to 120s. What is the shutdown delay now? Is it under 5s?

As I said, this is fixed with the latest v4.11 changes; there is no noticeable shutdown delay caused by audit even under massive load.

Is there any reason to keep these tasks waiting when there is no longer an audit daemon, rather than immediately add these messages to the hold queue if there is any room left in it?

To be clear, we should never immediately add the records to the hold queue (this is what issue #41 is about).

Regarding keeping the processes waiting, I'm not sure if this is really something we need to worry about as the bottom of kauditd_thread() is going to wake those tasks anyway (the only reason they are sleeping is because the backlog had grown past the limit). I think we really should just let the normal queue processing do its job here and let the backlog clear normally.

Thoughts?

rgbriggs commented 7 years ago

On 2017-04-19 13:35, Paul Moore wrote:

The last time you had looked at this, it sounded like the delay had doubled from 60s to 120s. What is the shutdown delay now? Is it under 5s?

As I said, this is fixed with the latest v4.11 changes; there is no noticeable shutdown delay caused by audit even under massive load.

So less than 2 seconds? That's great.

Is there any reason to keep these tasks waiting when there is no longer an audit daemon, rather than immediately add these messages to the hold queue if there is any room left in it?

To be clear, we should never immediately add the records to the hold queue (this is what issue #41 is about).

Ok, fair enough.

Regarding keeping the processes waiting, I'm not sure if this is really something we need to worry about as the bottom of kauditd_thread() is going to wake those tasks anyway (the only reason they are sleeping is because the backlog had grown past the limit). I think we really should just let the normal queue processing do its job here and let the backlog clear normally.

Ok, again, fair enough.

Thoughts?

If I notice problems, I'll investigate again.

-- Richard Guy Briggs rgb@redhat.com Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635

pcmoore commented 7 years ago

I think we can close this out now, what do you think @rgbriggs ?

rgbriggs commented 7 years ago

@pcmoore Yes, I think we can. There has been enough change that it is difficult to be sure, but we can re-open this or file a new issue if we find something.