linux-application-whitelisting / fapolicyd

File Access Policy Daemon
GNU General Public License v3.0
192 stars 55 forks source link

Stop blocking signals in other threads #276

Closed rmetrich closed 8 months ago

rmetrich commented 9 months ago

This is a continuation of #273

Blocking signals in all other threads but main thread doesn't work, because it leads to crashing then having the kernel coredump handler be called from crashing threads, without calling the main thread signal handler. This is especially true with signals being raised internally, such as SIGSEGV or SIGBUS.

The solution consists in not blocking signals anymore, but raising the signal again for main thread in case the signal handler is running from another thread.


Without the patch (reproducer)

Tested on RHEL9 with 1.3.2-100.el9 The hang can be reproduced through injecting SIGBUS using systemtap, as if the RPMDB was crashing (this is based on real customer scenario).

  1. Start stap injector

    # yum -y install systemtap
    # stap-prep
    # stap -v -g -d /usr/sbin/fapolicyd --ldd -e 'probe process("/usr/sbin/fapolicyd").function("rpm_load_list@library/rpm-backend.c") { if (tid() == pid()) next; printf("%ld (%s): sending SIGBUS\n", tid(), execname()); raise(%{SIGBUS%}) }'
    [...]
  2. Reload RPMDB

    # fapolicyd-cli -u
    Fapolicyd was notified
    --> hang
  3. Confirm stap injected SIGBUS

    1746 (fapolicyd): sending SIGBUS
  4. Collect blocked tasks from hypervisor (QEMU/KVM)

    $ virsh send-key fapolicy9 KEY_LEFTALT KEY_SYSRQ KEY_W

    Console shows blocked tasks: fapolicy9_unpatched.log We can see main thread (PID 1745) and other threads hanging on exit, crashing thread (PID 1746) calling coredump generator (do_coredump). We can also see kernel coredump generator (kworker/u4, PID 1761) hanging waiting on fanotify event handling.


With the patch

  1. fapolicyd crashes and restarts
  2. no hang happens
  3. Killing another thread than main thread externally (using kill -BUS <PID>) doesn't hang either
stevegrubb commented 9 months ago

It's a basic tenet in multi-threaded programming that all signals must be handled by one thread to have deterministic performance. Not doing so causes other, hard to diagnose problems.

rmetrich commented 8 months ago

It's a basic tenet in multi-threaded programming that all signals must be handled by one thread to have deterministic performance. Not doing so causes other, hard to diagnose problems.

Yes, but it appears this doesn't work at all, I spent several hours on this and a customer hit the issue again even with #273 . Internal deadly signals always go to the thread that rose them, hence if you block the signal for the thread, the signal will raise anyway, the process will die without handler in main thread called, causing system hang in fapolicyd case since task spawning core.pattern will hang waiting on authorization from fapolicyd which is not functional anymore.

stevegrubb commented 8 months ago

Which signal is being delivered to which thread that needs to be handled by that thread? We cannot have signals that have a handler, such as SIGUSR1, being unblocked everywhere. Unblocking everything has caused problems in the past (4 or 5 years ago) and that's why we block them universally. We're not going to unblock everything, but if you identify a minimal patch that solves the exact problem with 1 signal on 1 thread, we'll consider that. I don't want to re-introduce a problem from the past that we solved by blocking signals on all threads that need to be working. (Note that we do not have retries on syscalls that could be interrupted on the decision thread - nor do the libraries.)

rmetrich commented 8 months ago

SIGBUS gets raised when the rpmdb is corrupted and the thread reads the rpmdb, since on customer systems already, and mimic'ed through the proposed reproducer using systemtap. Regarding USR1 and others, external signals always go to the main thread.

I tested this code extensively, from within fapolicyd. Please try testing similarly and you will see signal blocking just doesn't work at all.

stevegrubb commented 8 months ago

Then let's unblock SIGBUS, which wasn't blocked until PR #273, on the thread that caused it. I thought that patch was supposed to fix this problem.

rmetrich commented 8 months ago

The previous patch was indeed supposed to catch deadly signals, but it effectively did only for externally sent signals, not internally raised ones. Please do additional testing, any deadly signal that can be raised internally (SEGV, BUS, ILL and some more probably) need to be trapped or else you just kill the entire system

stevegrubb commented 8 months ago

I did some more review. I found in the posix specification: "Signals which are generated by some action attributable to a particular thread, such as a hardware fault, shall be generated for the thread that caused the signal to be generated." So, what I would say is that all of the signal masks that were added in PR #273, should be reverted since all of those can be specific to a thread. Things that were masked prior to that PR need to stay masked.

In the main thread let coredump_handler still be assigned to the signals so that the other threads can inherit the handler.

Also, if we add #define _GNU_SOURCE before unistd.h, gettid is defined and we do not need to use syscall. As for what the handler should do for a non-main thread, we probably want to stop all processing of that thread (maybe sleep or pthread_join?) which will cause the deadman's switch to activate and terminate the process.

rmetrich commented 8 months ago

OK let me rework this and test extensively.

rmetrich commented 8 months ago

Submitted the new patch as a new commit (will be merged later once everybody agrees on the proposal).

I could confirm the following:

  1. sending external signal such as SIGHUP, SIGUSR1 was getting handled by main thread only
  2. sending SIGBUS externally was getting handled by thread targeted, then redirected to main thread
  3. raising SIGBUS from a thread was getting handled by thread, then redirected to main thread
  4. sending SIGQUIT externally was getting handled by main thread only
stevegrubb commented 8 months ago

Thanks for the updated path. Looks good to me.