Closed arichardson closed 3 years ago
I tried to debug this and include a fix, but so far I've failed to narrow it down further than the poll
call in epollfd_ctx_wait
sometimes failing. It also only reproduces 1/10 times and so far never when ran under truss
/ktrace
Thanks for the test! It looks like we are hitting a race in the FreeBSD kernel. I've managed to reproduce it with kqueue
:
#ifdef NDEBUG
#undef NDEBUG
#endif
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/event.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <poll.h>
#include <unistd.h>
int
main(void)
{
int rv;
sigset_t set;
rv = sigemptyset(&set);
assert(rv == 0);
rv = sigaddset(&set, SIGUSR1);
assert(rv == 0);
rv = sigprocmask(SIG_BLOCK, &set, NULL);
assert(rv == 0);
int skq = kqueue();
assert(skq >= 0);
struct kevent kev;
EV_SET(&kev, SIGUSR1, EVFILT_SIGNAL, EV_ADD, 0, 0, 0);
rv = kevent(skq, &kev, 1, NULL, 0, NULL);
assert(rv == 0);
int kq = kqueue();
assert(kq >= 0);
EV_SET(&kev, skq, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, 0);
rv = kevent(kq, &kev, 1, NULL, 0, NULL);
assert(rv == 0);
for (;;) {
rv = kill(getpid(), SIGUSR1);
assert(rv == 0);
/* Turn this into `#if 1` to avoid the race. */
#if 1
rv = kevent(kq, NULL, 0, &kev, 1, NULL);
#else
rv = kevent(kq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
#endif
assert(rv == 1);
rv = kevent(kq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
assert(rv == 0);
rv = kevent(skq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
assert(rv == 1);
rv = kevent(skq, NULL, 0, &kev, 1, &(struct timespec) { 0, 0 });
assert(rv == 0);
siginfo_t siginfo;
rv = sigtimedwait(&set, &siginfo, &(struct timespec) { 0, 0 });
assert(rv == SIGUSR1);
rv = sigtimedwait(&set, &siginfo, &(struct timespec) { 0, 0 });
assert(rv < 0);
assert(errno == EAGAIN);
}
}
skq
is the "signalfd", essentially -- a kqueue with EVFILT_SIGNAL
registered.kq
is the "epollfd"So the notification from the kill(getpid(), SIGUSR1);
call makes its way through both skq
and kq
. Polling kq
directly after the kill
may result in nothing when the signal wasn't "fast" enough.
I think I recall from having looked at the FreeBSD kernel sources some time ago that this happens because various parts of the involved signal/notification mechanisms can be scheduled on different CPU cores. So it isn't 100% guaranteed that right after the kill(getpid(), SIGUSR1);
call all listeners have been notified.
I'm curious if Linux guarantees that the signalfd
/epoll
equivalent of the above code will always work, or if the race window is so short that it will always "appear" to work correctly.
I'm curious if Linux guarantees that the
signalfd
/epoll
equivalent of the above code will always work, or if the race window is so short that it will always "appear" to work correctly.
I would assume that at least a single-thread program always sees the signal immediately (even if it was migrated to another CPU between the kill and epoll_wait call). Similarly I'd also assume it to be visible immediately for a signal sent to the current thread using pthread_kill/thr_kill/tkill. Have you by any chance reported this race in the FreeBSD bug tracker?
Have you by any chance reported this race in the FreeBSD bug tracker?
No, I haven't -- but I'm also not sure if this is an actual bug in the kernel or "by design". One might interpret this behavior as violating the "spirit of POSIX", because they say:
If the value of pid causes sig to be generated for the sending process, and if sig is not blocked for the calling thread and if no other thread has sig unblocked or is waiting in a sigwait() function for sig, either sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns.
Of course there is no kqueue/epoll in POSIX, but to me it reads that all side effects of the signal should be observable after the kill() returned.
I merged your test case, but I changed it so that it won't fail outright, but only skip. I guess this falls under the category of edge cases that cannot be 100% faithfully emulated right now.
Thanks again!
Reported as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=258310 since I'm not sure if this is intended or an implementation bug.
Fantastic! I'll update the text in the atf_tc_skip
accordingly. I guess skipping the test is better for now, because I don't know yet how the other BSD's handle this scenario.
The wayland event_loop_signal test fails non-deterministically on FreeBSD due to a missing SIGUSR1 notification. This test case is a minimal reproducer for the issue: it passes on Linux but fails when run on FreeBSD.