socketry / io-event

MIT License

Spurious wakeups #66

Closed ioquatix closed 1 year ago

ioquatix commented 1 year ago

As discussed in https://github.com/socketry/io-event/issues/65, it's possible for the list of events in IO_Event_Selector_EPoll_select to contain cancelled operations. If during cancellation, another operation is scheduled, a spurious wakeup can be generated. Adding one layer of indirection, in the form of cancelled tokens, avoids this situation, since each operation has a unique token which can independently be cancelled.

The overhead of this is extra memory allocations, but these could be pooled. We should compare against the cost of plain malloc; there may be little point in pooling these tokens.

Fixes #65.

ioquatix commented 1 year ago

@Math2 I've implemented what I think is needed for epoll. This also avoids the dup hack by maintaining a separate list of waiting fibers (and the event they are waiting on). The next step is to try this out for kqueue and possibly uring if needed.

The array implementation is split out so we can manage the pooling of element allocations later on if it proves to be a performance issue. Unfortunately using a direct array of structures causes too much pain when reallocation invalidates existing pointers to elements of the array. Another option is to use a multi-level array, e.g. an array of chunks of 1024 descriptors or something. All the implementation can be kept in array.h to keep things organised.

Are you interested in trying to update the kqueue implementation? Otherwise I can have a go at it this weekend. By using a similar approach, we may not need to use the unique user data hack which is unavailable on BSD.

ioquatix commented 1 year ago

Note that the kernel version used by GitHub Actions is too old for io_uring, so that job is still failing; I'll fix it separately. I still have some hope that we don't need such a complex implementation and can have a better fast path.

ioquatix commented 1 year ago

https://github.com/axboe/liburing/discussions/918 - discussion around how pedantic we need to be w.r.t. io_uring cancellations.

ioquatix commented 1 year ago

@Math2 I have some time today, so I'm going to start working on the kqueue implementation.

ioquatix commented 1 year ago

kqueue is now migrated to use the new system.

I ended up combining pid and fd (ident) in the descriptor array. It makes me think it might be a bad idea, but 100,000 entries in that array is at most about 1MB, so it's not too bad.

Math2 commented 1 year ago

Hi. Sorry for the late reply. I haven't had time to look into the details or do a lot of testing yet, but I'm going to. Fixing this was more complicated than I thought; it would have taken me forever if I had tried to do even just the KQueue part... This code is really doing something complex, with lots of details to get right.

The logic in IO_Event_Selector_EPoll_handle()/IO_Event_Selector_KQueue_handle() and the associated functions seems good to me. But I'm really not used to circular linked lists, so I'm having a hard time following it and confirming that the iteration is safe. It looks like it should be, though.

If the "waiting" list nodes never pointed to the head/tail node in the "descriptor" in the array (by implementing a different kind of linked list), the array could be allowed to move. But that would require redoing the linked list implementation (and callers would have to explicitly pass a pointer to the head/tail node in the descriptor for certain operations).

Maybe the "descriptor" struct could be dropped entirely. There would be no special head/tail node for the list, and the FD-indexed array would point directly to some "waiting" node. When removing a node, you'd check if it's the one the FD array points to and set the entry to NULL if so. The union "events" mask would be recalculated every time it's needed (there shouldn't be a lot of fibers waiting on the same IO, and if there are you have to walk the whole list to wake them up anyway). And when dispatching events, you'd look up the list by FD via the array instead of using user data pointers. I think this would allow keeping the circular list implementation.

What happens if a pid and a fd value collide in the array? It SEEMS like it might actually be safe, since it should be impossible for the same fiber to wait on both a fd and a pid... If collisions are safe, I was thinking that the pids could be masked to reduce their range, turning the array into a kind of hash table, but I think the pid-to-pid collisions that could introduce would not be safe. Separate arrays are probably the simplest way to deal with it.

With this patch KQueue works on FreeBSD (it was just being a little bit finicky). Hopefully it doesn't break OS X though. Unfortunately OpenBSD/NetBSD don't have EVFILT_USER. I'm going to look at what could be done for them.

diff --git c/ext/extconf.rb i/ext/extconf.rb
index a535127..8d5bfab 100755
--- c/ext/extconf.rb
+++ i/ext/extconf.rb
@@ -50,9 +50,9 @@ if have_header('sys/epoll.h')
    $srcs << "io/event/selector/epoll.c"
 end

-# The order matters, because we MUST have EV_UDATA_SPECIFIC.
+# The order matters, because we MUST have EVFILT_USER.
 # The `have_header` call is just to add the -D to the compiler flags.
-if have_const('EV_UDATA_SPECIFIC', 'sys/event.h') and have_header('sys/event.h')
+if have_const('EVFILT_USER', 'sys/event.h') and have_header('sys/event.h')
    $srcs << "io/event/selector/kqueue.c"
 end

diff --git c/ext/io/event/selector/kqueue.c i/ext/io/event/selector/kqueue.c
index bdbe077..c266cff 100644
--- c/ext/io/event/selector/kqueue.c
+++ i/ext/io/event/selector/kqueue.c
@@ -823,8 +823,7 @@ VALUE IO_Event_Selector_KQueue_wakeup(VALUE self) {
        struct kevent trigger = {0};

        trigger.filter = EVFILT_USER;
-       trigger.flags = EV_ADD | EV_CLEAR | EV_UDATA_SPECIFIC;
-       trigger.fflags = NOTE_TRIGGER;
+       trigger.flags = EV_ADD | EV_CLEAR;

        int result = kevent(selector->descriptor, &trigger, 1, NULL, 0, NULL);

@@ -832,6 +831,16 @@ VALUE IO_Event_Selector_KQueue_wakeup(VALUE self) {
            rb_sys_fail("IO_Event_Selector_KQueue_wakeup:kevent");
        }

+       // FreeBSD apparently only works if the NOTE_TRIGGER is done as a separate call.
+       trigger.flags = 0;
+       trigger.fflags = NOTE_TRIGGER;
+       
+       result = kevent(selector->descriptor, &trigger, 1, NULL, 0, NULL);
+       
+       if (result == -1) {
+           rb_sys_fail("IO_Event_Selector_KQueue_wakeup:kevent");
+       }
+       
        return Qtrue;
    }

Also, I think that EV_ONESHOT should be used when polling IO. As it is, nothing stops the polling if the fibers don't make the condition disappear (e.g. they don't read the whole input and then stop waiting on it), and it can end up busy-looping.

diff --git i/ext/io/event/selector/kqueue.c w/ext/io/event/selector/kqueue.c
index c266cff..8941350 100644
--- i/ext/io/event/selector/kqueue.c
+++ w/ext/io/event/selector/kqueue.c
@@ -252,7 +252,7 @@ int IO_Event_Selector_KQueue_arm(struct IO_Event_Selector_KQueue *selector, uint
    if (events & IO_EVENT_READABLE) {
        kevents[count].ident = ident;
        kevents[count].filter = EVFILT_READ;
-       kevents[count].flags = EV_ADD | EV_ENABLE;
+       kevents[count].flags = EV_ADD | EV_ENABLE | EV_ONESHOT;
        kevents[count].udata = (void *)kqueue_descriptor;

 // #ifdef EV_OOBAND
@@ -267,7 +267,7 @@ int IO_Event_Selector_KQueue_arm(struct IO_Event_Selector_KQueue *selector, uint
    if (events & IO_EVENT_WRITABLE) {
        kevents[count].ident = ident;
        kevents[count].filter = EVFILT_WRITE;
-       kevents[count].flags = EV_ADD | EV_ENABLE;
+       kevents[count].flags = EV_ADD | EV_ENABLE | EV_ONESHOT;
        kevents[count].udata = (void *)kqueue_descriptor;
        count++;
    }
Math2 commented 1 year ago

I did some improvements and added a test case that triggered the busy-loop in this branch: https://github.com/Math2/io-event/commits/spurious-wakeups

The test case also exposes some issues that apply to EPoll.

In IO_Event_Selector_*_handle(), after a fiber is resumed, the events it was waiting on are added to waiting_events. But it might no longer be waiting on those events. So the selector will keep polling for those events. But there might not be any fibers still waiting on them at that point.

With EPoll, it'll just stop polling for them on the NEXT iteration. And the test fails because of that: it dispatches events just one more time than necessary. It's not a big deal, but it'd be better to fix it, if only to keep things easier to understand. But also, I think the loop could miss some events that should go in waiting_events for fibers that add themselves to the list during the iteration. The loop starts from the tail, and fibers add themselves at the tail, so it won't see the newly added fibers (and that's necessary to avoid endless loops, since some fibers will always re-add themselves after being resumed). Better to gather the waiting_events in a second pass, after the fibers have stopped messing around.

With KQueue, there's no EV_DELETE call anywhere so it just kept polling forever. I think EV_ONESHOT here does the trick because read/write events are handled separately. Once we got a read event, we'll necessarily have resumed all of the fibers that were waiting on reads after the loop since they're all linked together, so we can stop polling on reads altogether, and EV_ONESHOT does that for us automatically (and if fibers want to read again, they'll request a new one-shot event before blocking). So KQueue can be simplified. Since we never explicitly cancel events, it's still possible to get a stray one-shot event, but this should be harmless since it'll be checked against the waiting list, which is always up-to-date. With EPoll, EPOLLONESHOT would affect events for both reads and writes and it could make us miss events, so it has to keep track of things more carefully.

ioquatix commented 1 year ago

Thanks - feel free to make a PR targeting this branch if that's easy for you, it will be easy for me.

ioquatix commented 1 year ago

Lookup the descriptor by FDs/PIDs to avoid dangling pointers if descriptors move.

This should never happen as the actual descriptor struct is malloced separately from the array, which can grow in size and be realloced. Did you observe some problems?

Math2 commented 1 year ago

Did you observe some problems?

Ah. No. I mistakenly thought that they were allocated directly in the array and that they could move when the array is reallocated... Yeah, 9adbd25ec515dbbcb5f51202cdaa90feb8e40912 should be dropped. Should I make a pull request for that?

I found some other problem that I really can't figure out... This script fails with both EPoll and KQueue.

require 'async'
Sync do
    10.times do
        fail unless `echo test` == "test\n"
        fail unless $?.success?
        warn $?.inspect
    end
end

It works on main branch io-event. On this branch, it stalls after the first iteration with EPoll, and on KQueue most of the time some iteration fails because $? is nil.

With strace on Linux, it seems that pidfd_open() isn't even called in the iteration before it fails. With truss/ktrace on FreeBSD, everything looks like it should work, the kevent() returns that the process has exited, but the wait4() sometimes returns 0 for the pid. I can't make sense of this...

ioquatix commented 1 year ago

There are two more things I want to fix:

  1. Explicitly track #select calls and prevent re-entrancy. This prevents invalid usage.
  2. Extend the waiting struct to carry the ready events rather than sending them to the fiber as an argument. This ensures that wakeups unrelated to the waited-on event can be explicitly detected.

Regarding your failing case, can you please add it as a test case to this PR? Thanks! I've added you as a maintainer if that's helpful, otherwise just make a PR and set this branch as the target.

ioquatix commented 1 year ago

With KQueue, there's no EV_DELETE call anywhere so it just kept polling forever.

Yeah, I may have missed that.

The goal is to avoid using oneshot.

A common pattern looks like this:

while data = io.read
  process(data)
end

By checking the waiting list AFTER returning from the fiber, we might avoid re-arming the event in the OS, i.e. it reduces the number of system calls. With one-shot, we need to re-arm the event (one system call) per iteration of the loop. But with lazy removal, we can mostly avoid it, and if an event arrives that we are no longer interested in, we can remove it at that point.

ioquatix commented 1 year ago

If EVFILT_USER is not present, then kqueue can use the IO_Event_Interrupt helper. See interrupt.h and how it's used in epoll.c.

Math2 commented 1 year ago

Regarding your failing case, can you please add it as a test case to this PR?

I couldn't come up with a test case that uses io-event API. I'll try again by messing with fibers, but so far it only happens within async tasks. And not even 100% reliably. And I can't explain it at all... Maybe it's a CRuby bug somehow, no idea...

By checking the waiting AFTER returning from the fiber, we might avoid re-arming the event in the OS, i.e. it reduces the number of system calls. With one-shot, we need to re-arm the event (one system call) per iteration of the loop. But with lazy removal, we can mostly just avoid it, and if an event comes that we are not interested in any more, we can remove it at that point.

Oh OK yeah I think I get it...

My change to EPoll might be wrong then. I still think waiting_events should be gathered in a second pass, but it's not wrong to keep the events for the resumed fibers armed (like they were before). I'll look into it.

EVFILT_USER if not present, then kqueue can use IO_Event_Interrupt helper. See interrupt.h and how it's used in epoll.c.

Nice. I'll plug that in and test on OpenBSD/NetBSD once things work as they should on Linux/OS X/FreeBSD.

ioquatix commented 1 year ago

My change to EPoll might be wrong then. I still think waiting_events should be gathered in a second pass, but it's not wrong to keep the events for the resumed fibers armed (like they were before). I'll look into it.

I'll take a look, but I don't think 2 passes are needed. Before enumerating, we can set x_descriptor->events to 0, and then when a fiber wants to re-arm, it just enables it. However, I also need to think about it some more.

ioquatix commented 1 year ago

Okay, I think I'm happy with the epoll.c implementation now. I'm keeping track of the waiting_events and registered_events so we can do the minimal update.

ioquatix commented 1 year ago

I found some other problem that I really can't figure out... This script fails with both EPoll and KQueue.

Async is part of the external test suite of io-event so if you want to add the test to async, it will be run as part of this PR.

ioquatix commented 1 year ago

I believe your test script is working on this branch now - at least I tested it and it seemed okay to me.

ioquatix commented 1 year ago

This is enough of an improvement to merge. There are still several issues to follow up on: