socketry / io-event

Spurious resumption of fibers unwinding past their blocking operation #65

Closed Math2 closed 1 year ago

Math2 commented 1 year ago

Failing test case:

diff --git a/test/io/event/selector.rb b/test/io/event/selector.rb
index 88dd994..fb1eca5 100644
--- a/test/io/event/selector.rb
+++ b/test/io/event/selector.rb
@@ -283,6 +283,49 @@ Selector = Sus::Shared("a selector") do
                :select, :readable, :readable
            ]
        end
+
+       it "can handle exception raised during wait from another fiber that was waiting on the same io" do
+           [false, true].each do |swapped| # Try both orderings.
+               writable1 = writable2 = false
+               error1 = false
+               raised1 = false
+               
+               boom = Class.new(RuntimeError)
+               
+               fiber1 = fiber2 = nil
+               
+               fiber1 = Fiber.new do
+                   begin
+                       selector.io_wait(Fiber.current, local, IO::WRITABLE)
+                   rescue boom
+                       error1 = true
+                       # Transfer back to the signaling fiber to simulate doing something similar to raising an exception in an asynchronous task or thread.
+                       fiber2.transfer
+                   end
+                   writable1 = true
+               end
+               
+               fiber2 = Fiber.new do
+                   selector.io_wait(Fiber.current, local, IO::WRITABLE)
+                   # Don't do anything if the other fiber was resumed before we were by the selector.
+                   unless writable1
+                       raised1 = true
+                       fiber1.raise(boom) # Will return here.
+                   end
+                   writable2 = true
+               end
+               
+               fiber1.transfer unless swapped
+               fiber2.transfer
+               fiber1.transfer if swapped
+               selector.select(0)
+               
+               # If fiber2 did manage to be resumed by the selector before fiber1, it should have raised an exception in fiber1, and fiber1 should not have been resumed by the selector since its #io_wait call should have been cancelled.
+               expect(error1).to be == raised1
+               expect(writable1).to be == !raised1
+               expect(writable2).to be == true
+           end
+       end
    end

    with '#io_read' do

I tried to explain what I think is going on in PR #64.

But I think I got it a little wrong wrt the EWOULDBLOCK problems it could lead to, because even though the fibers might get resumed sooner than they should be, the EWOULDBLOCK should still be dealt with correctly afterward (IIUC). So it's not just an ordinary "spurious wake up" issue, which, like you said, has to be dealt with anyway in certain contexts: if you get an event that isn't really true (which can apparently always happen on most OSes), and you're still expecting that kind of event, you can deal with it (and CRuby does: it handles EWOULDBLOCK errors and waits/retries). But in this case, the selector can resume fibers in a way they don't expect, while they're in a different context. The fibers could be waiting on a different kind of event by then, or on some arbitrary user logic that would resume them manually somehow.

For example, this crashes here (on Ruby 3.1, with IO_EVENT_SELECTOR=EPoll). CRuby must not be happy having its fiber resumed as if it had IO ready when it was really asking if a process had exited.

#!/usr/bin/env ruby

require 'socket'
require 'async'

Sync do
  s1, s2 = Socket.pair(:UNIX, :STREAM)

  pid = Process.spawn("sleep 10")

  t1 = Async do
    s1.wait_readable
  ensure
    system 'sleep 10'
  end

  t2 = Async do
    s1.wait_readable
    t1.stop
  end

  s2.write('x')
end

The problem isn't limited to waiting on IO, either; it could have been the other way around: a fiber waiting on a process, unwinding and no longer waiting on it, and then being resumed by the selector as if the process had exited, while the fiber was now waiting on something else.

What's annoying is that fixing this would probably require reworking all of the selectors (except Select, which has always dealt with this correctly with its Waiter objects, IIUC) and making them more complicated...

ioquatix commented 1 year ago

Thanks for this, I'll take a look today.

ioquatix commented 1 year ago

In epoll.c you can enable DEBUG:

enum {
    DEBUG = 1,
};

Then we can get some more detailed logs:

epoll_flags_from_events events=1 flags=1073741849
<- fiber=0x7f6227f7fae0 descriptor=7
epoll_flags_from_events events=1 flags=1073741849
<- fiber=0x7f6227f7f158 descriptor=7
-> ptr=0x7f6227f7f158 events=1
io_wait_transfer errno=17
io_wait_transfer flags=1
events_from_epoll_flags flags=1
-> ptr=0x7f6227f7fae0 events=1
/home/samuel/Projects/socketry/io-event/test.rb:17: [BUG] Segmentation fault at 0x0000000000000028

In theory, when unwinding, the event should be removed from EPoll. It does not look like that is happening in this case, so I'll review it in more detail.

To run your test.rb against the local build, it needs to look like this:

#!/usr/bin/env ruby

$LOAD_PATH << ::File.expand_path("ext", __dir__)
require_relative 'lib/io/event'

require 'socket'
require 'async'

Sync do
  s1, s2 = Socket.pair(:UNIX, :STREAM)

  pid = Process.spawn("sleep 10")

  t1 = Async do
    s1.wait_readable
  ensure
    system 'sleep 2'
  end

  t2 = Async do
    s1.wait_readable
    t1.stop
  end

  s2.write('x')
end

And run it like this:

> IO_EVENT_SELECTOR=EPoll bundle exec ./test.rb

ioquatix commented 1 year ago

I understand the problem now.

The issue is, the events are already in the application buffer and EPOLL_CTL_DEL effectively has no impact since the events are already generated.

The way to fix this, which I was trying to avoid, is to use an explicit handle which can be cancelled, rather than storing the fiber directly into the event data.ptr itself. The handle can get cancelled, which does not affect the epoll itself.

This is the approach polyphony takes. I did know about this race condition, but couldn't come up with a way to expose it. I think, because this test is pretty straightforward, we do need to move to a handle-based approach, which sucks, because it requires extra allocations.

I'll think about whether there are any alternative approaches we could take to avoid this.
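
Roughly, a handle-based registration might look something like this (a hypothetical sketch with made-up names, not io-event's or polyphony's actual code):

#include <stdlib.h>
#include <sys/epoll.h>

struct io_wait_handle {
    void *fiber; /* the waiting fiber, or NULL once cancelled */
};

/* Registration (fiber side): the handle, not the fiber, goes into data.ptr. */
static struct io_wait_handle *io_wait_register(int epollfd, int descriptor, void *fiber) {
    struct io_wait_handle *handle = malloc(sizeof(*handle));
    handle->fiber = fiber;

    struct epoll_event event = {
        .events = EPOLLIN | EPOLLONESHOT,
        .data = {.ptr = handle},
    };
    epoll_ctl(epollfd, EPOLL_CTL_ADD, descriptor, &event);

    return handle;
}

/* Cancellation (fiber side, e.g. while unwinding): the event may already be
 * in the application buffer, but the handle can always be neutralised. */
static void io_wait_cancel(struct io_wait_handle *handle) {
    handle->fiber = NULL;
}

/* Dispatch (selector side): skip handles that were cancelled after their
 * event was already copied into the application buffer. */
static void io_wait_dispatch(struct epoll_event *event) {
    struct io_wait_handle *handle = event->data.ptr;

    if (handle->fiber) {
        /* ...transfer control back to handle->fiber here... */
    }

    free(handle);
}

The open question in a sketch like this is still who frees the handle when the wait is cancelled and no event ever arrives.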

Math2 commented 1 year ago

The way to fix this, which I was trying to avoid, is to use an explicit handle which can be cancelled, rather than storing the fiber directly into the event data.ptr itself. The handle can get cancelled, which does not affect the epoll itself.

This is the approach polyphony takes. I did know about this race condition, but couldn't come up with a way to expose it. I think, because this test is pretty straightforward, we do need to move to a handle-based approach, which sucks, because it requires extra allocations.

I'll think about whether there are any alternative approaches we could take to avoid this.

Those "handles" could deal with multiple fibers waiting on the same IO too, and eliminate the dup() hack in EPoll and the EV_UDATA_SPECIFIC hack in KQueue (which is OS X-only).

I think another way could be to have the fibers directly remove themselves from the selector's events buffer (if it is currently processing one), i.e. directly in the array of epoll_event/kevent structs. When the selector starts iterating, it would point a global (per-selector?) pointer variable at the buffer (and set it back to NULL when it's done iterating). Then, when a fiber unwinds, it checks the buffer to see if it's in it, and marks its entry as disabled (by setting the user data pointer to NULL, I guess). The selector then skips that entry if it encounters it. If the buffer isn't too big (64 events in EPoll/KQueue's case), a linear search wouldn't be too bad, and it would only need to be done when fibers unwind due to an exception.
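
Something like this, roughly (a hypothetical sketch of the idea, assuming an EPoll-style selector that keeps its epoll_wait output buffer around while it iterates; the names are made up):

#include <stddef.h>
#include <sys/epoll.h>

#define EPOLL_BUFFER_SIZE 64

struct selector {
    struct epoll_event events[EPOLL_BUFFER_SIZE];
    int ready_count; /* > 0 only while the selector is processing a batch */
};

/* Called by a fiber as it unwinds past its io_wait: */
static void unwinding_fiber_disable_pending(struct selector *selector, void *fiber) {
    for (int i = 0; i < selector->ready_count; i++) {
        if (selector->events[i].data.ptr == fiber) {
            selector->events[i].data.ptr = NULL; /* mark the entry as disabled */
        }
    }
}

/* The selector's dispatch loop then skips disabled entries: */
static void dispatch_ready_events(struct selector *selector) {
    for (int i = 0; i < selector->ready_count; i++) {
        void *fiber = selector->events[i].data.ptr;
        if (fiber == NULL) continue; /* cancelled while we were iterating */
        /* ...resume `fiber` with events[i].events... */
    }
    selector->ready_count = 0;
}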

You mentioned linking together waiter objects allocated directly on the fibers' stacks too. But maybe it's not necessary at all.

I tried to come up with more tests BTW, but it's pretty hard to get something that reliably demonstrates the problem, and it gets more confusing the more I try. I think the one in this PR exposes the root cause of it, though.

ioquatix commented 1 year ago

Thanks for all those ideas, it's really helpful to have a 2nd brain working on the problem.

In terms of performance, the only implementation I actually really care about is io_uring. Everything else can sacrifice performance for correctness, including kqueue. In fact, I believe io_uring is the only interface capable of implementing every operation asynchronously anyway, and it's the one we expect to see the most (99%+) in production environments going forward. So, I'm okay with sacrificing a little performance for correctness.

Maybe it's impossible, but, my general model for the design has been to try and keep the individual fibers as isolated as possible, to avoid a combinatorial explosion of program states - when fibers can share a single epoll event, things could potentially get very tricky - e.g. different sets of events, etc. Because, AFAIK, this is a non-issue with io_uring, my gut feeling is just to aim for the simplest and most correct implementation in all cases. In other words, I'm less interested in trying to be clever for the sake of performance, because I feel the chance of bugs is much higher.

In summary, the cost of bugs at this level is significantly more than the value of performance at this point in time, especially for anything other than io_uring.

Regarding tests, the test.rb you provided is actually really decent and reproduces the problem reliably. We can extrapolate on that if needed.

Math2 commented 1 year ago

No problem! Glad to be helping fix the bugs/limitations. I always tried to avoid multi-threaded programming like the plague, and non-blocking fibers really seem to me like how things always should have worked. But they're hard to get right; you need something like an in-process mini-OS to manage them...

Yeah AFAIK to go "full asynchronous" with kqueue (and epoll), you have to use AIO for vnode descriptors. But that's a huge hassle and almost no one does it.

ioquatix commented 1 year ago

I'm running into another issue.

At what point can we guarantee kqueue/epoll is done with the user data (token)?

It almost seems like we'd need a reference count...

Because there is "fiber-side" cancellation as well as "select-side" completion: even if the select side has already completed it, it can't deallocate it, because the fiber side might still need to cancel, etc.

For example, if you close the file descriptor while events are in flight, I assume they never resolve, in which case you'd never know to release the underlying resources/token.

ioquatix commented 1 year ago

I think the correct solution is a reference count... damn.

Math2 commented 1 year ago

Not sure I fully understand the design yet. But I think you could "cancel" the tokens in the fibers, when they unwind, and "release" them in the selector, after a full pass over the received events. You'd do a second pass just to free them: first pass, resume the fibers with non-cancelled tokens; second pass, release the tokens.
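
Roughly like this (a hypothetical sketch with made-up names; it glosses over tokens whose events never make it into a ready batch):

#include <stdlib.h>

struct token {
    void *fiber;
    int cancelled; /* set by the fiber if it unwinds past its wait */
};

static void selector_dispatch(struct token **ready, int count) {
    /* First pass: resume the fibers whose tokens were not cancelled. A fiber
     * resumed here may cancel other tokens in the batch, but it must not
     * free them, because the selector still holds them. */
    for (int i = 0; i < count; i++) {
        if (!ready[i]->cancelled) {
            /* ...transfer to ready[i]->fiber... */
        }
    }

    /* Second pass: the selector is done with this batch, so it releases
     * every token, cancelled or not. */
    for (int i = 0; i < count; i++) {
        free(ready[i]);
    }
}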

ioquatix commented 1 year ago

Ah, yes, a two pass approach might avoid the need for reference counting, nice idea. After the first pass, all the tokens should be effectively consumed. I think we still have to deal with cancellation too, though; let me see what I can come up with.

ioquatix commented 1 year ago

During cancellation, on the fiber side, we can call e.g. epoll_ctl(EPOLL_CTL_DEL), which returns ENOENT or success. ENOENT means the token was not in epoll, and success means it was. In the case of ENOENT, we don't know whether that's because the event is still in the selector-side list waiting to be processed, or because it has already been processed. If it is still in the list, we should wait for the selector to release the token; otherwise we should release it ourselves, but we can't tell which case we're in. If we deallocate the token just because epoll_ctl returned ENOENT, it might still be in use by the selector = crash.
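
In code, that cancellation path looks something like this (a hypothetical sketch, made-up names):

#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>

/* Fiber-side cancellation, e.g. while unwinding past io_wait: */
static void io_wait_cancel(int epollfd, int descriptor) {
    if (epoll_ctl(epollfd, EPOLL_CTL_DEL, descriptor, NULL) == 0) {
        /* Success: the registration was still in epoll. */
    } else if (errno == ENOENT) {
        /* The token was not in epoll. Either its event is still sitting in
         * the selector-side list waiting to be processed (the selector
         * still needs the token), or it has already been processed (nobody
         * does). We can't tell which, so freeing the token here risks a
         * use-after-free in the selector. */
    }
}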

Math2 commented 1 year ago

Oh, I think I see the problem now. Yeah: what if the fibers unwind for whatever reason (and unregister themselves from the event mechanism) before the selector could process their events (and thus free their tokens)? So the fibers have to be able to release their tokens themselves. But then they could release their tokens before the selector was done with them...

I think I'd add a "pinned" flag to the tokens. In the selector, the first pass pins tokens, the second pass dispatches, and the third pass unpins and releases tokens. Fibers only release their token if it's not pinned. And that's pretty much reference counting, just with only one reference possible. Yeah, I wonder if there's a better way...
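
Extending the earlier two-pass sketch with that flag (still hypothetical, made-up names; it assumes cancellation also deregisters from epoll, and that no fiber code runs between epoll_wait returning and the pinning pass):

#include <stdlib.h>

struct pinned_token {
    void *fiber;
    int cancelled; /* set by the fiber when it unwinds */
    int pinned;    /* set by the selector while the batch is being processed */
};

static void selector_dispatch(struct pinned_token **ready, int count) {
    /* Pass 1: pin the whole batch so fibers only mark, never free. */
    for (int i = 0; i < count; i++) ready[i]->pinned = 1;

    /* Pass 2: dispatch to fibers whose tokens were not cancelled. */
    for (int i = 0; i < count; i++) {
        if (!ready[i]->cancelled) {
            /* ...transfer to ready[i]->fiber... */
        }
    }

    /* Pass 3: unpin and release. */
    for (int i = 0; i < count; i++) {
        ready[i]->pinned = 0;
        free(ready[i]);
    }
}

/* Fiber side, while unwinding: */
static void token_cancel(struct pinned_token *token) {
    token->cancelled = 1;

    if (!token->pinned) {
        free(token); /* the selector isn't holding it right now */
    }
}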

ioquatix commented 1 year ago

In my experience, trying to make a general purpose interface with OS interfaces like epoll is insanely difficult due to the fact that almost any combination of behaviour is possible. If you were directly implementing a client/server etc, you'd just avoid these cases as part of the technical implementation. Because we can't rule out any odd combination of behaviour, in the end, reference counting may be the only option.

For io_uring and kqueue (with unique udata), there is a little more predictability, but not much.

ioquatix commented 1 year ago

I've been playing around with this over the weekend.

I'm not much closer to a solution.

It seems to me, without some "token"-based approach for event registration, there is a chance of spurious wake ups no matter what.

  1. We can cancel an event from EPoll, but it might already be in the list of processing events, as discussed.
  2. KQueue "rw" io_wait is actually 2 filters. In theory, both could occur in the same call to kevent and this could resume the same fiber twice.
  3. io_uring has better semantics around cancellation, but might still have CQEs during cancellation the same as Epoll.

I started working through the implementation of the token-based approach, but it's looking almost impossible to get right. For example, with kevent, we need to discretely add and remove individual filters to correctly reference count the token.

I'm also concerned about whether non-token based approaches can work - i.e. the proposed approach of storing pending events to be processed, and scanning them on cancellation, assumes that if the event is not pending (e.g. in the CQE list), it can be cancelled. I'm not sure this is always true.

There are two ideas I think might work:

  1. Some kind of serial number stored per-event. When the fiber is resumed, we validate that the event is still relevant (sketched after this list). This is hard to implement without having an extra field in the user data per event, as we need to store both the fiber AND some kind of serial number.
  2. Validate the event still matches what the fiber was expecting and only resume it in that case. This assumes that the event data contains enough information to make that determination, and I'm not 100% confident that's the case.
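
A minimal sketch of idea 1 (hypothetical, made-up names; it sidesteps the question of when the tag itself gets freed, which is the same lifetime problem as the token):

#include <stdint.h>

/* Lives as long as the fiber (e.g. hanging off the fiber object): */
struct fiber_wait_state {
    void *fiber;
    uint32_t generation; /* bumped every time the fiber starts a new wait */
};

/* Stored in the event's user data (data.ptr / udata) at registration time: */
struct event_tag {
    struct fiber_wait_state *state;
    uint32_t generation; /* snapshot of state->generation when registered */
};

/* Selector side: only resume the fiber if the event still belongs to the
 * wait it was registered for; otherwise the fiber has since moved on, so
 * drop the event. */
static void dispatch(struct event_tag *tag, int events) {
    if (tag->generation == tag->state->generation) {
        /* ...resume tag->state->fiber with `events`... */
    }
}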

Using a token is the most generally correct option, but we have to reference count it correctly, i.e. we need to know exactly how many times it's referenced by the OS in order to implement it properly. One specific challenge here is dealing with close and how it impacts the selector (e.g. it can clear internal events).

Scanning pending events during cancellation might also work, but for kqueue, due to using multiple filters, it would actually require us to potentially scan the pending event list for every io_wait.

ioquatix commented 1 year ago

I took a look at how libev works.

Essentially they have a per-selector array of size == maximum fd, and each entry is a linked list of watchers. Addition and removal is O(N) but we can assume N is normally pretty small.

void
ev_io_start (EV_P_ ev_io *w) EV_NOEXCEPT
{
  int fd = w->fd;

  if (ecb_expect_false (ev_is_active (w)))
    return;

  assert (("libev: ev_io_start called with negative fd", fd >= 0));
  assert (("libev: ev_io_start called with illegal event mask", !(w->events & ~(EV__IOFDSET | EV_READ | EV_WRITE))));

#if EV_VERIFY >= 2
  assert (("libev: ev_io_start called on watcher with invalid fd", fd_valid (fd)));
#endif
  EV_FREQUENT_CHECK;

  ev_start (EV_A_ (W)w, 1);
  array_needsize (ANFD, anfds, anfdmax, fd + 1, array_needsize_zerofill);
  wlist_add (&anfds[fd].head, (WL)w);

See the last line in the above code. It does not look like this list is ever resized and only ever expands as higher FDs are encountered.

This avoids lifetime issues by simply never removing entries for valid FDs. Once a file descriptor is registered for an event, the "token" is valid forever. However, this only works for system calls that use file descriptors.

Math2 commented 1 year ago

I spent a while thinking about it, and I looked at the selector implementations, mostly KQueue since I'm most familiar with it, but EPoll is very similar. I hope that anything that can be made to work with those can work with io_uring too, since it's more flexible, but I don't know much about it.

I have an idea, but I'm not 100% clear on everything... I'm sure there are other complications I'm not thinking of, but I think it works, and it could even be made not too complicated.

First, the close() issue seems really hard to deal with. And not having a clear way of dealing with it makes it much harder to think about the rest of the design. So I started from there.

If a fiber has the FD it's waiting on close(2)'d by another fiber, I think it just cannot end well. IIUC, select(2)/poll(2) can fail the syscall as a whole, or maybe even report events for a different IO object (if the FD gets reallocated). kqueue deletes pending events and stops polling the FD. But the epoll manpages say that it only reacts to closed FDs once the underlying kernel file they point to has had its last reference closed, and they only mention the "interest list", not the "ready list". You can manually remove individual FDs, but if they get closed by another fiber, that won't happen. And even if all of this could be dealt with, with no FD confusion and the in-kernel pending events list properly pruned, it's still a big problem that we'll never get any event for that FD anymore and the waiting fiber won't ever be resumed, not by the selector anyway. It'll just stall there.

I might be wrong but I think that trying to design the selectors to correctly deal with this is impossible. The only solution is to not let the situation happen in the first place.

Not sure how realistic this is, but I think that to deal with it, #close on Ruby-level IO objects would have to be hooked into, and made to do whatever it must to purge the IO from the event mechanism and resume the corresponding waiting fibers (with an IO exception, since the IO object they were waiting on became invalid). It's why closing an IO object unexpectedly only does limited damage in Ruby: you don't work directly with FDs, the IO object knows it got closed, and you don't have to worry about accidentally using another FD or stalling forever. But to keep it working like that with an asynchronous event system, #close is going to have to get more complicated...

And to make this work, the fibers, their polled IO objects, and the pending events, would all have to be linked together in some intermediate data structure. They'd be kind of like the Waiter objects that Select has. Unfortunately it's more complicated than the tokens but I think it's what it takes to deal with #close'd IO objects.

It's possible to attach arbitrary instance variables to IO objects and use them from CRuby extensions, right? I think every IO object should act as a kind of "hub" for the fibers that wait on it. The fibers would be queued in a doubly linked list (with the nodes allocated on their stacks); let's call those nodes "Waiters". They'd work pretty much like the Select selector's Waiters. You'd poll every IO object only once, no matter how many fibers are waiting on it. You'd point the events' user data pointers directly at the IO objects, and iterate over their Waiters when they're triggered. And since the Waiters are reachable directly from the IO objects, #close would be able to find them and deal with the queued fibers. When a fiber unwinds, it unlinks its Waiter from the queue, and if the queue becomes empty, it cancels the polled event. When the IO is #close'd, it raises an exception on all of the waiting fibers and lets them unwind.
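
Very roughly, the structure I have in mind (a hypothetical sketch with made-up names, ignoring the Ruby object wrapping and scheduling details):

struct waiter {
    struct waiter *previous;
    struct waiter *next;
    void *fiber;
    int events; /* events this fiber is interested in */
};

struct io_hub {
    int descriptor;
    struct waiter head; /* sentinel node; list is empty when head.next == &head */
    int registered;     /* whether the descriptor is currently in epoll/kqueue */
};

static void hub_init(struct io_hub *hub, int descriptor) {
    hub->descriptor = descriptor;
    hub->head.next = hub->head.previous = &hub->head;
    hub->registered = 0;
}

/* Fiber side, before blocking: append ourselves; only the first waiter
 * actually registers the descriptor with the event mechanism. */
static void hub_push(struct io_hub *hub, struct waiter *waiter) {
    waiter->previous = hub->head.previous;
    waiter->next = &hub->head;
    hub->head.previous->next = waiter;
    hub->head.previous = waiter;
}

/* Fiber side, while unwinding (or in #close): unlink; if the list becomes
 * empty, the caller cancels the polled event. */
static void hub_remove(struct waiter *waiter) {
    waiter->previous->next = waiter->next;
    waiter->next->previous = waiter->previous;
    waiter->next = waiter->previous = waiter; /* safe to remove twice */
}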

If that takes the #close issue out of the way, everything becomes easier to reason about. The fibers register/unregister the IO object with the event mechanism before they block and when they unwind, respectively; that should be the only way it can happen. And they always know when they're the last fiber waiting on a particular event type for a given FD, since it's all in the same queue.

No reference counting, I think. Just the usual "safe" linked-list iteration trick in the loop that dispatches triggered events to the queued fibers, to let the fibers safely unlink themselves even if they're the iteration's current node.
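
Continuing the sketch above, that dispatch loop would be something like (hypothetical):

/* Selector side, when an event fires for hub->descriptor: remember the next
 * node before resuming, so the current fiber can unlink itself without
 * breaking the loop. */
static void hub_dispatch(struct io_hub *hub, int ready_events) {
    struct waiter *waiter = hub->head.next;

    while (waiter != &hub->head) {
        struct waiter *next = waiter->next; /* saved before the fiber runs */

        if (waiter->events & ready_events) {
            /* ...transfer to waiter->fiber; it may call hub_remove() on
             * itself before control returns here... */
        }

        waiter = next;
    }
}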

ioquatix commented 1 year ago

There is a disabled scheduler hook for IO#close, but it's difficult to use, because anyone can write IO.for_fd(n).close and that isn't necessarily going to hook into the scheduler correctly.

However, I think your general idea makes sense. I'll think about it more this week. I might write some test code for epoll to see exactly how it behaves in the case of close, as I'm not sure the documentation is clear enough as to what to expect.