cc @reem @dwrensha @dpc @rrichardson @Hoverbear @tailhook @zonyitoo @jamwt @polyfractal @geofft @wycats @thijsc @vhbit @fhartwig @alexcrichton @aturon
Thanks for putting this together, @carllerche!
I'm pretty excited about this, definitely thanks for putting it together @carllerche! Allowing essentially arbitrary types to implement Evented sounds like it would indeed very nicely bridge the current Windows implementation and the channel/timer implementation today.

Some thoughts of mine:

- How much of the lock-free manipulation is necessary?
- It'd be nice to remove the wheel aspect from Poll, as it seems a tad complicated (with some clear tradeoffs), but timeouts also seem so inherent to the Selector that I don't think there's really any way we can get away from it.
- Are you thinking of exporting a separate Selector type which is basically just the I/O selector? I'd be fine hiding it at the beginning at least.
- Do you think it'd be appropriate to have Poll use &self instead of &mut self, even given these changes? It'd certainly be nice to enable multi-threaded event loops, but that may not necessarily be required.

Overall this all looks really great to me, though, I'm quite excited to see how it pans out!
Oh yeah, those are great changes. And I agree with one of the points by @alexcrichton, which is to use &self instead of &mut self to make modifying Poll thread-safe.
Some APIs explained here seem internal to mio, so excuse me if I missed or misunderstood anything.
Mioco already has mioco::Evented and allows custom types to implement it, and Timer and Channel are already unified structs implemented on top of Handler::notify & Handler::timeout. mio having them in the first place would simplify mioco and make it more aligned API-wise.
The mio::Handler abstraction could be completely removed, and users could call their own tick after poll.
I think Channels should not implement actual queues. They should only carry an atomic notification, while the queue itself is up to the user. That would allow making different versions of them: fixed circular buffer, growable vec, mpsc/mpmc, etc. Of course mio could provide some standard implementations. That would fix #322.
So :+1: from me.
Replying to @alexcrichton
> How much of the lock-free manipulation is necessary?

The primary reason why I am pretty focused on this is that the current EventLoop implementation on Linux has no (user level) coordination at all. I'm very hesitant to add any mutexes or other heavy coordination code in a hot path. Timers and channels are pretty hot.

So, whether mutexes are acceptable for 1.0 is a tough call. Benchmarking something like this is tricky without a "real world" workload.
> It'd be nice to remove the wheel aspect from Poll

It would be, but I am not sure how. Something has to wake up the Selector when a timeout has expired. If Poll has a wheel, it can be used to figure out the timeout for waiting on the Selector.
The only other alternate strategy that I can think of would be to spawn a timer thread that has a pipe and writes a byte to the pipe when a timeout expires. This would require having a communication channel to the thread as well to register timeouts.
> Are you thinking of exporting a separate Selector type?

It will stay crate private as it is now.
> Do you think it'd be appropriate to have Poll use &self instead of &mut self?

I have been thinking the same, but I didn't bring it up explicitly. I think that it should be possible; Poll should take &self.
@dpc

> mio::Handler abstraction could be completely removed, and users could call their own tick

This is TBD, but I believe that this may be the best option. There isn't too much utility in having EventLoop and Handler around anymore.
> I think Channels should not implement actual queues.

That is the plan. You would do something like mio::Channel::new(queue), where queue implements some "queue" trait. So, you would have the ability to specify the queue that you want with the semantics you want.
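To make that concrete, here is a hypothetical sketch; the trait name and methods are my assumptions, not the actual proposal:

```rust
/// Hypothetical "queue" trait that mio::Channel::new(queue) could accept.
pub trait Queue: Send {
    type Item;

    /// Push an item; hand it back if the queue is full.
    fn push(&self, item: Self::Item) -> Result<(), Self::Item>;

    /// Pop an item, if one is available.
    fn pop(&self) -> Option<Self::Item>;
}
```

A bounded ring buffer, a growable vec-backed queue, or an mpmc queue could all implement it, and mio::Channel::new(my_ring_buffer) would pick the semantics.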
> Timers and channels are pretty hot.

Cool, sounds good to me. I'll browse the PR as well, but I suspect that you're right in that mutexes may just be too heavy for this use case.
> It'd be nice to remove the wheel aspect from Poll
>
> It would be, but I am not sure how.

I agree, and the wheel aspect is really just an implementation detail. From an API perspective you just request an event on a delay, which seems super reasonable!

I also agree that spawning a thread to send notifications is basically just a non-starter.
> I think that it should be possible; Poll should take &self.

The only major API implication I could think of is that you would pass in &mut Events (basically &mut Vec<Event>) which gets populated by the Poll, but otherwise it seems like it should "Just Work".
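A minimal sketch of what that signature might look like (this is my guess at the shape, not code from the proposal):

```rust
use std::io;
use std::time::Duration;

impl Poll {
    /// `&self` receiver: the caller owns the output buffer, so `Poll`
    /// doesn't need `&mut` access just to report readiness.
    pub fn poll(&self, events: &mut Events, timeout: Option<Duration>) -> io::Result<usize> {
        // fill `events` from the selector and return the count
        unimplemented!()
    }
}
```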
The new unified API seems good, but I think the queuing adds a lot of complexity behind the scenes. The proposed alternative seems highly preferable: use timerfd on any platform that supports it, with a compatible implementation on other platforms.
You may not even need a thread and pipe; poll supports a timeout, so just calculate the next timeout and pass that.
Similar mechanisms exist on Windows.
For Channel, on Linux, you can use eventfd. You still need syscalls, but eventfd has far less overhead than a pipe.
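For illustration, a rough sketch of the eventfd pattern being suggested (not mio code; assumes the libc crate, error handling omitted):

```rust
fn main() {
    // Create an eventfd: a single 8-byte kernel counter.
    let efd = unsafe { libc::eventfd(0, libc::EFD_NONBLOCK) };

    // Sender side: push onto the user-level queue, then bump the
    // counter so the selector wakes up. One cheap syscall per send.
    let one: u64 = 1;
    unsafe { libc::write(efd, &one as *const u64 as *const libc::c_void, 8) };

    // Receiver side: after draining the queue, a single read resets
    // the counter (it returns the accumulated count).
    let mut count: u64 = 0;
    unsafe { libc::read(efd, &mut count as *mut u64 as *mut libc::c_void, 8) };
}
```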
> Timers and channels are pretty hot.

Just anecdotal support for this: we (Dropbox) have Rust daemons that routinely push 20+ Gbps between individual machines on a few mio event loops. Every operation must have a timeout, of course, to prevent resource pileup and queuing/congestion collapse, so timers are everywhere.

Based on profiling we've done, the performance of timers and channels is really critical. So I would be concerned about changes that would slow these down in a substantial way.
General reaction from me: I'm concerned about the implementation complexity, but definitely intrigued by being able to control the behavior of the channel more directly.

I also agree that having a single "readiness" API is a cleaner abstraction than special-casing timers and channels.
Great write-up. I really like the idea. Generalizing the different concepts makes everything a lot easier. I agree that the implementation can be quite complex, but the overall gain seems worth it.
Benchmarks would be nice, since this could potentially have a performance impact.
@jamwt @Ticki
I'm very sensitive to the performance requirements of Mio. I should probably add a more explicit performance analysis section.
Specifically for Linux, I do not believe that there will be any change in performance assuming one channel and one timer per event loop (which is the 99% case).
To port an application from Mio 0.5 to Mio with this proposal landed: after creating a Poll, a single channel and a single timer struct will be constructed. Both will have a single associated allocation (the RegistrationNode). However, after that, there will be no further allocations on Mio's part.

At runtime, given usual behavior, the timer under heavy load will only queue readiness once per timer tick (default of 100ms). Under heavy load, the channel will queue readiness at most once per call to Poll::poll, assuming that the channel is drained after every iteration.
So, on Linux, the cost of the additional logic is negligible compared to a heavy socket load. I do think that benchmarks are needed, but they need to be done in a "real world" environment. Synthetic benchmarks in this case will not be very helpful as, by design, the completion queue on Linux is not in the hot path.

(On a side note, back to @alexcrichton's point about using a Mutex initially: it may be OK as long as all fast path guards use an atomic read.)
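A sketch of the pattern that phrase describes (names and layout are my own, not mio's): an atomic flag is checked first, so the common case of an already-queued node never touches the Mutex.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicBool, Ordering};

struct Node {
    queued: AtomicBool,
    token: usize,
}

struct ReadyList {
    pending: Mutex<Vec<usize>>,
}

impl ReadyList {
    fn enqueue(&self, node: &Node) {
        // Fast path: an atomic read guards the lock.
        if node.queued.load(Ordering::Acquire) {
            return;
        }
        // Slow path: claim the flag, then take the lock once.
        if !node.queued.swap(true, Ordering::AcqRel) {
            self.pending.lock().unwrap().push(node.token);
        }
    }
}
```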
@joshtriplett Regarding eventfd, definitely.
Regarding code complexity, I want to emphasize strongly that there is no way to remove this code entirely from Mio. It will always have to exist at the very least for Windows. The question is whether or not it should be provided to all platforms.
Also, to all: I am very interested in your opinions about what to do with EventLoop.
I very much like the idea of removing EventLoop in favor of Poll. I think this will make the library more approachable for beginners, because the interface will more closely resemble system interfaces with which they may already be familiar.
I agree with @dwrensha about removing EventLoop as well. It seems "more appropriate" in terms of getting closer to what the underlying APIs are, and it also sidesteps questions about naming/configuration.
I also agree with @dwrensha that using Poll directly seems more straightforward. Anyone who wants something like this can just create it like you said. I'm down for removing EventLoop and Handler.
> @joshtriplett Regarding eventfd, definitely.

Awesome.
> Regarding code complexity, I want to emphasize strongly that there is no way to remove this code entirely from Mio. It will always have to exist at the very least for Windows. The question is whether or not it should be provided to all platforms.
I'm not entirely sure about that. I think you can handle all of those cases on Windows using WaitForMultipleObjects in place of poll/epoll. With a socket, you can call WSAEventSelect to get an event you can pass to WaitForMultipleObjects. For a timeout, you can either use the built-in timeout parameter in WaitForMultipleObjects or create a waitable timer object. And for channels, a semaphore should work.
Is there some case those can't cover?
Looks good @carllerche. I think removing EventLoop and Handler makes sense as well, and just having Poll with the ready events.
I have a few thoughts:

- I'm fine with dropping EventLoop and Handler in favor of 20 lines of custom code (I believe most direct users of mio would be libraries like rotor and mioco anyway).

Overall, I think poll(timeout) and an Awakener (eventfd) are clean and good enough abstractions to build an event loop on top of. And the proposed API creates a temptation to allocate a channel/timer per state machine, which is not the goal, as far as I understand.
Sorry, I'm not familiar with Windows code at all. But the current Poll on Linux is very simple, and I would prefer to keep it as simple as it is now.
I don't know if "one timer per event loop" is really 99% of use cases. In mioco each coroutine can use one or more timers. Mioco has a test running a million coroutines (each sleeping for a while on its timer) and it works just fine, so I hope such a scenario won't suffer in the new design. I guess it's the same as @tailhook's point 3. It's possible to implement another layer of timer wheel in mioco, but I'd be happy to avoid it.

Also, I'd like to make sure there are no fixed-size-only designs and that growable options can still be supported. E.g. in the above test the wheel size needs to be adjusted, and I was going to file an issue asking for a "just grow the time wheel if needed" setting. I understand the point of "no allocations for performance", but IMO growing instead of crashing is a much better proposition for anything except embedded applications. I'm not sure whether the proposed changes affect any of this, so I'm just making sure.
@dpc To be clear, right now EventLoop is hard coded to 1 timer. Each timer can handle a very large number of in-flight timeouts.

Another advantage of this proposal is that it exposes the primitives, so that if the default mio timer does not suit the requirements, an alternate timer implementation can be built.
@dpc Also, I disagree strongly about "growing vs crashing". A well designed network application will think about limits. A system that "grows" has a hard limit, but it is undefined and will be hit at an unpredictable time. We can agree to disagree, but in my experience it is always better to be explicit about limits when building any robust system.
@carllerche You're probably right about network applications. But networking is not the only area where mio/mioco will be used. In fact, I started mioco because I wanted a more convenient way to handle unix pipes in colerr. Mio will be a standard for building any Rust io-driven code, so it needs to accommodate any requirements. Let's say someone builds the logging part of backup software with mio/mioco. What should the fixed values be? I don't want my backups to fail in 10 years because my system now has 128x more data to back up, on a machine with 8x more threads and 32x more ram, and some fixed values somewhere in the code are too small now.

So if possible I'd like to pass the choice of behavior to the end user. And wherever mio uses fixed resources and panics when it runs out of them, I'd be happy to see a flag that makes it allocate more instead.
> @dpc Also, I disagree strongly about "growing vs crashing". A well designed network application will think about limits. A system that "grows" has a hard limit, but it is undefined and will be hit at an unpredictable time. We can agree to disagree, but in my experience it is always better to be explicit about limits when building any robust system.
Maybe this discussion is a little bit off-topic, but it's an important one. On the one hand, I do totally agree with you that any system has a limit. On the other hand, it's unclear how to determine the limit.
Let's pretend that it's easy to determine the limit for the number of connections (the limit that is not configured in mio, but is currently fixed in rotor; not sure about mioco). This limit is just ulimit -n. But how do you know how many timers each connection has?

Fortunately, with the newest design of rotor you have exactly one timer per state machine. So unless you create a state machine per request (or something like that), it's fine to derive the limit. But when there is no such limit (like in mioco, AFAIU), it's just impossible to define. What if you activate another configuration option, and there are more timers per state machine? What if the timer wheel is large enough 99.9% of the time but sometimes it crashes? How do you distinguish whether it is some kind of resource leak (like a never-removed timer), or just too small a [number-of-fsm * number-of-timers-per-fsm] value?

So I second the growing timer queue solution. Usually you'll end up OOM because of running out of memory for buffers, not for timers, I think.
If we're going down to primitives here and leaving higher level abstractions to other applications & libraries, I am inclined to agree that hard limits should also be managed at a higher level, given that there are means to stat the used resources and act accordingly.
Even for general purpose networking applications, fixed values are not good enough. Let's say you build a network daemon (say a p2p-dropbox like syncthing) and distribute it as a binary to the users. How are the fixed values to be picked? Some people will want to use it to sync a couple of files between two systems smaller than Raspberry Pis, and some will be syncing terabytes of data every day in a huge farm of servers.
Like others, I am apprehensive about the level of unsafety in the implementation (extensive use of raw atomic primitives and raw pointers, atomics mixed with thread-unsafe tools like Cell), but I think this can be reduced by internal implementation organization without affecting the public API.
I conceptually like the proposed API, but it's hard to tell in detail; I think the proposal would be clearer if it included some type signatures for the major public methods of the API, for instance Poll::register, the Evented trait, and methods on Registration.
All my other thoughts are basically covered by existing discussion; no need to repeat ourselves.
Well, I think there is a good alternative to this, for Windows support.

If we look closer at a stack "mio > rotor > rotor-stream > rotor-http > basic-http-server" (as an example), we could observe that:

So if we are willing to augment all three of mio, rotor, and rotor-stream, we could have the following benefits:

Another good reason to try this approach is to be able to use another two improvements for applications:

As far as I know, both approaches are somewhat completion-based. And while we may accept mediocre performance on Windows, being able to utilize those two at full speed would be super-great. And the full performance of neither of them is possible by replacing just "mio". On the other hand, I believe (though I can't prove) that it's possible to reach top performance in a userspace network stack and RDMA by enhancing all three (mio, rotor, rotor-stream) with some compile-time options, leaving every protocol on top of them completely unaware of the optimization.

All of this is to say that it may be better to move Windows support to a higher level of abstraction.
@tailhook Sorry, I'm not really following what you are saying.
I've been implementing this proposal on #371 and have been hitting some conceptual issues.
Currently, according to the proposal, Registration is to be used for both updating interest AND setting readiness. Now, I am thinking that those are probably separate concerns. Poll is kind of like a consume/produce queue, but where the value being queued is a readiness notification. Updating interest is a consumer concern, whereas setting readiness is a producer-side concern. It would make sense to want to do those two things on different threads. Having the operations on a single type makes this difficult.

Second, I am having a hard time unifying setting immediate readiness vs. setting delayed readiness (for the timer). Setting immediate readiness is pretty trivial to implement in a way that is Sync. Doing so with delayed readiness is trickier, in part because I can't think of a use case where it makes sense to concurrently set delayed readiness. The semantics of concurrently setting delayed readiness are not clear.

This means that while setting immediate readiness should be concurrent (to implement a channel), setting delayed readiness should not.
The first solution to this that I can think of is to split Registration into three types: Registration (not Sync), SetReadiness (Sync), and SetDelayedReadiness (not Sync).

When getting a registration, there would be two constructors:

- Registration::new(…) -> (Registration, SetReadiness)
- Registration::timed(…) -> (Registration, SetDelayedReadiness)

I believe that this would satisfy the requirements for implementing Timer and Channel as described in the proposal. The downside is that there are more types and you have to decide up front whether you want SetReadiness or SetDelayedReadiness. You can, of course, set immediate readiness with SetDelayedReadiness by setting a delay of 0, and you can also get concurrent access to SetDelayedReadiness by wrapping it in a mutex.
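A hypothetical usage sketch of the split (the constructor arguments are placeholders; only the type names and return shapes come from the comment above):

```rust
use std::thread;

// Consumer half stays with the Poll-owning thread; producer half is Sync.
let (registration, set_readiness) = Registration::new(&poll, token, interest, opts);

// The Sync producer half can be handed to another thread, e.g. the
// sending side of a channel.
thread::spawn(move || {
    set_readiness.set_readiness(EventSet::readable());
});
```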
Thoughts?
Seems very reasonable to split the two types given that the consumer and producer code are likely to be quite different and in different places, and each has different concerns.
I would probably have the Registration constructors return (producer, consumer) rather than (consumer, producer), though.
Also seems like reasonable tradeoffs to me. Out of curiosity, is Registration something that could implement Evented? In theory that's the "thing" you pass to/from the selector with register/reregister/unregister, right?
@alexcrichton interesting thought, I think you are probably correct.
I might be missing something important here, but I strongly believe that there should be no synchronization of any kind in Poll (and mio in general). It should be a single-threaded (!Sync) abstraction over epoll_wait/kevent.

Everything else can be implemented on top of Poll::poll(). For multithreaded apps you can either dispatch actual work to worker threads or create a per-thread event loop. For the "worker threads" case you can even use multiple spsc queues (one per thread), which, given a fast (non-std) spsc queue implementation, will give a huge performance boost. With multiple event loops, synchronization is simply redundant, because there is nothing to synchronize.
You can (and should) dispatch work to a worker thread, but the worker thread needs a method to send the work back to the IO thread.
Any update on this proposal? Is it officially the way forward? If so, is there an ETA?
Does anyone wish to review the code? I know it is a somewhat large PR.
Unify Sockets, Timers, and Channels
Currently, there are two runtime APIs: Poll and EventLoop.

- Poll is the abstraction handling readiness for sockets and waiting for events on these sockets.
- EventLoop is a wrapper around Poll providing a timeout API and a cross thread notification API, as well as the loop / reactive (via Handler) abstraction.

I am proposing to extract the timer and the notification channel features into standalone types that implement Evented and thus can be used directly with Poll. For example:
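(The original example did not survive formatting; the following is my reconstruction of its shape, with the registration details and the queue argument assumed.)

```rust
let poll = Poll::new().unwrap();

// A timer and a notification channel, created as plain values...
let timer = mio::Timer::new();
let (tx, rx) = mio::Channel::new(some_queue);

// ...and registered with `Poll` exactly like sockets.
poll.register(&timer, Token(0), EventSet::readable(), PollOpt::edge()).unwrap();
poll.register(&rx, Token(1), EventSet::readable(), PollOpt::edge()).unwrap();
```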
The insight is that a lot of the code that is currently Windows specific is useful to all platforms. Stabilizing the API and providing it to all platforms allows implementing Evented for arbitrary types.

Advantages
Disadvantages
The primary disadvantage that I can think of is that the code path around timers & the notification channel becomes slightly more complicated. I don't believe that the change would have a meaningful performance impact.
There is also additional code complexity for all platforms. However, this code complexity already exists for Windows.
Behavior
An Evented would mirror the behavior of a socket registered with epoll. Specifically, in a single threaded environment:

- Readiness that has been set is returned on the next call to poll.
- Poll is permitted to fire off spurious readiness events, except if the value has been dropped.

In the presence of concurrency, specifically readiness being modified on a different thread than Poll, a best effort is made to preserve these semantics.

Implementation
This section will describe how to implement a custom Evented type as well as Mio's internals to handle it. For simplicity and performance, custom Evented types will only be able to be registered with a single Poll.

It is important to note that the implementation is not intended to replace FD polling on epoll & kqueue. It is meant to work in conjunction with the OS's event queue to support types that cannot be implemented using a socket or other system type that is compatible with the system's event queue.
Readiness Queue
Poll will maintain an internal readiness queue, represented as a linked list. The linked list head pointer is an AtomicPtr. All of the nodes in the linked list are owned by the Poll instance.

The type declarations below are for illustration only. The actual implementations will have some additional memory safety requirements.
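```rust
// Illustrative only; the original declarations were lost in formatting,
// so the exact fields here are assumptions.
use std::sync::atomic::{AtomicPtr, AtomicUsize};

struct ReadinessQueue {
    // Head of the MPSC linked list of nodes with pending readiness.
    head: AtomicPtr<ReadinessNode>,
}

struct ReadinessNode {
    // Next node in the pending list.
    next: AtomicPtr<ReadinessNode>,
    // Packed readiness bits plus the "queued" flag / delay, updated atomically.
    state: AtomicUsize,
    token: Token,
}
```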
Registration
When a MyEvented value is registered with the event loop, a new Registration value is obtained:
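(Sketch; the original snippet is missing and this signature is an assumption.)

```rust
impl Evented for MyEvented {
    fn register(&mut self, poll: &mut Poll, token: Token,
                interest: EventSet, opts: PollOpt) -> io::Result<()> {
        // Obtain a Registration wired into this Poll's readiness queue.
        self.registration = Some(Registration::new(poll, token, interest, opts));
        Ok(())
    }
}
```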
Registration will add the internal EventSet::dropped() event to the interest.

Re-registration
A Registration's interest & PollOpt can be changed by calling Registration::update:
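(Sketch with an assumed signature and field layout; the original snippet is missing.)

```rust
impl Registration {
    /// Update interest and options. Takes a `&Poll` purely as proof
    /// that the caller is on the thread that owns the `Poll`.
    pub fn update(&self, _poll: &Poll, interest: EventSet, opts: PollOpt) {
        // Plain (non-atomic) stores are fine here: only the Poll-owning
        // thread can reach this code.
        self.inner.interest.set(interest);
        self.inner.opts.set(opts);
    }
}
```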
The Poll reference will not be used, but ensures that update is only called from a single thread (the thread that owns the Poll reference). This allows safe mutation of interest and opts without synchronization primitives. Registration will add the internal EventSet::dropped() event to the interest.

Triggering readiness notifications
Readiness can be updated using Registration::set_readiness and Registration::unset_readiness. These can be called concurrently.

- set_readiness adds the given events to the existing Registration readiness.
- unset_readiness subtracts the given events from the existing Registration readiness.

Registration::set_readiness ensures that the registration node is queued for processing.

Delaying readiness
In order to support timeouts, Registration has the ability to schedule readiness notifications using Registration::delay_readiness(events, timeout).

There is a big caveat: there is no precise timing guarantee. A delayed readiness event could be triggered much earlier than requested. Also, the readiness timer is coarse grained, so by default it will be rounded to 100ms or so. The one guarantee is that the event will be triggered no later than the requested timeout plus the duration of a timer tick (100ms by default).
Queuing Registration for processing

First, atomically update Registration.queued: attempt to set the MSB, and check the current delay value. If the requested delay is less than the current one, update the delayed portion of queued.

If the MSB was successfully set, then the current thread is responsible for queuing the registration node (pseudocode below):
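(The original pseudocode is missing; this is a sketch of the usual lock-free list push, with names assumed.)

```rust
// Push `node` (a *mut ReadinessNode) onto the MPSC readiness list
// with a CAS loop.
loop {
    let head = queue.head.load(Ordering::Acquire);
    unsafe { (*node).next.store(head, Ordering::Relaxed) };
    if queue.head.compare_and_swap(head, node, Ordering::AcqRel) == head {
        break;
    }
}
```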
Dropping Registration

Processing a drop is handled by setting readiness to an internal Dropped event:
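(Sketch; the original snippet is missing.)

```rust
impl Drop for Registration {
    fn drop(&mut self) {
        // Queue one final readiness event so Poll can release the node.
        self.set_readiness(EventSet::dropped());
    }
}
```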
The Registration value itself does not own any data, so there is nothing else to do.

Polling

On Poll::poll() the following happens:

- Reset the events on self.
- Atomically take ownership of the readiness queue (see the sketch below).
- The dequeued nodes are processed.
- Process all delayed readiness nodes that have reached their timeout. The code for this is similar to the current timer code.
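(Sketch of the missing snippet: taking ownership of the queue is a single atomic swap of the head pointer; names assumed.)

```rust
use std::ptr;
use std::sync::atomic::Ordering;

// Detach the whole pending list in one step; new readiness lands on a
// fresh (empty) list while we walk this one.
let mut node = self.readiness_queue.head.swap(ptr::null_mut(), Ordering::AcqRel);

while !node.is_null() {
    // Translate the node's readiness into an Event pushed onto
    // `self.events`, then advance.
    node = unsafe { (*node).next.load(Ordering::Acquire) };
}
```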
Integrating with Selector

The readiness queue described above is not meant to replace socket notifications on epoll / kqueue / etc. It is to be used in conjunction with them.
To handle this, Poll's ReadinessQueue will be able to wake up the selector. This will be implemented in a similar fashion to the current channel implementation: a pipe will be used to force the selector to wake up.

The full logic of poll will look something like:
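(The original snippet is missing; this sketch reflects the description above, with all names assumed.)

```rust
impl Poll {
    fn poll(&mut self, events: &mut Events, mut timeout: Option<Duration>) -> io::Result<usize> {
        // Never sleep past the next delayed readiness deadline.
        if let Some(delay) = self.readiness_queue.next_delay() {
            timeout = Some(timeout.map_or(delay, |t| cmp::min(t, delay)));
        }

        // Wait for OS events; the readiness queue can interrupt the
        // wait early via its wakeup pipe.
        self.selector.select(events, timeout)?;

        // Append readiness events produced by custom Evented types.
        self.readiness_queue.drain_into(events);

        Ok(events.len())
    }
}
```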
Implementing mio::Channel

Channel is an mpsc queue such that when messages are pushed onto the channel, Poll is woken up and returns a readable readiness event for the Channel. The specific queue will be supplied on creation of Channel, allowing the user to choose the behavior around allocation and capacity. Channel will look something like the sketch below, which also shows what happens when a new message is sent over the channel:
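(Both original snippets are missing; this is a reconstruction with assumed names, where Q is the user-supplied queue.)

```rust
struct Channel<Q> {
    queue: Q,
    registration: Registration,
}

impl<Q: Queue> Channel<Q> {
    fn send(&self, msg: Q::Item) -> Result<(), Q::Item> {
        // Push onto the user-supplied queue...
        self.queue.push(msg)?;
        // ...then make Poll return a readable event for this channel.
        self.registration.set_readiness(EventSet::readable());
        Ok(())
    }
}
```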
When readiness is set, Poll will wake up with a readiness notification. The user can now "poll" off of the channel. The implementation of poll is something like:
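(Sketch; the original snippet is missing.)

```rust
impl<Q: Queue> Channel<Q> {
    fn poll(&self) -> Option<Q::Item> {
        let msg = self.queue.pop();
        if self.queue.is_empty() {
            // Drained: clear readiness until the next send.
            self.registration.unset_readiness(EventSet::readable());
        }
        msg
    }
}
```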
Implementing Timer

Timer is a delay queue. Messages are pushed onto it with a delay, after which the message can be "popped" from the queue. It is implemented using a hashed wheel timer strategy, which is ideal in situations where a large number of timeouts are required and the timer can use coarse precision (by default, 100ms ticks).

The implementation is fairly straightforward. When a timeout is requested, the message is stored in the Timer implementation and Registration::delay_readiness is called with the timeout. There are some potential optimizations, but those are out of scope for this proposal.
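(A sketch of that flow, with the wheel API assumed.)

```rust
impl<T> Timer<T> {
    fn set_timeout(&mut self, delay: Duration, msg: T) -> Timeout {
        // Store the message in the hashed wheel...
        let timeout = self.wheel.insert(delay, msg);
        // ...and schedule a delayed readiness notification for it.
        self.registration.delay_readiness(EventSet::readable(), delay);
        timeout
    }
}
```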
Windows

The readiness queue described in this proposal would replace the current Windows specific implementation. The proposed implementation would be more efficient, as it avoids locking and uses lighter weight data structures (mostly linked lists vs. vecs).
Outstanding questions
The biggest outstanding question is what to do about EventLoop. If this proposal lands, then EventLoop becomes a very light shim around Poll that dispatches events to the appropriate handler function.

The entire implementation would look something like:
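(The original snippet is missing; this sketch assumes the mio 0.5 Handler::ready signature and a `run` flag on EventLoop.)

```rust
impl<H: Handler> EventLoop<H> {
    pub fn run(&mut self, handler: &mut H) -> io::Result<()> {
        while self.run {
            let cnt = self.poll.poll(&mut self.events, None)?;
            for i in 0..cnt {
                // `Event` is Copy, so pull it out before handing
                // `self` to the handler.
                let event = self.events.get(i).unwrap();
                handler.ready(self, event.token(), event.kind());
            }
        }
        Ok(())
    }
}
```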
It will also not be possible to maintain API compatibility. Handler::notify and Handler::timeout will no longer exist, as EventLoop does not know the difference between those two types and other Evented types that have notifications called through ready.

The options are:

- Update EventLoop to follow the new API and keep the minimal implementation.
- Remove EventLoop and make Poll the primary API.
- Build out a richer EventLoop that accepts allocations (though this would be post 1.0).

Alternatives
It is possible to implement Timer and Channel as standalone types without having to implement the readiness queue. For Timer, it would require using timerfd on Linux and a timer thread on other platforms. The disadvantage here is minor for Linux, as syscalls can be reduced significantly by only using timerfd to track the next timeout in the Timer vs. every timeout in the Timer.

However, on platforms that don't have timerfd available, a polyfill will be needed. This can be done by creating a pipe and spawning a thread. When a timeout is needed, a request is sent to the thread. The thread writes a byte to the pipe after the timeout has expired. This has overhead, but again it can be amortized by only using the thread/pipe combo for the next timeout in the Timer vs. every timeout. Though, there may be some complication with this amortization when registering the Timer using level triggered notifications.

On the other hand, for Channel, a syscall would be needed for each message enqueued and dequeued. The implementation would be to have a pipe associated with the Channel. Each time a message is enqueued, write a byte to the pipe. Whenever a message is dequeued, read a byte.
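(A rough sketch of that pipe-based alternative, not mio code; assumes the libc crate and a hypothetical MyQueue, error handling omitted.)

```rust
// One pipe per Channel: `write_fd` is held by senders, `read_fd` is
// registered with the selector.
fn send<T>(queue: &MyQueue<T>, write_fd: libc::c_int, msg: T) {
    queue.push(msg);
    // One syscall per message enqueued.
    unsafe { libc::write(write_fd, [1u8].as_ptr() as *const libc::c_void, 1) };
}

fn recv<T>(queue: &MyQueue<T>, read_fd: libc::c_int) -> Option<T> {
    let mut buf = [0u8; 1];
    // One syscall per message dequeued.
    unsafe { libc::read(read_fd, buf.as_mut_ptr() as *mut libc::c_void, 1) };
    queue.pop()
}
```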