tokio-rs / mio

Metal I/O library for Rust.
MIT License

Proposal: Unify Sockets, Timers, and Channels #360

Closed carllerche closed 8 years ago

carllerche commented 8 years ago

Unify Sockets, Timers, and Channels

Currently, there are two runtime APIs: Poll and EventLoop. Poll is the abstraction handling readiness for sockets and waiting for events on those sockets. EventLoop is a wrapper around Poll providing a timeout API and a cross-thread notification API, as well as the loop / reactor (via Handler) abstraction.

I am proposing to extract the timer and the notification channel features into standalone types that implement Evented and thus can be used directly with Poll. For example:

let timer = mio::Timer::new();
let poll = mio::Poll::new();

poll.register(&timer, Token(0), EventSet::readable(), PollOpt::edge());

timer.timeout("hello", Duration::from_millis(1_000));

poll.poll();

The insight is that a lot of the code that is currently Windows-specific is useful to all platforms. Stabilizing the API and providing it on all platforms allows implementing Evented for arbitrary types.

Advantages

// For example, an existing mpsc channel could be wrapped and registered
// with Poll directly:
let (tx, rx) = mio::Channel::new(mpsc::channel());
poll.register(&rx, Token(0), EventSet::readable(), PollOpt::edge());

Disadvantages

The primary disadvantage that I can think of is that the code paths around timers and the notification channel become slightly more complicated. I don't believe that the change would have a meaningful performance impact.

There is also additional code complexity for all platforms. However, this code complexity already exists for Windows.

Behavior

An Evented would mirror the behavior of a socket registered with epoll when used in a single-threaded environment.

In the presence of concurrency, specifically readiness being modified on a different thread than Poll, a best effort is made to preserve these semantics.

Implementation

This section describes how to implement a custom Evented type, as well as the Mio internals that handle it. For simplicity and performance, custom Evented types will only be able to be registered with a single Poll instance.

It is important to note that the implementation is not intended to replace FD polling on epoll & kqueue. It is meant to work in conjunction with the OS's event queue to support types that cannot be implemented using a socket or other system type that is compatible with the system's event queue.

Readiness Queue

Poll will maintain an internal readiness queue, represented as a linked list. The linked list head pointer is an AtomicPtr. All of the nodes in the linked list are owned by the Poll instance.

The type declarations are for illustration only. The actual implementations will have some additional memory safety requirements.

struct Poll {
    readiness_queue: Arc<PollReadinessQueue>,
}

struct PollReadinessQueue {
    // All readiness nodes owned by the `Poll` instance. When the `Poll`
    // instance is freed, the list is walked and each Arc ref count is
    // decremented.
    head_all_nodes: Box<ReadinessNode>,

    // linked list of nodes that are pending some processing
    head_readiness: AtomicPtr<ReadinessNode>,

    // Hashed wheel timer for delayed readiness notifications
    readiness_wheel: Vec<AtomicPtr<ReadinessNode>>,
}

struct ReadinessNode {
    // Next node in ownership tracking queue
    next_all_nodes: Box<ReadinessNode>,
    // Used when the node is queued in the readiness linked list OR the
    // linked list for a hashed wheel slot.
    next_readiness: *mut ReadinessNode,
    // The Token used to register the `Evented` with `Poll`. This can change,
    // but only by calling `Poll` functions, so there will be no concurrency.
    token: Token,
    // The set of events to include in the notification on next poll
    events: AtomicUsize,
    // Tracks if the node is queued for readiness using the MSB, the
    // rest of the usize is the readiness delay.
    queued: AtomicUsize,
    // Both interest and opts can be mutated
    interest: Cell<EventSet>,
    // Poll opts
    opts: Cell<PollOpt>,
}

// Implements `Sync`, aka all functions are safe to call concurrently
struct Registration {
    node: *mut ReadinessNode,
    queue: Arc<PollReadinessQueue>,
}

struct MyEvented {
    registration: Option<Registration>,
}

Registration

When a MyEvented value is registered with the event loop, a new Registration value is obtained:

my_evented.registration = Some(Registration::new(poll, token, interest));

Registration will add the internal EventSet::dropped() event to the interest.

Re-registration

A Registration's interest & PollOpt can be changed by calling Registration::update:

// poll: &Poll
my_evented.registration.as_ref().unwrap()
    .update(poll, interest, opts);

The Poll reference will not be used but will ensure that update is only called from a single thread (the thread that owns the Poll reference). This allows safe mutation of interest and opts without synchronization primitives.

Registration will add the internal EventSet::dropped() event to the interest.

Triggering readiness notifications

Readiness can be updated using Registration::set_readiness and Registration::unset_readiness. These can be called concurrently. set_readiness adds the given events to the existing Registration readiness. unset_readiness subtracts the given events from the existing Registration readiness.

my_evented.registration.as_ref().unwrap().set_readiness(EventSet::readable());
my_evented.registration.as_ref().unwrap().unset_readiness(EventSet::readable());

Registration::set_readiness ensures that the registration node is queued for processing.

Delaying readiness

In order to support timeouts, Registration has the ability to schedule readiness notifications using Registration::delay_readiness(events, timeout).
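For example (the exact method signature is an assumption, mirroring the set_readiness examples above):

// Hypothetical usage: request a readable event roughly one second from now.
// The event may fire earlier, and is rounded to the timer tick granularity.
my_evented.registration.as_ref().unwrap()
    .delay_readiness(EventSet::readable(), Duration::from_millis(1_000));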

There is a big caveat: there is no precise timing guarantee. A delayed readiness event could be triggered much earlier than requested. Also, the readiness timer is coarse-grained, so delays will by default be rounded to roughly 100ms. The one guarantee is that the event will be triggered no later than the requested timeout plus the duration of a timer tick (100ms by default).

Queuing Registration for processing

First, atomically update Registration.queued. Attempt to set the MSB. Check the current delay value. If the requested delay is less than the current, update the delayed portion of queued.

If the MSB was successfully set, then the current thread is responsible for queuing the registration node (pseudocode):

loop {
    let ptr = PollReadinessQueue.head_readiness.get();
    ReadinessNode.next_readiness = ptr;

    // compare_and_swap returns the previous head; the push only succeeded
    // if the head was still `ptr`.
    if PollReadinessQueue.head_readiness.compare_and_swap(ptr, &ReadinessNode) == ptr {
        return;
    }
}
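A slightly more concrete (still illustrative) version of that push, written against std::sync::atomic and using the field names from the type declarations above:

use std::sync::atomic::Ordering;

// Sketch of the lock-free push onto the readiness list. `queue` is the
// `PollReadinessQueue` and `node` points at the node being queued.
unsafe fn push_readiness(queue: &PollReadinessQueue, node: *mut ReadinessNode) {
    let mut head = queue.head_readiness.load(Ordering::Acquire);

    loop {
        // Point the new node at the current head before trying to install it.
        (*node).next_readiness = head;

        match queue.head_readiness.compare_exchange(
            head, node, Ordering::AcqRel, Ordering::Acquire)
        {
            // The node is now at the front of the readiness list.
            Ok(_) => return,
            // Another thread pushed first; retry with the updated head.
            Err(current) => head = current,
        }
    }
}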

Dropping Registration

Processing a drop is handled by setting readiness to an internal Dropped event:

fn drop(&mut self) {
    self.registration.as_ref().unwrap()
        .set_readiness(EventSet::dropped());
}

The Registration value itself does not own any data, so there is nothing else to do.

Polling

On Poll::poll() the following happens:

Reset the events on self

self.events.clear();

Atomically take ownership of the readiness queue:

let ready_nodes = PollReadinessQueue.head_readiness.swap(ptr::null_mut());

The dequeued nodes are processed.

for node in ready_nodes {
    // Mask the readiness info by the node's interest. This is needed to
    // support concurrent setting of readiness. Another thread may not
    // be aware of the latest interest value.
    let mut events = node.events.get() & node.interest;

    // Used to read the delay component of `Registration::queued`.
    let delay;

    if opts.is_edge() || events.is_empty() {
        // If the registration is edge, the node is always dequeued. If
        // it is level, we only dequeue the event when there are no
        // events (aka, no readiness). By not dequeuing the event it will
        // be processed again next call to `poll`
        delay = unset_msb_and_read_delay_component(&node.queued);

        // Reload the events to ensure that we don't "lose" any
        // readiness notifications. Remember, it's ok to have
        // spurious notifications. 
        events = node.events.get() | node.interest;
    } else if !events.is_drop() {
        // Push the node back into the queue. This is done via a compare
        // and swap on `readiness_head`, pushing the node back to the
        // front.
        prepend(&ready_nodes, node);

        delay = read_delay_component(&node.queued);
    }

    if delay > 0 {
        node.update_delay_in_hashed_wheel(delay);
    } else {
        // The event will be fired immediately. If the node is currently
        // being delayed, remove it from the hashed wheel.
        if node.is_currently_in_hashed_wheel() {
            node.remove_from_hashed_wheel();
        }

        if events.is_drop() {
            // The registration is dropped. Unlink the node, freeing the
            // memory.
            node.unlink();
            continue;
        }

        if !events.is_empty() {
            // Track the event
            self.events.push_event(node.token, events);
        }
    }

}

The next step is to process all delayed readiness nodes that have reached their timeout. The code for this is similar to the current timer code.
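As a rough illustration of the hashed wheel bookkeeping (the tick duration and slot count here are assumptions, not part of the proposal):

// Map a delay onto a wheel slot, assuming a 100ms tick and a fixed number of
// slots. Delays are rounded up to whole ticks, so an event never fires earlier
// than the tick granularity allows.
const TICK_MS: u64 = 100;

fn wheel_slot(current_tick: u64, delay_ms: u64, num_slots: usize) -> usize {
    let ticks = (delay_ms + TICK_MS - 1) / TICK_MS;
    ((current_tick + ticks) % num_slots as u64) as usize
}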

Integrating with Selector

The readiness queue described above is not intended to replace socket notifications on epoll / kqueue / etc... It is to be used in conjunction with them.

To handle this, PollReadinessQueue will be able to wake up the selector. This will be implemented in a similar fashion to the current channel implementation: a pipe will be used to force the selector to wake up.

The full logic of poll will look something like:

let has_pending = !readiness_queue.is_empty();

if has_pending {
    // Original timeout value is passed to the function...
    timeout = 0;
}

// Poll selector
selector.poll(&mut self.events, timeout);

// Process custom evented readiness queue as specified above.

Implementing mio::Channel

Channel is an mpsc queue such that, when messages are pushed onto the channel, Poll is woken up and returns a readable readiness event for the Channel. The specific queue will be supplied on creation of the Channel, allowing the user to choose the behavior around allocation and capacity.

Channel will look something like:

struct Channel<Q> {
    queue: Q,

    // Poll registration
    registration: Option<Registration>,

    // Tracks the number of pending messages.
    pending: AtomicUsize,
}
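The queue abstraction itself is left open by the proposal; one hypothetical shape for the bound on Q (names here are illustrative only) could be:

// Hypothetical trait for the user-supplied queue backing a Channel.
trait MessageQueue<T>: Send + Sync {
    // Push a message, handing it back if the queue is at capacity.
    fn push(&self, msg: T) -> Result<(), T>;

    // Pop the next message, if any.
    fn poll(&self) -> Option<T>;
}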

When a new message is sent over the channel:

self.queue.push(msg);

let prev = self.pending.fetch_add(1);

if prev == 0 {
    // set readiness
    self.registration.as_ref().unwrap()
        .set_readiness(EventSet::readable());
}

When readiness is set, Poll will wake up with a readiness notification. The user can now "poll" off of the channel. The implementation of poll is something like:

self.queue.poll().map(|msg| {
    let first = self.pending.get();

    if first == 1 {
        self.registration.as_ref().unwrap()
            .unset_readiness(EventSet::readable());
    }

    let second = self.pending.fetch_sub(1);

    if first == 1 && second > 0 {
        // There still are pending messages, reset readiness
        self.registration.as_ref().unwrap()
            .set_readiness(EventSet::readable());
    }

    msg
})

Implementing Timer

Timer is a delay queue. Messages are pushed onto it with a delay, after which the message can be "popped" from the queue. It is implemented using a hashed wheel timer strategy, which is ideal in situations where a large number of timeouts is required and the timer can use coarse precision (by default, 100ms ticks).

The implementation is fairly straightforward. When a timeout is requested, the message is stored in the Timer implementation and Registration::delay_readiness is called with the timeout. There are some potential optimizations, but those are out of scope for this proposal.
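For illustration, a hypothetical sketch of how a timeout request could be layered on top of Registration::delay_readiness (the slab field and Timeout handle are assumptions, not part of the proposal):

impl<T> Timer<T> {
    fn timeout(&mut self, value: T, delay: Duration) -> Timeout {
        // Store the message so it can be popped once the delay elapses.
        let key = self.slab.insert(value);

        // Ask the readiness queue for a readable event after `delay`.
        self.registration.as_ref().unwrap()
            .delay_readiness(EventSet::readable(), delay);

        Timeout { key }
    }
}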

Windows

The readiness queue described in this proposal would replace the current Windows-specific implementation. The proposed implementation would be more efficient as it avoids locking and uses lighter-weight data structures (mostly linked lists instead of vecs).

Outstanding questions

The biggest outstanding question is what to do about EventLoop. If this proposal lands, then EventLoop becomes a very light shim around Poll that dispatches events to the appropriate handler function.

The entire implementation would look something like:

pub fn run(&mut self, handler: &mut H) -> io::Result<()> {
    self.run = true;

    while self.run {
        self.poll.poll();

        for event in self.poll.events() {
            handler.ready(self, event.token(), event.kind());
        }

        handler.tick(self);
    }

    Ok(())
}

It will also not be possible to maintain API compatibility. Handler::notify and Handler::timeout will no longer exist, as EventLoop can no longer distinguish those two types from other Evented types; all notifications are dispatched through ready.

The options are:

It is possible to implement Timer and Channel as standalone types without having to implement the readiness queue. For Timer, it would require using timerfd on Linux and a timer thread on other platforms. The disadvantage here is minor on Linux, as syscalls can be reduced significantly by only using timerfd to track the next timeout in the Timer vs. every timeout in the Timer.

However, on platforms that don't have timerfd available, a polyfill will be needed. This can be done by creating a pipe and spawning a thread (sketched below). When a timeout is needed, send a request to the thread; the thread writes a byte to the pipe after the timeout has expired. This has overhead, but again it can be amortized by only using the thread/pipe combo for the next timeout in the Timer vs. every timeout. Though, there may be some complication with this amortization when registering the Timer using level-triggered notifications.
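A rough sketch of that polyfill, assuming the libc crate for the pipe write (error handling omitted):

use std::thread;
use std::time::Duration;

// Spawn a thread that writes one byte to the pipe's write end once the
// timeout expires. The read end is registered with the selector, so the
// write makes it readable and wakes up the poll call.
fn schedule_wakeup(write_fd: libc::c_int, timeout: Duration) {
    thread::spawn(move || {
        thread::sleep(timeout);
        let byte = [1u8];
        unsafe {
            libc::write(write_fd, byte.as_ptr() as *const libc::c_void, 1);
        }
    });
}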

On the other hand, for Channel, a syscall would be needed for each message enqueued and dequeued. The implementation would be to have a pipe associated with the Channel. Each time a message is enqueued, write a byte to the pipe; whenever a message is dequeued, read a byte.

carllerche commented 8 years ago

cc @reem @dwrensha @dpc @rrichardson @Hoverbear @tailhook @zonyitoo @jamwt @polyfractal @geofft @wycats @thijsc @vhbit @fhartwig @alexcrichton @aturon

aturon commented 8 years ago

Thanks for putting this together, @carllerche!

alexcrichton commented 8 years ago

I'm pretty excited about this, definitely thanks for putting it together @carllerche! Allowing essentially arbitrary types to implement Evented sounds like it would indeed very nicely bridge the current Windows implementation and the channel/timer implementation today.

Some thoughts of mine:

Overall this all looks really great to me, though, I'm quite excited to see how it pans out!

zonyitoo commented 8 years ago

Oh yeah, those are great changes. And I agree with one of the points by @alexcrichton, which is to use &self instead of &mut self to make modifying Poll thread-safe.

dpc commented 8 years ago

Some APIs explained here seem internal to mio, so excuse me if I missed or misunderstood anything.

Mioco already has mioco::Evented and allows custom types to implement it, and Timer and Channel are already unified structs implemented on top of Handler::notify & Handler::timeout. mio having them in the first place would simplify mioco and make it more aligned API-wise.

mio::Handler abstraction could be completely removed, and users could call their own tick after poll.

I think Channels should not implement actual queues. They should only carry an atomic notification, while the queue itself is up to the user. That would allow making different versions of them: fixed circular buffer, growable vec, mpsc/mpmc, etc. Of course mio could provide some standard implementations. That would fix #322.

So :+1: from me.

carllerche commented 8 years ago

Replying to @alexcrichton

How much of the lock-free manipulation is necessary?

The primary reason why I am pretty focused on this is that the current EventLoop implementation on Linux has no (user level) coordination at all. I'm very hesitant to add any mutexes or other heavy coordination code in a hot path. Timers and channels are pretty hot.

So, whether mutexes are acceptable for 1.0 is a tough call. Benchmarking something like this is tricky without a "real world" work load.

It'd be nice to remove the wheel aspect from Poll

It would be, but I am not sure how. Something has to wake up the Selector when a timeout has expired. If Poll has a wheel, it can be used to figure out the timeout for waiting on Selector.

The only other alternate strategy that I can think of would be to spawn a timer thread that has a pipe and writes a byte to the pipe when a timeout expires. This would require having a communication channel to the thread as well to register timeouts.

Are you thinking of exporting a separate Selector type.

It will stay crate private as it is now.

Do you think it'd be appropriate to have Poll use &self instead of &mut self

I have been thinking the same, but I didn't bring it up explicitly. I think it should be possible for Poll to take &self.

carllerche commented 8 years ago

@dpc

mio::Handler abstraction could be completely removed, and users could call their own tick

This is TBD, but I believe that this may be the best option. There isn't too much utility to having EventLoop and Handler around anymore.

I think Chanels should not implement actual queues.

That is the plan. You would do something like mio::Channel::new(queue) where queue implements some "queue" trait. So, you would have the ability to specify the queue that you want with the semantics you want.

alexcrichton commented 8 years ago

Timers and channels are pretty hot.

Cool, sounds good to me. I'll browse the PR as well, but I suspect that you're right in that mutexes may just be too heavy for this use case.

It'd be nice to remove the wheel aspect from Poll

It would be, but I am not sure how.

I agree, and the wheel aspect is really just an implementation detail. From an API perspective you just request an event on a delay, which seems super reasonable!

I also agree that spawning a thread to send notifications is basically just a non-starter.

I think that it should be possible Poll should take &self.

The only major API implication I could think of is that you would pass in &mut Events (basically &mut Vec<Event>) which gets populated by the Poll, but otherwise it seems like it should "Just Work".

joshtriplett commented 8 years ago

The new unified API seems good, but I think the queuing adds a lot of complexity behind the scenes. The proposed alternative seems highly preferable: use timerfd on any platform that supports it, with a compatible implementation on other platforms.

You may not even need a thread and pipe; poll supports a timeout, so just calculate the next timeout and pass that.
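For instance (hypothetical names, just to illustrate deriving poll's timeout from the nearest pending deadline):

use std::time::Instant;

// `next_deadline` is the earliest pending timer deadline, if any.
let timeout = next_deadline
    .map(|deadline| deadline.saturating_duration_since(Instant::now()));

// Sleep in the selector no longer than the nearest deadline.
selector.poll(&mut events, timeout);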

Similar mechanisms exist on Windows.

joshtriplett commented 8 years ago

For Channel, on Linux, you can use eventfd. You still need syscalls, but eventfd has far less overhead than a pipe.
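For reference, a minimal sketch of an eventfd-based wakeup, assuming the libc crate (error handling omitted):

// Create the eventfd once and register its fd with the selector.
let efd = unsafe { libc::eventfd(0, libc::EFD_CLOEXEC | libc::EFD_NONBLOCK) };

// Sender side: bump the counter; the fd becomes readable.
let one: u64 = 1;
unsafe {
    libc::write(efd, &one as *const u64 as *const libc::c_void, 8);
}

// Receiver side: reading drains the counter and clears readiness.
let mut count: u64 = 0;
unsafe {
    libc::read(efd, &mut count as *mut u64 as *mut libc::c_void, 8);
}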

jamwt commented 8 years ago

Timers and channels are pretty hot.

Just anecdotal support for this, we (Dropbox) have rust daemons that push 20+Gbps between individual machines routinely on a few mio event loops. Every operation must have a timeout, of course, to prevent resource pileup and queuing/congestion collapse, so timers are everywhere.

Based on profiling we've done, the performance of timers and channels is really critical. So I would be concerned about changes that would slow these down in a substantial way.

jamwt commented 8 years ago

General reaction from me: I'm concerned about the implementation complexity, but definitely intrigued by being able to control the behavior of the channel more directly.

I also agree having a single "readiness" API is a cleaner abstraction rather than special-casing timers and channels.

ticki commented 8 years ago

Great write-up. I really like the idea. Generalizing the different concepts makes everything a lot easier. I agree that the implementation can be quite complex, but the overall gain seems worth it.

Benchmarks would be nice, since this could potentially have a performance impact.

carllerche commented 8 years ago

@jamwt @Ticki

I'm very sensitive to the performance requirements of Mio. I should probably add a more explicit performance analysis section.

Specifically for Linux, I do not believe that there will be any change in performance assuming one channel and one timer per event loop (which is the 99% case).

To port an application from Mio 0.5 to Mio w/ this proposal landed, after creating a Poll, a single channel and a single timer struct will be constructed. Both will have a single associated allocation (the RegistrationNode). However, after that, there will be no further allocations on Mio's part.

At runtime, given usual behavior, the timer under heavy load will only queue readiness once per timer tick (default of 100ms). Under heavy load, the channel will queue readiness at most once per call to Poll::poll, assuming that the channel is drained after every single iteration.

So, on Linux, the usage of the additional logic is negligible compared to a heavy socket load. I do think that benchmarks are needed, but I think that they need to be done in a "real world" environment. Synthetic benchmarks in this case will not be very helpful as, by design, the completion queue on Linux is not in the hot path.

(On a side note, back to @alexcrichton's point about using a Mutex initially: it may be OK as long as all fast-path guards are using an atomic read.)

@joshtriplett

Regarding eventfd, definitely.

Regarding code complexity, I want to emphasize strongly that there is no way to remove this code entirely from Mio. It will always have to exist at the very least for Windows. The question is whether or not it should be provided to all platforms.

carllerche commented 8 years ago

Also, to all, I am very interested in your opinions about what to do with EventLoop.

dwrensha commented 8 years ago

I very much like the idea of removing EventLoop in favor of Poll. I think this will make the library more approachable for beginners, because the interface will more closely resemble system interfaces with which they may already be familiar.

alexcrichton commented 8 years ago

I agree with @dwrensha about removing EventLoop as well. It seems "more appropriate" in terms of getting closer to what the underlying APIs are, and it also sidesteps questions about naming/configuration.

blaenk commented 8 years ago

I also agree with @dwrensha that using Poll directly seems more straightforward. Anyone who wants something like this can just create it like you said. I'm down for removing EventLoop and Handler.

joshtriplett commented 8 years ago

@joshtriplett

Regarding eventfd, definitely.

Awesome.

Regarding code complexity, I want to emphasize strongly that there is no way to remove this code entirely from Mio. It will always have to exist at the very least for Windows. The question is whether or not it should be provided to all platforms.

I'm not entirely sure about that. I think you can handle all of those cases on Windows using WaitForMultipleObjects in place of poll/epoll. With a socket, you can call WSAEventSelect to get an event you can pass to WaitForMultipleObjects. For a timeout, you can either use the built-in timeout parameter in WaitForMultipleObjects or create a waitable timer object. And for channels, a semaphore should work.

Is there some case those can't cover?

jnicholls commented 8 years ago

Looks good @carllerche. I think removing EventLoop and Handler makes sense as well, and just having Poll with the ready events.

tailhook commented 8 years ago

I have few thoughts:

  1. It would be good to remove EventLoop and Handler in favor of 20 lines of custom code (I believe most direct users of mio would be libraries like rotor and mioco anyway)
  2. The previous point requires exposing timer wheel and notification queue as public APIs
  3. If "one timer per event loop" covers 99% use cases it's ugly to implement that as time wheel (because there will be another time wheel in the application)
  4. Some applications need more precise timers (so could use binary heap or hierarchical timing wheels for implementation)

Overall, I think using poll(timeout) and Awakener (eventfd) are clean and good enough abstractions to build event loop on top. And the proposed API creates a temptation to allocate a channel/timer per state machine, which is not the goal, as far as I understand.

Sorry, I'm not familiar with the Windows code at all. But the current Poll on Linux is very simple. I would prefer to keep it as simple as it is on Linux.

dpc commented 8 years ago

I don't know if "one timer per event loop" is really 99% of use cases. In mioco each coroutine can use one or more timers. Mioco has a test running a million coroutines (each sleeping for a while on its timer) and it works just fine, so I hope such a scenario won't suffer in the new design. I guess it's the same as @tailhook's point 3. It's possible to implement another layer of time wheel in mioco, but I'd be happy to avoid it.

Also, I'd like to make sure there are no fixed-size-only designs and that growable options can still be supported. E.g. in the above test the wheel size needs to be adjusted, and I was going to file an issue to ask for a "just grow the time wheel if needed" setting. I understand the point of "no allocations for performance", but IMO growing instead of crashing is a much better proposition for anything except embedded applications. I'm not sure if the proposed changes affect that, so I'm just making sure.

carllerche commented 8 years ago

@dpc To be clear, right now EventLoop is hard coded to 1 timer. Each timer can handle a very large number of in-flight timeouts.

Another advantage of this proposal is that it exposes the primitives so that, if the default mio timer does not suit the requirements, an alternate timer implementation could be built.

@dpc Also, I disagree strongly about "growing vs crashing". A well-designed network application will think about limits. A system that "grows" has a hard limit, but it is undefined and will be hit at an indefinite time. We can agree to disagree, but in my experience it is always better to be explicit about limits when building any robust system.

dpc commented 8 years ago

@carllerche You're probably right about network applications. But networking is not the only area where mio/mioco will be used. In fact, I started mioco because I wanted a more convenient way to handle unix pipes in colerr. Mio will be a standard for building any Rust io-driven code, so it needs to accommodate any requirements. Let's say someone builds the logging part of backup software with mio/mioco. What should the fixed values be? I don't want my backups to fail in 10 years because my system has 128x more data to back up, on a machine with 8x more threads and 32x more RAM, and some fixed values somewhere in the code are too small now.

So if possible I'd like to pass the choice of behavior to the end user. And wherever mio uses fixed resources and panics when it runs out of them, I'd be happy to see a flag, that makes it allocate more instead.

tailhook commented 8 years ago

@dpc Also, I disagree strongly about "growing vs crashing". A well-designed network application will think about limits. A system that "grows" has a hard limit, but it is undefined and will be hit at an indefinite time. We can agree to disagree, but in my experience it is always better to be explicit about limits when building any robust system.

Maybe this discussion is a little bit off-topic, but it's an important one. On the one hand, I do totally agree with you that any system has a limit. On the other hand, it's unclear how to determine the limit.

Let's pretend that it's easy to determine the limit for the number of connections (a limit that is not configured in mio, but is currently fixed in rotor; not sure about mioco). This limit is just ulimit -n. But how do you know how many timers each connection has?

Fortunately, with the newest design of rotor you have exactly one timer per state machine. So unless you create a state machine per request (or something like that), it's fine to derive the limit. But when there is no such limit (like in mioco, AFAIU), it's just impossible to define. What if you activate another configuration option, and there are more timers per state machine? What if the timer wheel is large enough 99.9% of the time but sometimes it crashes? How do you distinguish whether it is some kind of resource leak (like a never-removed timer), or just too small a [number-of-fsm * number-of-timers-per-fsm] value?

So I second the growing timer queue solution. Usually, you'll end up with OOM because of running out of memory for buffers, not for timers, I think.

jnicholls commented 8 years ago

If we're going down to primitives here and leaving higher-level abstractions to other applications & libraries, I am inclined to agree that hard limits should also be managed at a higher level, given that there are means to stat the used resources and act accordingly.

dpc commented 8 years ago

Even for general-purpose networking applications fixed values are not good enough. Let's say you build a network daemon (say a p2p-dropbox like syncthing) and distribute it as a binary to the users. How are the fixed values to be picked? Some people will want to use it to sync a couple of files between two systems smaller than Raspberry Pis, and some will be syncing terabytes of data every day in a huge farm of servers.

reem commented 8 years ago

Like others, I am apprehensive about the level of unsafety in the implementation (extensive use of raw atomic primitives and raw pointers, atomics mixed with thread-unsafe tools like Cell), but I think this can be reduced by internal implementation organization without affecting the public API.

I conceptually like the proposed API but it's hard to tell in detail; I think the proposal would be clearer if it included some type signatures for the major public methods of the API, for instance Poll::register, the Evented trait, and methods on Registration.

All my other thoughts are basically covered by existing discussion; no need to repeat ourselves.

tailhook commented 8 years ago

Well, I think there is a good alternative to this, for windows support.

If we look closer at the stack "mio > rotor > rotor-stream > rotor-http > basic-http-server" (as an example), we can observe that:

  1. Almost any network protocol could be implemented on top of rotor-stream (let's skip UDP for now for brevity)
  2. The protocol between rotor-stream and rotor-http is basically completion-based
  3. The performance of this layering is decent (i.e. it's comparable to nginx, without any profiling done yet)

So if we are willing to augment all the three mio, rotor, and rotor-stream, we could have the following benefits:

  1. The protocol of rotor-stream makes it possible to know exactly when the application expects some input data, and (sometimes) how much data is required (relevant part of the protocol)
  2. Output data is basically buffered anyway, so we can steal the byte buffer from the stream implementation instead of copying/allocating the buffer again
  3. The Notify queue and Timer wheel may be implemented similarly to the current ones (sans Awakener; I'm not sure how it's expected to work on Windows)
  4. Since this moves all the synchronization to a higher level of abstraction and eliminates some memory allocations/copying, we could use a simpler abstraction, like simply putting the state machine under a mutex, and get similar performance without a lot of unsafe code

Another good reason to try this approach is to be able to use another two improvements for applications:

  1. Userspace network stack (here is a link on how useful it is)
  2. Infiniband (RDMA)

As far as I know both approaches are somewhat completion-based. And while we may accept mediocre performance on Windows, being able to utilize those two at full speed would be super-great. And the full performance of neither of them is achievable by replacing just "mio". On the other hand, I believe (though I can't prove) that it's possible to reach top performance with a userspace network stack and RDMA by enhancing all three (mio, rotor, rotor-stream) with some compile-time options, leaving every protocol on top of them completely unaware of the optimization.

All of this is to say that it may be better to move windows support to a higher level of abstraction.

carllerche commented 8 years ago

@tailhook Sorry, I'm not really following what you are saying.

carllerche commented 8 years ago

I've been implementing this proposal on #371 and have been hitting some conceptual issues.

Currently, according to the proposal, Registration is to be used for both updating interest AND setting readiness. Now, I am thinking that those are probably separate concerns. Poll is kind of like a consume / produce queue, where the value being queued is a readiness notification. Updating interest is a consume-side concern whereas setting readiness is a produce-side concern. It would make sense to want to do those two things on different threads. Having the operations on a single type makes this difficult.

Second, I am having a hard time unifying setting immediate readiness vs. setting delayed readiness (for the timer). Setting immediate readiness is pretty trivial to implement in a way that is Sync. Doing so with delayed readiness is trickier, in part because I can't think of a use case where it makes sense to concurrently set delayed readiness. The semantics of concurrently setting delayed readiness are not clear.

This means that while setting immediate readiness should be concurrent (to implement a channel), setting delayed readiness should not.

The first solution to this that I can think of is to split up Registration into three types: Registration (not Sync), SetReadiness (Sync), and SetDelayedReadiness (not Sync).

When getting a registration, there would be two constructors:

I believe that this would satisfy the requirements for implementing Timer and Channel as described in the proposal. The downside is that there are more types and you have to decide up front if you want SetReadiness or SetDelayedReadiness. You can, of course, set immediate readiness w/ SetDelayedReadiness by setting a delay of 0 and you can also get concurrent access to SetDelayedReadiness by wrapping it in a mutex.
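To make the shape concrete, hypothetical signatures for the split (names, argument lists, and tuple order are all just for discussion) might look like:

// Hypothetical constructors for the split-up types. `Registration` handles
// interest updates (consume side, not Sync); the second element of each tuple
// handles readiness (produce side). Bodies elided.
impl Registration {
    fn new(poll: &Poll, token: Token, interest: EventSet, opts: PollOpt)
        -> (Registration, SetReadiness)
    { unimplemented!() }

    fn new_delayed(poll: &Poll, token: Token, interest: EventSet, opts: PollOpt)
        -> (Registration, SetDelayedReadiness)
    { unimplemented!() }
}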

Thoughts?

reem commented 8 years ago

Seems very reasonable to split the two types given that the consumer and producer code are likely to be quite different and in different places, and each has different concerns.

I would probably have the Registration constructors return (producer, consumer) not (consumer, producer) though.

alexcrichton commented 8 years ago

Also seems like reasonable tradeoffs to me. Out of curiosity, is Registration something that could implement Evented? In theory that's the "thing" you pass to/from the selector with register/reregister/unregister, right?

carllerche commented 8 years ago

@alexcrichton interesting thought, I think you are probably correct.

rozaliev commented 8 years ago

I might be missing something important here, but I strongly believe that there should be no synchronization of any kind in Poll (and mio in general). It should be a single-threaded (!Sync) abstraction over epoll_wait/kevent.

Everything else can be implemented on top of Poll.poll(). For multithreaded apps you can either dispatch actual work to worker threads or create a per-thread event loop. For the "worker threads" case you can even use multiple spsc queues (one per thread), which, given a fast (non-std) spsc queue impl, will give a huge performance boost. For multiple event loops synchronization is just redundant, because there is nothing to synchronize.

carllerche commented 8 years ago

You can (and should) dispatch work to a worker thread, but the worker thread needs a method to send the work back to the IO thread.

tikue commented 8 years ago

Any update on this proposal? Is it officially the way forward? If so, is there an ETA?

carllerche commented 8 years ago

#371 is passing CI now. It includes the custom readiness queue and a channel implementation. I'm hoping to do the timer next but get this merged in as-is.

Does anyone wish to review the code? I know it is a somewhat large PR.