jkarneges / rust-async-bench

The cost of Rust async/await

Result is a bit misleading #3

Open JakkuSakura opened 3 years ago

JakkuSakura commented 3 years ago

The benchmark result just scared me. I was quite afraid of the overhead of async. For me, even 1 us is an eternity. However, I made my own benchmark.

As the result suggests, there is a performance penalty of about 1.5 ns per async/await.

I think there are a few flaws with your method.

jkarneges commented 3 years ago

Hi, thanks for the comment. I took a look at your benchmark, and I think we are measuring different things.

My goal was to understand the overhead of doing I/O. In a real app, doing I/O requires managing poller registrations and making I/O syscalls, and having tasks suspend and wake, which I don't see in your benchmark. Basically I wanted to make sure these things could be handled efficiently, by having poller registrations live across multiple awaits and avoiding I/O calls on objects known to not be ready. Fortunately, this turned out to be true. Nothing about Rust async forces the developer to make extra syscalls compared to a hand-written poll loop.

A fair amount of "accounting" code is needed to achieve this, though. So the other thing I was trying to understand was the minimum cost of such accounting (both for the I/O stuff and the executor/wakers). This is why I made my own executor and fake I/O objects. I am aware of mio, and in fact my fakeio module uses a similar interface. And while I agree tokio is the most mature runtime, I suspect it would have more overhead. My executor is very minimal: no heap, no Arc.
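For readers unfamiliar with what "no heap, no Arc" can mean in practice, here is a minimal sketch using only the standard library. It is not the executor from this repo: `noop_raw_waker`, `block_on_counting` and `CountDown` are hypothetical names, and the loop busy-polls instead of suspending until a wake arrives. It only shows that a future can be pinned on the stack and polled with a waker that allocates nothing.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker whose operations all do nothing: no allocation, no reference count.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(ptr::null(), &VTABLE)
}

/// Polls `fut` in a loop until it completes.
/// Returns the output together with the number of polls that were needed.
pub fn block_on_counting<F: Future>(fut: F) -> (F::Output, usize) {
    // Safe to build: every vtable entry ignores the (null) data pointer.
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    // `pin!` pins the future on the stack, so no Box is required.
    let mut fut = pin!(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}

/// Hypothetical test future: reports Pending `self.0` times, then completes.
pub struct CountDown(pub u32);

impl Future for CountDown {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 == 0 {
            Poll::Ready(())
        } else {
            self.0 -= 1;
            Poll::Pending
        }
    }
}
```

Calling `block_on_counting(async { CountDown(2).await; 7 })` polls the outer future three times: two polls see `Pending`, the third completes.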

I am fairly confident my measurements show the minimum overhead of async I/O. In practice with normal runtimes, the overhead will likely be higher.

JakkuSakura commented 3 years ago

Indeed, we are measuring different things. To make it clear, I would like to make use of tokio for this benchmark and see the performance differences.

jkarneges commented 3 years ago

I like the idea of comparing to tokio. I'd need to think about how to do it though. My implementation runs the I/O reactor and executor in the same thread.

JakkuSakura commented 3 years ago

Tokio runtime has a current thread mode, which can be of use here.

jkarneges commented 3 years ago

Yes, but I don't see a way to specify where to run the reactor code. Basically, I need to run FakeReactor::poll whenever all futures have returned Poll::Pending. It would be nice if tokio had an "on idle" hook or such.

Otherwise, all I can think of is:

1) Find a way to run the reactor as a future within the executor. I don't know if this is possible.

2) Make fakeio thread-safe, and run the reactor in a background thread before starting the single-threaded tokio executor. So there would be exactly 2 threads.

Probably the latter approach is the most practical. It would add some overhead, but it's closer to how tokio is meant to be used.

jkarneges commented 3 years ago

I discovered that the LocalPool executor from futures-rs has a run_until_stalled method, which makes it possible to run a reactor in the same thread. I've added a benchmark for that in the futuresrs branch.
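The control flow this enables (poll tasks until nothing can progress, then run the reactor, then repeat) can be sketched without any external crate. The sketch below does not use futures-rs and is not the code in the futuresrs branch. `ToyReactor`, `WaitReady` and `run_with_reactor` are hypothetical names, and a single pass over all tasks stands in for "run until stalled" because the tasks here depend only on the reactor.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Waker that ignores wake-ups; every task is re-polled on each pass anyway.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for a reactor: `poll` marks all I/O as ready.
#[derive(Default)]
pub struct ToyReactor {
    ready: Cell<bool>,
    polls: Cell<u32>,
}

impl ToyReactor {
    pub fn poll(&self) {
        self.polls.set(self.polls.get() + 1);
        self.ready.set(true);
    }
}

/// Completes once the reactor has reported readiness.
struct WaitReady(Rc<ToyReactor>);

impl Future for WaitReady {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0.ready.get() { Poll::Ready(()) } else { Poll::Pending }
    }
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Runs `n_tasks` tasks and the reactor on one thread.
/// Returns (executor passes, reactor polls).
pub fn run_with_reactor(n_tasks: usize) -> (u32, u32) {
    let reactor = Rc::new(ToyReactor::default());
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    let mut tasks: Vec<Task> = (0..n_tasks)
        .map(|_| {
            let task: Task = Box::pin(WaitReady(Rc::clone(&reactor)));
            task
        })
        .collect();

    let mut passes = 0;
    while !tasks.is_empty() {
        passes += 1;
        // Poll every task once and keep only those still pending.
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
        // Everything left is stalled, so this is the point to run the reactor.
        if !tasks.is_empty() {
            reactor.poll();
        }
    }
    (passes, reactor.polls.get())
}
```

With three tasks, the first pass leaves all of them pending, the reactor runs once, and the second pass completes them.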

The main change to the code I had to make was to use Rc in a bunch of places instead of bare references, since the LocalPool spawner requires futures to have static lifetimes. However, I was able to move all of the Rc constructions to an initialization phase before the benchmarks start, so the overhead of using Rc in the actual benchmarked code should be negligible.
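A small sketch of that pattern, assuming a hypothetical `Stats` type and a `require_static` helper that imitates a spawner's `'static` bound (neither is taken from the repo): the `Rc` is cloned once during setup and moved into the future, so the future borrows nothing from the caller's stack.

```rust
use std::cell::Cell;
use std::future::Future;
use std::rc::Rc;

/// Hypothetical shared state, standing in for something like a stats object.
#[derive(Default)]
pub struct Stats {
    pub events: Cell<u32>,
}

/// Imitates a spawner's bound: the future must not borrow from the caller.
pub fn require_static<F: Future<Output = ()> + 'static>(fut: F) -> F {
    fut
}

/// Builds a task that owns its own handle to the shared state.
/// The reference-count bump happens in the caller, at setup time.
pub fn make_task(stats: Rc<Stats>) -> impl Future<Output = ()> + 'static {
    require_static(async move {
        // Runs only when the task is polled; no Rc is created or cloned here.
        stats.events.set(stats.events.get() + 1);
    })
}
```

Passing `&Stats` into the `async` block instead would fail the `'static` bound, which is the reason for switching from bare references to `Rc`.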

Some numbers, with frs being the benchmarks with futures-rs:

run_sync                time:   [5.1110 us 5.1322 us 5.1545 us]
run_async               time:   [16.318 us 16.401 us 16.498 us]
run_async_frs           time:   [23.689 us 23.780 us 23.876 us]
run_sync_with_syscalls  time:   [136.34 us 137.01 us 137.70 us]
run_async_with_syscalls time:   [150.86 us 151.56 us 152.32 us]
run_async_frs_with_syscalls                        
                        time:   [159.86 us 160.47 us 161.14 us]

As expected, the futures-rs executor is a little slower. My guess is this is mostly due to its use of Box when spawning, and perhaps a little bit due to using Arc for waking.

Boxing futures is a very reasonable thing to do, though. AFAIK tokio does it too. Trying to avoid that (as my executor in this repo does) would be impractical in most real apps.

JakkuSakura commented 3 years ago

Thanks for following up.

> Yes, but I don't see a way to specify where to run the reactor code. Basically, I need to run FakeReactor::poll whenever all futures have returned Poll::Pending. It would be nice if tokio had an "on idle" hook or such.
>
> Otherwise, all I can think of is:
>
>   1. Find a way to run the reactor as a future within the executor. I don't know if this is possible.
>   2. Make fakeio thread-safe, and run the reactor in a background thread before starting the single-threaded tokio executor. So there would be exactly 2 threads.
>
> Probably the latter approach is the most practical. It would add some overhead, but it's closer to how tokio is meant to be used.

You can implement Future trait for FakeReactor and call waker.wake_by_ref() before returning Poll::Pending in FakeReactor every time.
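A rough sketch of that suggestion, using only the standard library. `ReactorFuture`, `CountingWaker` and `drive` are hypothetical names, and decrementing `rounds_left` stands in for one call to the real reactor's poll. It shows that a future which wakes itself before returning `Poll::Pending` gets polled again. It does not control when the reactor is polled relative to other tasks; that ordering is up to the executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Counts wake-ups so the effect of `wake_by_ref` can be observed.
#[derive(Default)]
pub struct CountingWaker {
    pub wakes: AtomicUsize,
}

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.wakes.fetch_add(1, Ordering::Relaxed);
    }
}

/// Hypothetical reactor-as-future: one round of reactor work per poll.
pub struct ReactorFuture {
    pub rounds_left: u32,
}

impl Future for ReactorFuture {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.rounds_left == 0 {
            return Poll::Ready(());
        }
        self.rounds_left -= 1; // stand-in for one reactor poll
        cx.waker().wake_by_ref(); // ask the executor to poll this future again
        Poll::Pending
    }
}

/// Polls the reactor future to completion.
/// Returns (number of polls, number of wake-ups observed).
pub fn drive(rounds: u32) -> (usize, usize) {
    let counter = Arc::new(CountingWaker::default());
    let waker = Waker::from(Arc::clone(&counter));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(ReactorFuture { rounds_left: rounds });
    let mut polls = 0;
    loop {
        polls += 1;
        // A real executor would re-poll only because the wake was recorded.
        if fut.as_mut().poll(&mut cx).is_ready() {
            break;
        }
    }
    (polls, counter.wakes.load(Ordering::Relaxed))
}
```

With `drive(3)`, the future is polled four times and wakes itself three times, once for every `Pending` it returns.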

        let f = do_async(spawn, ctx, reactor, stats, AsyncInvoke::Connection(stream));

What's the intention of passing in ctx and reactor? AFAIK, if you implement a future, ctx will be passed automatically. In a well-designed async function, you don't have to pass ctx and reactor manually.

jkarneges commented 3 years ago

The problem with trying to put the reactor in a future is it needs to run only after all other futures have reported pending. It might be possible to relax this requirement by refactoring the fakeio module.

The reason ctx and reactor are passed is to avoid using thread local storage. This is because the code avoids threading primitives. I don't know if thread local storage has overhead or not, though. If it doesn't, perhaps it could be used.
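The two options can be compared side by side in a small sketch. `ToyReactor`, `register_explicit` and `register_tls` are hypothetical names, not code from the repo. The thread-local version removes the extra parameter, but whether it adds measurable overhead here would have to be benchmarked. As far as I know, a `thread_local!` with a non-const initializer does a lazy-initialization check on access.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the reactor handle passed around in the benchmark.
pub struct ToyReactor {
    pub registrations: Cell<u32>,
}

/// Explicit passing: the caller supplies the reactor on every call.
pub fn register_explicit(reactor: &ToyReactor) -> u32 {
    reactor.registrations.set(reactor.registrations.get() + 1);
    reactor.registrations.get()
}

thread_local! {
    // One reactor per thread, created lazily on first access.
    static REACTOR: ToyReactor = ToyReactor { registrations: Cell::new(0) };
}

/// Thread-local lookup: no parameter, the reactor is found via `REACTOR`.
pub fn register_tls() -> u32 {
    REACTOR.with(|r| register_explicit(r))
}
```

Both functions do the same work. The difference is only in how the reactor is located.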