Benchmark for large futures

The difference between async-trait and the native async fn in traits implementation raises the question of how large futures behave when they are not boxed. This benchmark creates 100K u64's on the stack and runs an actor that passes this state around under both implementations, in order to assess the performance impact of not boxing the futures.
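To make the boxing distinction concrete, here is a minimal sketch that uses only the standard library. It is not taken from ractor or from the benchmark itself: the names `NativeHandler`, `BoxedHandler`, `Worker`, `process` and the payload size `N` are hypothetical, and the boxed trait is a hand-written approximation of the kind of signature `async-trait` expands to. It assumes Rust 1.75 or newer for native `async fn` in traits.

```rust
use std::future::Future;
use std::mem::size_of_val;
use std::pin::Pin;

// Hypothetical payload size, kept smaller than the benchmark's so the
// sketch stays light on the stack.
const N: usize = 10_000;

// Native `async fn` in a trait: the returned future is an anonymous type
// that stores everything held across an await point inline.
trait NativeHandler {
    async fn handle_native(&self) -> u64;
}

// Approximation of an `async-trait` style signature: the future is moved
// to the heap and the caller only holds a fat pointer to it.
trait BoxedHandler {
    fn handle_boxed<'a>(&'a self) -> Pin<Box<dyn Future<Output = u64> + Send + 'a>>;
}

struct Worker;

// Holds a large array across an await point, so the array becomes part of
// the future's state.
async fn process() -> u64 {
    let data = [1u64; N];
    std::future::ready(()).await;
    data.iter().sum()
}

impl NativeHandler for Worker {
    async fn handle_native(&self) -> u64 {
        process().await
    }
}

impl BoxedHandler for Worker {
    fn handle_boxed<'a>(&'a self) -> Pin<Box<dyn Future<Output = u64> + Send + 'a>> {
        Box::pin(process())
    }
}

// Size in bytes of the un-boxed future returned by the native trait method.
fn native_future_size() -> usize {
    let worker = Worker;
    let fut = worker.handle_native();
    size_of_val(&fut)
}

// Size in bytes of the handle returned by the boxed trait method.
fn boxed_handle_size() -> usize {
    let worker = Worker;
    let fut = worker.handle_boxed();
    size_of_val(&fut)
}

fn main() {
    println!("native future: {} bytes", native_future_size());
    println!("boxed handle:  {} bytes", boxed_handle_size());
}
```

The boxed handle is two pointers wide regardless of `N`, while the native future is at least `N * 8` bytes because the array lives inside it. Moving the native future around may therefore copy the whole payload, whereas the boxed version pays for a heap allocation up front and then only moves a pointer. That trade-off is what the benchmark is trying to measure.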

Results

Without `async-trait`, using native `async fn` in the trait

3 Trials, with 50k u64's on the stack

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime
    Finished bench [optimized] target(s) in 1.57s
     Running benches/async_traits.rs (target/release/deps/async_traits-c230612eb7565929)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [134.00 µs 135.02 µs 136.16 µs]
                        change: [+186.13% +189.02% +191.85%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime
    Finished bench [optimized] target(s) in 0.23s
     Running benches/async_traits.rs (target/release/deps/async_traits-c230612eb7565929)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [127.87 µs 128.85 µs 129.98 µs]
                        change: [-4.6896% -3.6128% -2.5389%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime
    Finished bench [optimized] target(s) in 0.27s
     Running benches/async_traits.rs (target/release/deps/async_traits-c230612eb7565929)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [127.87 µs 128.90 µs 129.95 µs]
                        change: [-1.3544% -0.2167% +0.9568%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Using `async-trait`

3 Trials, with 50k u64's on the stack

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime,async-trait
    Finished bench [optimized] target(s) in 1.57s
     Running benches/async_traits.rs (target/release/deps/async_traits-9971906e70dc3ec1)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [134.10 µs 134.85 µs 135.60 µs]
                        change: [+3.7305% +4.8950% +5.9779%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime,async-trait
    Finished bench [optimized] target(s) in 0.33s
     Running benches/async_traits.rs (target/release/deps/async_traits-9971906e70dc3ec1)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [135.92 µs 136.63 µs 137.35 µs]
                        change: [+0.9529% +2.2175% +3.9541%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

$ cargo bench --bench async_traits -p ractor --no-default-features -F tokio_runtime,async-trait
    Finished bench [optimized] target(s) in 0.25s
     Running benches/async_traits.rs (target/release/deps/async_traits-9971906e70dc3ec1)
Gnuplot not found, using plotters backend
Waiting on 50 messages with large data in the Future to be processed
                        time:   [132.12 µs 132.77 µs 133.41 µs]
                        change: [-5.5130% -3.9929% -2.7036%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

The two implementations show very similar timings here: point estimates of roughly 129 to 135 µs for native async fn versus 133 to 137 µs with async-trait. If we reduce the amount of data held on the stack, the native async fn implementation starts to outperform the async-trait version.
