`bencher.with_inputs()` affects benchmark timing

polarathene commented 7 months ago

While trying Divan I noticed when I added a bench with a heavier input the timing was quite a bit high.
Since I had only used basic benches prior I thought it was my function in bench_values(), yet when I removed it there was still notable overhead.
After removing the mock data function call for input, that overhead was gone.

The docs for with_inputs() imply this should not contribute to timing?:

Generate inputs for the benchmarked function.

Time spent generating inputs does not affect benchmark timing.

$ cargo bench

Timer precision: 10 ns
example                fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_dynamic_input  660.7 ns      │ 1.312 µs      │ 670.7 ns      │ 718.5 ns      │ 100     │ 200
╰─ with_static_input   0.004 ns      │ 0.008 ns      │ 0.004 ns      │ 0.004 ns      │ 100     │ 819200

// benches/example.rs
use divan::{black_box, Bencher};
use std::collections::HashMap;

fn main() {
  divan::main();
}

#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|_| {
        black_box(42);
    });
}

#[divan::bench]
fn with_static_input(b: Bencher) {
    b.with_inputs(|| {
        42
    }).bench_values(|_| {
        black_box(42);
    });
}

// Something to generate a larger input:
// Actual method randomized fixed length strings

fn mock_data() -> Vec<String> {
    (0..128).map(|_| {
      "divan".to_string()
    }).collect()
}

I've tried to look through the current docs and announcement blogpost, but it's not clear to me why this is happening?

If I instead swap the mock data generator function for one that just sleeps instead of allocates:

use std::{thread, time};

fn wait_a_bit() {
  let delay = time::Duration::from_nanos(500);
  thread::sleep(delay);
}

It will take longer to complete, but does not affect the timing:

Timer precision: 10 ns
example                fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ with_dynamic_input  0.002 ns      │ 1.747 ns      │ 0.018 ns      │ 0.093 ns      │ 100     │ 409600
╰─ with_static_input   0 ns          │ 0.002 ns      │ 0 ns          │ 0.001 ns      │ 100     │ 819200

It only affects timing when I return the mocked data and the impact scales with the size of the input. I assume it's related to the example description of Drop behaviour here, but I'm not sure how to exclude that overhead.

polarathene commented 7 months ago

I just noticed that compared to the strings example, I was using bench_values() instead of bench_refs(). The latter avoids the overhead.

I assume that you should still be able to bench a method by value without the timing overhead from the input.

nvzqz commented 7 months ago

I was using bench_values() instead of bench_refs(). The latter avoids the overhead.

It appears what's happening is you're dropping Vec<String> in every run of your benchmarked function. The timing taking longer is expected in this case. You would need to forward it back to Divan by returning it. Or use bench_refs(), which does semantically the same (inout argument).

polarathene commented 7 months ago

You would need to forward it back to Divan by returning it.

That was it, cheers!

Is that related to the Drop example in docs? It only mentioned a memory cost, with a later example of not needing to return a value.

The benched function would take a hashmap as input and return a string. That value was also implicitly returned in the bench_values() closure, but I didn't expect an unused input to introduce timing overhead when not used.

// Slow without return
#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|_| {
        black_box(42);
    });
}

// Slow with return
#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|_| {
        // without black_box, also still slow
        black_box(42)
    });
}

// Fast, no notable timing overhead issues
#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|x| {
        x
    });
}

Might be worth clarifying that in with_inputs() to mention timing will be affected if the input does not defer drop?

Or I guess this issue is sufficient itself for troubleshooting the cause 👍

nvzqz commented 7 months ago

but I didn't expect an unused input to introduce timing overhead when not used.

Rust was doing what you were telling it to. You just weren't telling it what you wanted it to do 😉. I was hoping the name "values" would help make the move semantics obvious but it seems to still be subtle.

@polarathene Please submit a PR to improve the docs here! I'm sure you would have good phrasing for it.

polarathene commented 7 months ago

Rust was doing what you were telling it to. You just weren't telling it what you wanted it to do 😉. I was hoping the name "values" would help make the move semantics obvious but it seems to still be subtle.

I thought from the black_box() docs on compiler optimizations, an unused parameter wouldn't have that sort of impact.

I'm familiar with value vs ref and thought I understood them well enough 😅 Just to be clear my confusion with this issue was:

My benched function did take the input value and use it.
I was curious and replaced my function to just return a simple value, expecting it to be quick like my earlier benches which didn't use with_inputs().
I did not expect divan to report time unrelated to what was going outside of the benched function, especially when the input was changed to _ and not used 🤷‍♂️ This was my first time trying benching with Rust, so it was perhaps the wrong expectation.

#[divan::bench]
fn with_dynamic_input(b: Bencher) {
    b.with_inputs(|| {
        mock_data()
    }).bench_values(|x| {
        // If operation doesn't use input, or input is not returned as it was
        // Timing Overhead, regardless of operation
        do_stuff(x)

        // <1ns if instead returning input as-is
        x
    });
}

// <1ns, no overhead
fn do_stuff(s: Vec<String>) -> Vec<String> {
    s
}

// 700ns, returning modified input
fn do_stuff(mut s: Vec<String>) -> Vec<String> {
    s.truncate(1);
    s
}

// 1.4µs, returning different type from input
fn do_stuff(mut s: Vec<String>) -> String {
    s.concat()
}

// 700ns, returning a different type, ignoring input
fn do_stuff(s: Vec<String>) -> u8 {
    42
}

// 700ns, not returning anything
fn do_stuff(s: Vec<String>) {
    42;
}

When the timing is <1ns my understanding is it's been optimized out.
Everything else that is taking a minimum of 700ns is unrelated to the function, and more to do with the input itself adding overhead? 🤷‍♂️
This behaviour was odd to me. Not that I'd avoid using the input in the benched function call, I just didn't expect it to be intentionally contributing to the measured timing like that, since I wanted to compare multiple approaches with the same sized input, without overhead noise.

@polarathene Please submit a PR to improve the docs here! I'm sure you would have good phrasing for it.

I'm happy to submit a PR if another user chimes in with the same confusion, except given the above I still don't quite follow why divan should be including that 700ns 🤔

In my last comment, I thought I just had to consume the input and/or return something, thinking the timing was due to my function logic alone after all, but it seems there is still overhead from the input size regardless of what happens in do_stuff() (at least when passing by value).

nvzqz / divan

`bencher.with_inputs()` affects benchmark timing #35