nvzqz / divan

Fast and simple benchmarking for Rust projects
https://nikolaivazquez.com/blog/divan/
Apache License 2.0

Counting internal iterations #58

Open 3tilley opened 2 weeks ago

3tilley commented 2 weeks ago

Hello,

I'm really enjoying using divan and I think it lives up to its promise of making quick benchmarks easy to create.

One issue I've come up against is how to benchmark functions that perform an operation multiple times. I think there are two items here, the first being an opportunity to improve some of the docs, which I'm happy to help with: between sample_size, sample_count, item_count, and iterations, it's not entirely obvious what affects what. I think I can improve that fairly quickly, though.
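For context, my current (possibly wrong) mental model of those knobs, based mostly on the output below, is roughly this, and I'd be happy to be corrected:

// Illustration only, using my current understanding of the knobs:
// - sample_size:  iterations timed together as one sample
// - sample_count: number of samples collected (the "samples" column)
// - the "iters" column then comes out as roughly sample_size * sample_count
// - an ItemsCount counter only scales the reported throughput, not the time
#[divan::bench(sample_size = 100, sample_count = 50)]
fn knobs_example(bencher: divan::Bencher) {
    // 50 samples of 100 timed iterations each => 5000 iterations in total
    bencher.bench_local(|| std::hint::black_box(3_usize) + 4);
}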

The second is what I believe is a missing feature, whose absence can make it look like there is buggy behaviour. I'm using the demo code below, with two functions: one adds a pair of numbers together n times, the other sleeps for 100µs n times. I've added the sleep because the compiler likes to optimise the add away, and the sleep makes the issue very clear.

I'd like to be able to indicate to divan that the function I'm running already has some iterations baked into it, and to have that accounted for in its reporting. What's confusing (and what initially made me think there was a bug) is that if I add an ItemsCount, the throughput correctly accounts for this, but the total time doesn't.

Demo code:

// main.rs
use std::hint::black_box;
use std::thread::sleep;
use std::time::Duration;

pub fn add_multiple(left: usize, right: usize, n: usize) -> usize {
    let mut sum = 0;
    for _ in 0..n {
        sum += black_box(left) + right
    }
    sum
}

pub fn sleeper(micros: usize, n: usize) {
    for _ in 0..n {
        sleep(Duration::from_micros(micros as u64))
    }
}

#[divan::bench(args = [1, 10, 100, 1000, 10000])]
fn add_n_times(bencher: divan::Bencher, n: usize) {
    bencher.counter(n).bench_local(|| add_multiple(3, 4, n));
}

#[divan::bench(args = [1, 10, 100, 1000], sample_count=10)]
fn sleep_n_times(bencher: divan::Bencher, n: usize) {
    bencher.counter(n).bench_local(|| sleeper(100, n));
}

fn main() {
    divan::Divan::from_args().run_benches();
}

Output

$ cargo -q run --release
Timer precision: 100 ns
divan_n           fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ add_n_times                  │               │               │               │         │
│  ├─ 1           1.188 ns      │ 2.262 ns      │ 1.2 ns        │ 1.298 ns      │ 100     │ 819200
│  │              841.7 Mitem/s │ 442 Mitem/s   │ 833.3 Mitem/s │ 770.4 Mitem/s │         │
│  ├─ 10          3.715 ns      │ 9.208 ns      │ 3.91 ns       │ 4.15 ns       │ 100     │ 409600
│  │              2.691 Gitem/s │ 1.086 Gitem/s │ 2.557 Gitem/s │ 2.409 Gitem/s │         │
│  ├─ 100         46.63 ns      │ 59.91 ns      │ 47.02 ns      │ 48.45 ns      │ 100     │ 25600
│  │              2.144 Gitem/s │ 1.669 Gitem/s │ 2.126 Gitem/s │ 2.063 Gitem/s │         │
│  ├─ 1000        474.7 ns      │ 577.8 ns      │ 477.8 ns      │ 478.1 ns      │ 100     │ 3200
│  │              2.106 Gitem/s │ 1.73 Gitem/s  │ 2.092 Gitem/s │ 2.091 Gitem/s │         │
│  ╰─ 10000       4.749 µs      │ 5.924 µs      │ 4.774 µs      │ 4.807 µs      │ 100     │ 400
│                 2.105 Gitem/s │ 1.687 Gitem/s │ 2.094 Gitem/s │ 2.08 Gitem/s  │         │
╰─ sleep_n_times                │               │               │               │         │
   ├─ 1           533.4 µs      │ 555.2 µs      │ 537.9 µs      │ 541.1 µs      │ 10      │ 10
   │              1.874 Kitem/s │ 1.8 Kitem/s   │ 1.858 Kitem/s │ 1.847 Kitem/s │         │
   ├─ 10          5.271 ms      │ 5.792 ms      │ 5.322 ms      │ 5.37 ms       │ 10      │ 10
   │              1.897 Kitem/s │ 1.726 Kitem/s │ 1.878 Kitem/s │ 1.862 Kitem/s │         │
   ├─ 100         54.41 ms      │ 57.03 ms      │ 55.95 ms      │ 55.96 ms      │ 10      │ 10
   │              1.837 Kitem/s │ 1.753 Kitem/s │ 1.787 Kitem/s │ 1.786 Kitem/s │         │
   ╰─ 1000        559.7 ms      │ 597.5 ms      │ 577.4 ms      │ 575.6 ms      │ 10      │ 10
                  1.786 Kitem/s │ 1.673 Kitem/s │ 1.731 Kitem/s │ 1.737 Kitem/s │         │

It's easiest to see with the sleep benchmark: the time jumps by an order of magnitude per line, while the throughput stays consistent.

Just focussing on the median, for each line the output gives: n | total_time | (n / time) | iterations=iterations

I'd like a way to make it output: n | total_time / n | (n / time) | iterations=iterations * n
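To make that concrete with the numbers above: for sleep_n_times with n = 1000, the median line currently reads 577.4 ms, 1.731 Kitem/s, iters = 10. What I'd like reported instead is roughly 577.4 ms / 1000 ≈ 577.4 µs (time per internal iteration), the same 1.731 Kitem/s throughput, and iters = 10 × 1000 = 10000.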

Solutions

1. Make ItemsCount affect the reported time taken

I actually think this is the clearest option, but it's a breaking change, and it would be confusing alongside the other counter types, BytesCount and CharsCount, which presumably wouldn't affect the reported time.

2. Add another way of indicating to the Bencher that there are multiple iterations

This would be fine, except that there are already lots of very similar concepts in use (samples, iterations, items) which mean different things, and adding another one adds to the mental load. That solution might look something like:

#[divan::bench(args = [1, 10, 100, 1000], sample_count=10)]
fn sleep_n_times(bencher: divan::Bencher, n: usize) {
    bencher
        .internal_iterations(n)
        .bench_local(|| sleeper(100, n));
}

I'd be happy to make either change, but I suspect it might be contentious, so I'd love to hear your thoughts! Note that the above functionality could of course be achieved by adding / sleeping only once and having divan handle all of the iterations (sketched below). But there are some real-world functions that are either vectorised or simply do more than one iteration already.
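For completeness, that workaround would look something like this, with one sleep per closure call and divan driving the repetition:

#[divan::bench(sample_count = 10)]
fn sleep_once(bencher: divan::Bencher) {
    // A single 100µs sleep per iteration; divan's own sample/iteration
    // machinery supplies the repetition instead of the function itself.
    bencher.bench_local(|| sleeper(100, 1));
}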

And if what I'm asking is already possible, I'm very happy to add it to the documentation!

nvzqz commented 5 days ago

I didn't consider this initially but you're right that there should be a way to focus benchmarks on throughput rather than latency. Right now the output prioritizes minimum time spent. If each benchmark has the same number of internal iterations, or internal iterations get externally normalized, then benchmarks across functions are somewhat comparable.

While individual SIMD benchmarks are valuable, I feel that end-to-end benchmarking of an algorithm that uses high-throughput SIMD techniques would be more valuable/concrete.