rust-lang / libm

A port of MUSL's libm to Rust.
Apache License 2.0

New benchmarks #220

Open kpp opened 5 years ago

kpp commented 5 years ago

At the moment several PRs are blocked by the lack of good benchmarks (https://github.com/rust-lang-nursery/libm/pull/169 and friends).

Lokathor commented 5 years ago

I believe there are some settings you can use with criterion that will cause the task to error out, and thus CI to fail, if a run regresses by too much on a benchmark. For this to work, you need to have some "baseline" data checked in somewhere in the repo, of course. I have only used criterion a tiny bit, so I don't want to say for sure.

Minor reminder: black_box is somewhat dodgy because fundamentally there's no guaranteed way for it to work. In practice it mostly works, but using it too much or too little can quickly throw off the results.

kpp commented 5 years ago

Things discovered:

  1. No need to iterate a small function to benchmark it https://bheisler.github.io/criterion.rs/book/faq.html#how-should-i-benchmark-small-functions

  2. You used black_box the wrong way:

--- a/crates/libm-bench/benches/bench.rs
+++ b/crates/libm-bench/benches/bench.rs
@@ -11,13 +11,13 @@ macro_rules! unary {
         pub fn [<$func>](bh: &mut Bencher) {
             let mut rng = rand::thread_rng();
             let x = rng.gen::<f64>();
-            bh.iter(|| test::black_box(libm::[<$func>](x)))
+            bh.iter(|| test::black_box(libm::[<$func>](test::black_box(x))))
         }
         #[bench]
         pub fn [<$func f>](bh: &mut Bencher) {
             let mut rng = rand::thread_rng();
             let x = rng.gen::<f32>();
-            bh.iter(|| test::black_box(libm::[<$func f>](x)))
+            bh.iter(|| test::black_box(libm::[<$func f>](test::black_box(x))))
         }
     }

And I got realistic results for exp2f: 1ns regressed to 4ns. I wrote a simple bench using criterion:

use criterion::{BatchSize, Criterion};
use rand::Rng;

fn bench_exp2f(c: &mut Criterion) {
    c.bench_function("exp2f", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f32>(),
            |v| libm::exp2f(v),
            BatchSize::SmallInput,
        )
    });
}

And it gave me time: [3.7904 ns 3.8298 ns 3.8823 ns], roughly the same result as libtest.

Testing exp2f on a constant with no black_box:

fn bench_exp2f_c(c: &mut Criterion) {
    c.bench_function("exp2f c", |b| b.iter(|| libm::exp2f(20f32)));
}

gave me time: [286.86 ps 288.06 ps 289.53 ps], i.e. the call was inlined and the result cached. So the strategy should be to measure a real execution, not an inlined one.

  3. We can actually fail a CI stage on bench regressions: https://bheisler.github.io/criterion.rs/book/faq.html#how-should-i-run-criterionrs-benchmarks-in-a-ci-pipeline

gnzlbg commented 5 years ago

The math functions in this library are not constant time; that is, they take a different amount of time to execute depending on the input value passed to them, because they do a different amount of work. For example, calling exp2f(NaN) does a test and returns NaN, while exp2f(3.14) does a relatively large amount of computation.

The current benchmarks generate a random value each time, making it meaningless to compare the measurements of two benchmark runs: the first time exp2f(NaN) might be called, while the second time exp2f(3.14) might be called, producing two completely different results even when the same algorithm is used.

That is, the current benchmark results are meaningless for evaluating whether one algorithm is faster than another. It can go either way, depending on the randomly generated inputs.

If you want to compare two algorithms, the first thing you need is a way to test that the algorithms are correct. The current tests only check that the results "almost" match those of musl for a couple of inputs, which isn't the same thing as being correct. For example, if the allowed error is 1 ULP and musl has 1 ULP of error, having 1 ULP of error with respect to musl might mean that the algorithm has 2 ULPs of error with respect to the exact result and is therefore wrong. There are also a couple of things that we don't test (e.g. FP exception flags, or that the fe-environment is properly restored) which are important when these functions are called from other programming languages like C or from assembly, and which are a requirement for a global C-compliant math library.
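For illustration, a minimal sketch of the kind of ULP-distance helper such a comparison needs (the name `ulp_diff` is made up here, and NaN/infinity handling is glossed over):

```rust
/// Sketch: distance in ULPs between two finite f32 values.
fn ulp_diff(a: f32, b: f32) -> u64 {
    // Map the IEEE-754 bit patterns onto a monotonically increasing integer
    // scale, so that adjacent floats differ by exactly 1.
    fn ordered(x: f32) -> i64 {
        let bits = x.to_bits() as i32;
        if bits < 0 { (i32::MIN - bits) as i64 } else { bits as i64 }
    }
    (ordered(a) - ordered(b)).unsigned_abs()
}
```

With something like that, "1 ULP against musl" and "1 ULP against the correctly rounded result" can at least be measured separately.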

Then you need to actually generate benchmarks for relevant inputs requiring different amounts of work, and only then can you compare two algorithms, and see if one is truly better than the other for all inputs, only some, etc.

kpp commented 5 years ago

The math functions in this library are not constant time; that is, they take a different amount of time to execute depending on the input value passed to them, because they do a different amount of work.

They are not constant-time in the cryptographic sense, but they do a bounded amount of work in the worst case.

fn bench_exp2f_param(c: &mut Criterion) {
    c.bench(
        "exp2f",
        ParameterizedBenchmark::new("exp2f",
                |b, i| b.iter(|| libm::exp2f(*i)),
                vec![std::f32::NAN, 0f32, 3.14])
    );
}

fn bench_new_exp2f_param(c: &mut Criterion) {
    c.bench(
        "exp2f",
        ParameterizedBenchmark::new("new exp2f",
                |b, i| b.iter(|| libm::new_exp2f(*i)),
                vec![std::f32::NAN, 0f32, 3.14])
    );
}

gives

exp2f/NaN               time:   [1.0307 ns 1.0327 ns 1.0350 ns]
                        change: [-0.6508% +0.3993% +1.4343%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  8 (8.00%) high severe
exp2f/0.0               time:   [1.0255 ns 1.0272 ns 1.0291 ns]
                        change: [-0.7784% +0.4512% +1.8042%] (p = 0.49 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
exp2f/3.14              time:   [3.6689 ns 3.6768 ns 3.6856 ns]
                        change: [-2.8241% -0.9796% +0.6951%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

exp2f/new exp2f/NaN     time:   [772.36 ps 773.91 ps 775.78 ps]
                        change: [-7.6337% -6.3208% -5.2124%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe
exp2f/new exp2f/0.0     time:   [520.55 ps 522.72 ps 525.03 ps]
                        change: [-8.7749% -7.4035% -6.1551%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
exp2f/new exp2f/3.14    time:   [14.995 ns 15.056 ns 15.139 ns]
                        change: [-1.7571% -0.2988% +0.8373%] (p = 0.69 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  8 (8.00%) high severe

Which seems fair, doesn't it?

That is, the current benchmark results are meaningless for evaluating whether one algorithm is faster than another. It can go either way, depending on the randomly generated inputs.

Yes, they can. However, I generate a new random value for each call, which gives us the average cost of a call over the distribution of f32.

kpp commented 5 years ago

Then you need to actually generate benchmarks for relevant inputs requiring different amounts of work, and only then can you compare two algorithms, and see if one is truly better than the other for all inputs, only some, etc.

Alright. I got it. In the current benchmarks we bench an average time, whereas you would like to see some special cases. How about moving to criterion to bench the average in a first iteration, and then adding special cases one by one, only if we need them?

gnzlbg commented 5 years ago

Alright. I got it. In the current benchmarks we bench an average time, whereas you would like to see some special cases.

We don't benchmark average times. The current benchmarks measure the time it takes for two different implementations to compute two different things.

Trying to conclude anything about averages from that is flawed.

If you want to make any claims about average performance, you either need an actual proof, or to measure the performance of all possible inputs, and average that.

How about moving to criterion to bench the average in a first iteration, and then adding special cases one by one, only if we need them?

What's the point of doing this? What would comparing two algorithms for two different random inputs tell you about the algorithms?

At best you can compare the same algorithm against itself, and conclude that your algorithm is orders of magnitude slower or faster depending on the inputs, e.g., libstd pow vs libstd pow.

kpp commented 5 years ago

We don't benchmark average times.

You benchmark the performance for one single random input:

        #[bench]
        pub fn [<$func>](bh: &mut Bencher) {
            let mut rng = rand::thread_rng();
            let x = rng.gen::<f64>();
            bh.iter(|| libm::[<$func>](test::black_box(x)))
        }

It can be the best case, it can be the worst case, it can be somewhere in the middle.

The current benchmarks measure the time it takes for two different implementations to compute two different things.

That's right. That's what I am talking about.

you either need an actual proof, or to measure the performance of all possible inputs, and average that.

An output from cargo bench:

Benchmarking exp2f: Collecting 100 samples in estimated 5.0001 s (341M iterations)

Go and calculate the probability of all these 341M iterations being the worst case for the old implementation and the best case for the new implementation.

Here are nice plots for avg benches.

Old

[plot: criterion report for the old implementation]

|           | Lower bound | Estimate  | Upper bound |
|-----------|-------------|-----------|-------------|
| Slope     | 6.3326 ns   | 6.4026 ns | 6.4725 ns   |
| R²        | 0.7447419   | 0.7558406 | 0.7447811   |
| Mean      | 6.3566 ns   | 6.4577 ns | 6.5724 ns   |
| Std. Dev. | 373.22 ps   | 554.70 ps | 727.17 ps   |
| Median    | 6.2948 ns   | 6.4147 ns | 6.4715 ns   |
| MAD       | 269.11 ps   | 363.41 ps | 434.88 ps   |

New

[plot: criterion report for the new implementation]

|           | Lower bound | Estimate  | Upper bound |
|-----------|-------------|-----------|-------------|
| Slope     | 24.340 ns   | 24.559 ns | 24.776 ns   |
| R²        | 0.8354173   | 0.8439594 | 0.8355921   |
| Mean      | 24.477 ns   | 24.793 ns | 25.125 ns   |
| Std. Dev. | 1.2167 ns   | 1.6566 ns | 2.0507 ns   |
| Median    | 24.381 ns   | 24.579 ns | 24.799 ns   |
| MAD       | 771.79 ps   | 1.1077 ns | 1.3439 ns   |

Understanding this report:

The plot on the left displays the average time per iteration for this benchmark. The shaded region shows the estimated probability of an iteration taking a certain amount of time, while the line shows the mean. Click on the plot for a larger view showing the outliers.

The plot on the right shows the linear regression calculated from the measurements. Each point represents a sample, though here it shows the total time for the sample rather than time per iteration. The line is the line of best fit for these measurements.

See the documentation for more details on the additional statistics.

Let's get back to the subject

What's the point of doing this? What would comparing two algorithms for two different random inputs tell you about the algorithms?

A lot. More than comparing two different implementations for only two different inputs.

gnzlbg commented 5 years ago

A lot

@kpp could you articulate what that is, with words?

You are generating a random input once, and then measuring a math function for that single input many times. The function is pure, does no I/O, does not access the memory hierarchy, etc. The only thing all those beautiful graphs are showing is how much noise your measurement system has (the fastest execution takes ~15ns, the slowest ~3Xns, that's some noise!).

I at least do not care about that. I care about the execution time of the function, and the closest value one can get to that from criterion (which is a very bad tool for this; you probably want to use the TSC, which criterion doesn't use) is the time of the fastest execution.

So the only information your N-page criterion report is giving you is a single number: the fastest execution of algorithm A0 for input I0. Everything else could be relevant for other algorithms that do memory I/O, file I/O, network I/O, etc., but isn't relevant here. (EDIT: or at least I don't know why we would care about how slowly a CPU under load can execute a single function without characterizing the load, where the answer to "how slowly" is probably infinitely slowly if the OS never schedules it.)

Now, because you pick up the input at random, when you rerun for algorithm A1, you get the fastest execution of algorithm A1 for the different input I1.

What does that tell you? Absolutely nothing: you have two measurements for two different algorithms, but because they are for two different inputs, you cannot conclude anything about the relative performance of algorithms A0 and A1.

Were you to use the same input I for measuring both algorithms, you would get two results, and at least be able to conclude, for that single input, which of the two algorithms is faster.

kpp commented 5 years ago

could you articulate what that is, with words?

Yes.

Let's say we have a function F with special cases at 0 and NaN, a small computation on (0, a], and a lot of computation on (a, INF). Let's assume we have four benches with hardcoded constants: B1 = 0, B2 = NaN, B3 picked from (0, a], B4 picked from (a, INF).

So we got four bench results:

B1 = 2ns
B2 = 2ns
B3 = 4ns
B4 = 6ns

A new impl will give us:

B1 = 100us
B2 = 100us
B3 = 5ns
B4 = 1ns

That is a significant improvement for B1, B2, B4 (6ns -> 1ns) but a bit of a regression for B3 (4ns -> 5ns). Can we accept a PR with such changes without an average bench, given that we have no idea what a is?

kpp commented 5 years ago

Were you to use the same input I for measuring both algorithms, you would get two results, and at least be able to conclude, for that single input, which of the two algorithms is faster.

We can use the same seed in benchmarks =)

gnzlbg commented 5 years ago

We can use the same seed in benchmarks =)

Sure, but the examples you were using above (e.g. here) did not do that.

Can we accept a PR with such changes without an average bench, given that we have no idea what a is?

If someone posts a benchmark comparing two implementations I am going to ask two questions:

gnzlbg commented 5 years ago

So ideally we would have, for each function, a set of inputs that would allow us to get a good picture of the performance characteristics for the function and that we can use to compare different implementations. Such a set can grow over time.

Also ideally, we would have a good way to measure how long it takes for a function to actually run. Instead of criterion, on x86 at least, we can probably just use the TSC to measure how long it takes for a single function to run for a low number N of times for the same input. This only gives us the average over N executions, so we can repeat it a couple of times and take the smallest of those averages. Maybe we can make N as low as 1, I don't know; it might depend on the API being tested and the input being passed.
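A rough sketch of what that could look like on x86_64 (the function name, `n`, and `repeats` are made up for illustration; serializing the TSC reads with `lfence`/`rdtscp` and pinning the thread to a core are left out):

```rust
#[cfg(target_arch = "x86_64")]
fn min_avg_cycles(f: impl Fn(f32) -> f32, x: f32, n: u32, repeats: u32) -> f64 {
    use core::arch::x86_64::_rdtsc;
    use std::hint::black_box;

    let mut best = f64::INFINITY;
    for _ in 0..repeats {
        let start = unsafe { _rdtsc() };
        let mut acc = 0.0f32;
        for _ in 0..n {
            // black_box keeps the call from being hoisted or folded away.
            acc += f(black_box(x));
        }
        let cycles = unsafe { _rdtsc() } - start;
        black_box(acc);
        // Keep the smallest average over `repeats` runs: it is the closest
        // we can get to the undisturbed execution time.
        best = best.min(cycles as f64 / n as f64);
    }
    best
}

// e.g. min_avg_cycles(libm::exp2f, 3.14, 1_000, 10)
```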

kpp commented 5 years ago

If someone posts a benchmark comparing two implementations I am going to ask two questions:

Either way, I see this as a KO, and IMO we need a better benchmark.

Yes, another benchmark. With a → INF you get a performance boost for the 3 special cases only and a regression for all inputs excluding 0, NaN, and INF; an average bench will show that on average you get a regression from 4ns to 5ns, which is meaningful. Right? With a → 0 we get an average boost from 6ns to 1ns, which is great.

So ideally we would have, for each function, a set of inputs that would allow us to get a good picture of the performance characteristics for the function and that we can use to compare different implementations. Such a set can grow over time.

Yes. That's what I suggested in https://github.com/rust-lang-nursery/libm/issues/220#issuecomment-520790464. We can have both: an average bench with a constant seed and a constant number of random inputs, and a set of special cases that can grow over time.
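A sketch of what the seeded average bench could look like (the seed value and the bench name are arbitrary; assumes `rand`'s `StdRng`):

```rust
use criterion::{BatchSize, Criterion};
use rand::{Rng, SeedableRng};

fn bench_exp2f_avg_seeded(c: &mut Criterion) {
    // Fixed seed: the old and the new implementation see exactly the same
    // sequence of inputs, so their averages are at least comparable.
    let mut rng = rand::rngs::StdRng::seed_from_u64(42);
    c.bench_function("exp2f avg (seeded)", |b| {
        b.iter_batched(
            || rng.gen::<f32>(),
            |v| libm::exp2f(v),
            BatchSize::SmallInput,
        )
    });
}
```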

@gnzlbg your thoughts?

gnzlbg commented 5 years ago

I'm not sure. Imagine the case of powf and studying the benchmark results for a -> INF. The larger the input, the larger the cost, so the large values in your random distribution will skew the average. So doing this type of benchmarking might not be very useful for such a function. This suggests that we might want to consider, on a function-by-function basis, for which functions we do this.

If the function takes more or less the same time when computing any input in a -> inf, then this might be fine, I guess. But then again, if that is something we know, then I do not see much value in testing for random values in the whole range, as opposed to picking one or two. We could do this as a sanity test, e.g., to maybe try to detect outliers. For example, if we know that for a function all inputs in some range should more or less take the same time, we could try this type of testing to try to notice when an algorithm change suddenly makes some of those inputs take more time.

Does this make sense?

kpp commented 5 years ago

Does this make sense?

Definitely.

takes more or less the same time when computing any input in a -> inf

For example, if we know that for a function all inputs in some range should more or less take the same time, we could try this type of testing to try to notice when an algorithm change suddenly makes some of those inputs take more time.

I assumed all functions in libm take more or less the same time over some range (preferably (-INF, INF)). Now I have to figure out whether that's true or not.

Does that sound like a plan?

Schultzer commented 5 years ago

@gnzlbg Would it make sense to run the benchmark against all f32 bit patterns for unary functions? We could generate benchmarks for both libm and the system's libm.

gnzlbg commented 5 years ago

Does that sound like a plan?

Sure. I recommend searching for loop keywords like "while" in the library to see where the loops are. Also take into account periodic functions, e.g. sin, which take an argument, map it into a region (e.g. [0, 2pi]), and then evaluate it there. It probably makes little sense to test a -> inf for these, because you would only be testing [0, 2pi] anyway.

Would it make sense to run the benchmark against all f32 bit patterns for unary functions?

IIRC running tests for all bit-patterns for all unary f32 functions took like 30-60 seconds on my machine. Doing 100 iterations would put that in the 1-2h ballpark, but one never needs to run these benchmarks for all functions (e.g. typically a PR discusses a single function). So for a single function benchmarking all inputs might be doable.

Doing this would give us, e.g., the fastest and slowest inputs for a function, which is something that we could then add to the manually specified benchmarks. This would already make it worth doing IMO.

Computing the average over all inputs is definitely a metric that can be used to conclude whether an implementation is faster or slower than another on average. Whether that's useful information, I don't know. It depends on the function.
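As a rough sketch of what such a sweep could look like for a single function (everything here, including `REPS`, is an arbitrary illustration; `Instant` is far coarser than one call, so each input is timed over a small batch, and the full sweep takes a long time):

```rust
use std::hint::black_box;
use std::time::Instant;

fn sweep_exp2f() {
    const REPS: u32 = 32;
    let mut slowest = (0f32, 0u128);
    let mut fastest = (0f32, u128::MAX);

    for bits in 0..=u32::MAX {
        let x = f32::from_bits(bits);
        let start = Instant::now();
        let mut acc = 0f32;
        for _ in 0..REPS {
            acc += libm::exp2f(black_box(x));
        }
        let ns = start.elapsed().as_nanos();
        black_box(acc); // keep the calls from being optimized away
        if ns > slowest.1 { slowest = (x, ns); }
        if ns < fastest.1 { fastest = (x, ns); }
    }
    println!("slowest input: {:e} ({} ns per {} calls)", slowest.0, slowest.1, REPS);
    println!("fastest input: {:e} ({} ns per {} calls)", fastest.0, fastest.1, REPS);
}
```

The slowest and fastest inputs found this way could then be added to the manually specified benchmarks.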

burrbull commented 5 years ago

It is interesting to plot average_time = f(x) for sin(x).

gnzlbg commented 5 years ago

@burrbull for all x? or only for x in [0, 2pi] (and inf, nan, etc.) ?

burrbull commented 5 years ago

For both.

kpp commented 5 years ago

UPD description.

kpp commented 5 years ago

Hey guys... we were using rng.gen::<f64>() the wrong way.

let x: f64 = rng.gen(); // random number in range [0, 1)

According to https://rust-random.github.io/book/guide-start.html
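For reference, a sketch of sampling a wider interval instead (assumes `rand` 0.8's range-based `gen_range`; the bound is just an example):

```rust
use rand::Rng;

fn random_angle() -> f64 {
    // gen::<f64>() only covers [0, 1); gen_range samples an explicit interval.
    rand::thread_rng().gen_range(0.0..2.0 * std::f64::consts::PI)
}
```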

kpp commented 5 years ago

It is interesting to plot average_time = f(x) for sin(x).

@burrbull

fn bench_sin(c: &mut Criterion) {
    c.bench(
        "sin",
        ParameterizedBenchmark::new("const",
                |b, i| b.iter(|| libm::sin(*i)),
                vec![std::f64::NAN, std::f64::INFINITY, std::f64::NEG_INFINITY])
    );
    c.bench_function("sin [0,1)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>(),
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [1,2)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 1.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [2,3)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 2.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [3,4)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 3.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [4,5)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 4.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [5,6)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 5.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [6,7)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() + 6.,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
    c.bench_function("sin [0, 2*PI)", |b| {
        b.iter_batched(
            || rand::thread_rng().gen::<f64>() * 2. * std::f64::consts::PI,
            |v| libm::sin(v),
            BatchSize::SmallInput
        )
    });
}
sin/const/NaN           time:   [2.3111 ns 2.3144 ns 2.3181 ns]                           
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
sin/const/inf           time:   [2.3131 ns 2.3233 ns 2.3377 ns]                           
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  7 (7.00%) high severe
sin/const/-inf          time:   [2.3768 ns 2.3807 ns 2.3851 ns]                            
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

sin [0,1)               time:   [5.8142 ns 5.8405 ns 5.8655 ns]                       
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

sin [1,2)               time:   [5.8471 ns 5.8847 ns 5.9300 ns]                       
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

sin [2,3)               time:   [8.6651 ns 8.8362 ns 9.0381 ns]                       
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

sin [3,4)               time:   [6.3458 ns 6.3718 ns 6.3982 ns]                       
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

sin [4,5)               time:   [6.2409 ns 6.3532 ns 6.4619 ns]                       
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

sin [5,6)               time:   [9.7713 ns 9.8215 ns 9.8846 ns]                       
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

sin [6,7)               time:   [5.7219 ns 5.7407 ns 5.7619 ns]                       
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

sin [0, 2*PI)           time:   [12.104 ns 12.254 ns 12.450 ns]                           
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
burrbull commented 5 years ago

Why is [0, 2*PI) slower than all the chunks? It should be somewhere around their average.

burrbull commented 5 years ago

Can criterion exclude outliers and show time without them?

kpp commented 5 years ago

Why is [0, 2*PI) slower than all the chunks?

I don't know =(

Lokathor commented 5 years ago

Wild guess: it messes up the branch prediction more often.

Schultzer commented 5 years ago

~~2x Multiplication is inherently slower than 1x addition.~~

I would probably just run the benchmark against all bit patterns to begin with.

for i in 0..=u32::max_value() {
    let x = f32::from_bits(i);
    // NOTE: the result needs to be consumed (e.g. via black_box or a running
    // sum), otherwise the optimizer may drop the call entirely.
    libm::sinf(x);
}
kpp commented 5 years ago

Here are some benches of libm found on the internets: https://github.com/JuliaMath/openlibm/blob/master/test/libm-bench.cpp https://github.com/nega0/cpml/blob/master/cpml.c https://github.com/bminor/glibc/tree/master/benchtests