
NOTE: THIS PACKAGE IS DEPRECATED. PLEASE SWITCH OVER TO USING THE BenchmarkTools.jl PACKAGE

Benchmarks.jl

A package to make Julia benchmarking both easy and rigorous. See the design section for details about what we mean by rigor.

Basic Usage Example

For basic benchmarking, you can use the @benchmark macro, which takes in a Julia function call expression and returns the results of benchmarking that expression:

using Benchmarks

@benchmark sin(2.0)

@benchmark sum(rand(1000))

@benchmark rand(1:100)

@benchmark (3.0+5im)^3.2

@benchmark svd(rand(10, 10))

The @benchmark macro will repeatedly call the function many times in order to rigorously estimate its actual execution time (see below for more details). It is designed to test the performance of a single function as a complementary tool alongside @code_warntype and @code_llvm. To this end, all arguments to the benchmarked function are evaluated before benchmarking begins; they are considered setup. This means that, unlike the builtin @time macro, @benchmark sum(rand(1000)) will not include the time or allocations involved in generating the 1000-element random vector. It also means that the same vector is summed many times, so if there is additional variance in the timings due to data dependence, a single @benchmark call will not uncover it.
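If you want the data to be visibly separate from the timed expression, you can bind it to a variable first. Assuming the macro handles a variable argument the same way it handles the call expression rand(1000), evaluating it once as setup, only the summation is timed:

using Benchmarks

x = rand(1000)     # generated once, before any timing happens

@benchmark sum(x)  # only the summation of this fixed vector is measured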

Advanced Usage Example

For more advanced usage (and for all automated usage), you need to use a lower-level interface than the @benchmark macro. Here we give a minimal example of that lower-level interface:

import Benchmarks

Benchmarks.@benchmarkable(
    sin_benchmark!, # benchmark name
    nothing,        # setup expression
    sin(2.0),       # benchmark "core"
    nothing         # teardown expression
)

r = Benchmarks.execute(sin_benchmark!)

In the design section, we'll describe the way that the lower-level interface works.

The Design of the Benchmarks Package

Benchmarking is hard. To do it well, you have to care about the details of how code is executed and about the statistical challenges that make it hard to generalize correctly from benchmark data. To explain the rationale for the design of the package, we discuss the problems that the package is designed to solve.

Measurement Error: Benchmarks that Measure the Wrong Thing

The first problem that the Benchmarks package tries to solve is measurement error: any expression that evaluates faster than the system clock can resolve is vulnerable to having the benchmark measure the system clock's behavior rather than the expression's performance. Naive estimates, like those generated by the @time macro, are often totally inaccurate. To convince yourself, consider the following timings:

@time rand(100);

@time sum(rand(100));

@time sum(rand(100));

On my system, the results of this code often suggest that sum(rand(100)) can be evaluated faster than rand(100). That's obviously absurd -- and the timings returned by @time are unusably inaccurate for small functions like these.

The reason for that inaccuracy is that summing 100 elements completes much faster than the system clock's finest resolution. As such, you're almost exclusively measuring variability in the system clock when you evaluate @time sum(rand(100)).
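You can get a rough feel for that granularity by taking two back-to-back clock readings (a quick, informal check, not part of the package). The difference reflects both the clock's resolution and the cost of reading it, and is typically comparable to, or larger than, the time needed to sum 100 numbers:

t1 = time_ns()
t2 = time_ns()
t2 - t1   # often tens of nanoseconds, even though "nothing" happened in between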

To deal with this, the Benchmarks package exploits a simple linear relationship: evaluating an expression N times in a row should take approximately N times as long as evaluating the same expression exactly 1 time. This linear relationship holds almost perfectly as N grows. Thus, we solve the problem of measurement error by measuring the time it takes to evaluate sum or rand a very large number of times. Then we apply linear regression to estimate the time it takes to evaluate each function exactly once.
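The sketch below illustrates the idea; it is not the package's internal code. We time batches containing an increasing number of evaluations, then fit a line through the (evaluations, total time) points. The slope of that line estimates the cost of a single evaluation, and fixed per-measurement overhead, such as reading the clock, is absorbed into the intercept:

function batch_time(n)
    acc = 0.0
    t0 = time_ns()
    for i in 1:n
        acc += sin(i * 1.0e-9)     # argument varies with i so the call is not hoisted out
    end
    t1 = time_ns()
    return Float64(t1 - t0) + 0.0 * acc   # fold acc in so the loop cannot be removed
end

ns = [10^k for k in 2:7]           # batch sizes
ts = [batch_time(n) for n in ns]   # total nanoseconds per batch

# ordinary least squares slope: estimated nanoseconds per evaluation
slope = (length(ns) * sum(ns .* ts) - sum(ns) * sum(ts)) /
        (length(ns) * sum(ns .^ 2) - sum(ns)^2)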

Since the @benchmark macro treats its arguments as setup, it disentangles the two operations. You can immediately see that the summation is an order of magnitude faster than creating the array of random numbers. If you'd like to measure the two operations together, it's quite simple to create and benchmark a temporary function:

f() = sum(rand(1000))
@benchmark f()

Accounting for Variability

When you repeatedly evaluate the same expression, you find that the timings you measure are not all the same. Instead, there is often substantial variability across measurements.

Traditional statistical methods are often employed to resolve this problem. The Benchmarks package tries to estimate the average time that it would take to evaluate an expression. Because the average is estimated from a small sample of measurements, we have to acknowledge that our estimate is uncertain. We do this by reporting 95% confidence intervals (hereafter called CIs) for the average time per evaluation.
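For a concrete picture of what such an interval is, here is the usual normal-approximation calculation on a handful of made-up per-evaluation timings (the package's exact procedure may differ):

times = [3.1, 3.0, 3.2, 3.05, 2.95]         # hypothetical per-evaluation times, in ns
n = length(times)
m = sum(times) / n                          # sample mean
s = sqrt(sum((times .- m) .^ 2) / (n - 1))  # sample standard deviation
half_width = 1.96 * s / sqrt(n)             # 95% half-width under the normal approximation
ci = (m - half_width, m + half_width)       # reported as [lower, upper]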

Constrained Budgets

Getting the best estimate of the average time requires gathering as many samples as possible. But most people want their benchmarks to finish in a reasonable amount of time, so we need to impose some budget. The Benchmarks package exposes both a sample budget and a time budget. For very long computations, the time budget prevents the benchmarking process from taking more than 10 seconds by default; this limit can be configured.

For mid-range computations, we also allow users to insist on acquiring at most 100 samples. This can be effective if the time budget seems excessive for the purpose of estimating CIs accurately.

For very fast computations that require OLS modeling, we ignore the sample budget, although we respect the user's specified time budget. This is because users cannot reasonably be expected to know how many samples they need to gather to estimate the average time accurately.
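Conceptually, the two budgets for mid-range computations interact like the loop below, which is a simplified illustration rather than the package's implementation; sampling stops as soon as either limit is reached:

max_samples = 100          # sample budget
max_seconds = 10.0         # time budget, in seconds
samples = Float64[]
start = time()
while length(samples) < max_samples && time() - start < max_seconds
    push!(samples, @elapsed sum(rand(1000)))   # one sample per iteration
end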

Dependence on External Resources

Sometimes we want to benchmark functions that depend upon external resources. For example, we might want to time how long it takes to parse a file that must first be downloaded.

To handle these scenarios, the @benchmarkable macro takes in a benchmark name along with three expressions:

A setup expression, which is evaluated before the benchmark core runs (for example, downloading a file).
The benchmark "core", which is the expression whose execution time is actually measured.
A teardown expression, which is evaluated after the core runs (for example, deleting the downloaded file).

Note that it's not possible to affect the core's function arguments through the setup or teardown expressions.
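For instance, a benchmark for parsing a downloaded file might look like the sketch below. The URL and filename are hypothetical, and readlines stands in for whatever parsing routine you actually care about:

import Benchmarks

Benchmarks.@benchmarkable(
    parse_benchmark!,                                       # benchmark name
    download("https://example.com/data.csv", "data.csv"),   # setup: fetch the file
    readlines("data.csv"),                                  # benchmark "core": the parsing step
    rm("data.csv")                                          # teardown: delete the downloaded file
)

r = Benchmarks.execute(parse_benchmark!)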

Avoiding unwanted optimizations

A guiding design principle for the Benchmarks package is that it should measure a function as it would typically perform in real-world usage, and do so as consistently as possible. Repeatedly calling a function many thousands of times in a benchmarking loop, however, is not very typical. If we are not careful in the design, LLVM may be able to perform optimizations that interact with the benchmarking loop or remove it entirely.

In order to prevent unwanted optimizations, the @benchmark macro wraps the function call in an inner @noinline function. By preventing that call from inlining into the benchmarking loop, we force it to always execute as a genuine function call. This puts a floor of one function call's worth of time on each evaluation, even if the benchmarked function does nothing and would otherwise take no time at all:

julia> noop() = nothing
       @benchmark noop()
================ Benchmark Results ========================
     Time per evaluation: 3.04 ns [3.04 ns, 3.05 ns]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 11701
   Number of evaluations: 139311001
         R² of OLS model: 0.997
 Time spent benchmarking: 0.53 s

Note that the noop() function call still inlines as it normally would into the inner function; the above just represents the time it takes to execute one iteration of a loop with just one function call.
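The wrapper pattern looks roughly like the sketch below (a conceptual illustration, not the package's generated code). Because the wrapper is marked @noinline, every trip through the loop pays for at least one genuine call, which is why the noop benchmark above reports roughly 3 ns per evaluation rather than zero:

@noinline run_once(x) = sin(x)   # the call under test lives behind a @noinline barrier

function timing_loop(n)
    for _ in 1:n
        run_once(2.0)            # always a real function call; it cannot fold into the loop
    end
end

timing_loop(1_000_000)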

Another difficulty is that Julia can sometimes propagate constants from a function's arguments into its body and simplify the computation there. As a simple example, let's look at 1+2. The @code_llvm macro shows that + adds its two arguments... but if we create a function where the arguments to + are constant, the call will inline and the addition will be removed entirely!

julia> @code_llvm 1+2
define i64 @"julia_+_23345"(i64, i64) {
top:
  %2 = add i64 %1, %0
  ret i64 %2
}

julia> f() = 1+2
       @code_llvm f()
define i64 @julia_f_23346() {
top:
  ret i64 3
}

In general, we don't know whether a user wants to consider their arguments as constants or not. In order to ensure that the behavior is always consistent, we prevent this constant propagation through the function boundary. This typically makes @benchmark behave as though it is executing the code displayed by @code_llvm/@code_native.
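One way to picture such a barrier, though the package's actual mechanism may differ, is to hide the argument values in a non-constant container so the compiler cannot treat them as compile-time constants:

args = Ref((1, 2))    # a non-const global holding the arguments
g() = +(args[]...)    # the compiler can no longer fold 1 + 2 down to 3

Running @code_llvm g() on this version shows that the result is no longer folded to a constant.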