google / benchmark

A microbenchmark support library
Apache License 2.0

Fortran and Rust Language Support #776

Open AndrewGaspar opened 5 years ago

AndrewGaspar commented 5 years ago

This issue proposes addition of support for Rust and Fortran. I briefly discussed this with @dominichamon on IRC. I have an implementation, but need organizational approval to open source. Posting here first to get buy in before taking on the open source process.

Motivations

My organization is historically a Fortran shop. As we evaluated new options, we wanted to be able to write representative benchmarks in C++, Fortran, and Rust to compare code and performance. I chose to use Google Benchmark for all three, as the benchmark execution is reasonably configurable, and the output is standardized. This makes it easy to run comparisons between the different languages using Google Benchmark's built-in tools. While Rust has competitors to Google Benchmark (cargo-bench, Criterion, etc.), Fortran doesn't have any obviously compelling benchmarking libraries in the style of Google Benchmark.

Mechanism

The implementation of this is done using a C ABI on top of the core Google Benchmark library. Currently everything has to go through the C boundary, including calls to KeepRunning, which introduces some small overhead. iso_c_binding is used to bind the ABI in Fortran portably, and Rust's FFI support is used to bind the ABI in Rust portably.
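To give a feel for the shape of such a layer (the symbol names here are illustrative, not the actual ABI), the Rust side can wrap a C-compatible state behind a safe type. In this self-contained sketch the C++ shim is mocked by a Rust function; in the real binding it would be declared in an `extern "C"` block and resolved against the C++ shim library:

```rust
// Hypothetical sketch of a C-ABI-backed State wrapper; real symbol
// names and signatures may differ.

#[repr(C)]
pub struct BenchmarkState {
    remaining: u64,
}

// Stand-in for the C++-implemented shim. Real code would *declare*
// this in an `extern "C" { ... }` block rather than define it here.
pub extern "C" fn benchmark_state_keep_running(state: &mut BenchmarkState) -> bool {
    if state.remaining == 0 {
        false
    } else {
        state.remaining -= 1;
        true
    }
}

// Thin safe wrapper of the kind the Rust bindings expose. Every call
// to `keep_running` crosses the (mocked) C boundary, which is where
// the per-iteration overhead mentioned in this issue comes from.
pub struct State {
    raw: BenchmarkState,
}

impl State {
    pub fn keep_running(&mut self) -> bool {
        benchmark_state_keep_running(&mut self.raw)
    }
}

fn main() {
    let mut state = State { raw: BenchmarkState { remaining: 3 } };
    let mut iterations = 0;
    while state.keep_running() {
        iterations += 1; // code to benchmark would go here
    }
    println!("{}", iterations); // prints 3
}
```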

Rust Bindings

The current implementation only supports imperative registration of benchmarks, rather than declarative registration à la BENCHMARK.

The user must implement fn main manually:

use benchmark::{benchmarks, benchmarks_generic};

fn main() {
    benchmark::initialize();

    // register benchmarks

    benchmark::run_benchmarks();
}

A benchmark is declared much like in C++:

mod my_mod {
    pub fn foo(mut state: benchmark::State) {
        while state.keep_running() {
            // code to benchmark goes here
        }
    }
}

Generic benchmarks are also supported:

mod vector_add {
    pub fn index<T: Float>(mut state: benchmark::State) {
        let vec_size = state.range(0) as usize;

        let a = vec![T::zero(); vec_size];
        let b = vec![T::zero(); vec_size];

        let mut c = vec![T::zero(); vec_size];

        while state.keep_running() {
            for x in 0..vec_size {
                c[x] = a[x] + b[x];
            }
        }

        benchmark::do_not_optimize(&c);
    }
}

We provide two macros for registering benchmarks: benchmarks! and benchmarks_generic!. Each allows registering multiple benchmarks in a single macro invocation, without having to restate the benchmark name, like BENCHMARK.

For non-generic benchmarks, registering is simple:

// in `fn main`
benchmarks!(my_mod::foo, my_mod::bar);

For generic benchmarks, you must specify all the types you want to specialize the benchmarks for:

// in `fn main`
    benchmarks_generic!(
        f32, f64;
        vector_add::index,
        vector_add::index_slice,
        vector_add::index_unsafe,
        vector_add::zip,
        vector_add::zip_collect,
    );

Rendered as such:

----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
vector_add::index<f32>                    70705 ns      70547 ns       9356
vector_add::index<f64>                    75650 ns      75468 ns       9728
vector_add::index_slice<f32>              54973 ns      54827 ns      11872
vector_add::index_slice<f64>              58562 ns      58495 ns      11259
vector_add::index_unsafe<f32>             56875 ns      56718 ns      12077
vector_add::index_unsafe<f64>             69196 ns      68970 ns      10034
vector_add::zip<f32>                      21533 ns      21479 ns      32005
vector_add::zip<f64>                      45947 ns      45666 ns      16122
vector_add::zip_collect<f32>              22993 ns      22943 ns      29634
vector_add::zip_collect<f64>              44486 ns      44397 ns      15397
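The name capture these macros perform can be sketched in a few lines: stringify! turns the function path into the display name, so the name never has to be restated. The Registry type below is a stand-in I made up for illustration; the real macros would call into the library's registration API instead:

```rust
// Hypothetical sketch of how a `benchmarks!`-style macro registers a
// function without restating its name. `Registry` here just records
// names; the real macro would invoke the library's registration entry
// point through the C ABI.

struct Registry {
    names: Vec<String>,
}

impl Registry {
    fn new() -> Self {
        Registry { names: Vec::new() }
    }

    fn register(&mut self, name: &str, _func: fn()) {
        self.names.push(name.to_string());
    }
}

macro_rules! benchmarks {
    ($registry:expr; $($path:path),+ $(,)?) => {
        // `stringify!` renders the path as written, e.g. "my_mod::foo",
        // so the benchmark's display name matches its function path.
        $( $registry.register(stringify!($path), $path); )+
    };
}

mod my_mod {
    pub fn foo() {}
    pub fn bar() {}
}

fn main() {
    let mut registry = Registry::new();
    benchmarks!(registry; my_mod::foo, my_mod::bar);
    println!("{:?}", registry.names); // ["my_mod::foo", "my_mod::bar"]
}
```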

Setting options on benchmarks uses a fluent interface, similar to C++:

    benchmarks_generic!(
        f32, f64;
        csr::mat_vec_tridiag_rayon,
        csr::mat_vec_tridiag_rayon_chunked,
    )
    .range_multiplier(10)
    .range(100_000, 100_000_000)
    .use_real_time();
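A fluent interface like this typically works by having registration return a handle whose setters return &mut Self so calls can be chained, mirroring the C++ benchmark::Benchmark* pattern. A minimal sketch, with method names borrowed from the example above but a made-up struct whose fields are assumptions, not the binding's internals:

```rust
// Illustrative fluent options handle: each setter records the option
// and returns `&mut Self` so calls chain. Field names are assumptions.
#[derive(Debug, Default)]
pub struct BenchmarkOptions {
    range_multiplier: Option<i32>,
    range: Option<(i64, i64)>,
    use_real_time: bool,
}

impl BenchmarkOptions {
    pub fn range_multiplier(&mut self, m: i32) -> &mut Self {
        self.range_multiplier = Some(m);
        self
    }

    pub fn range(&mut self, lo: i64, hi: i64) -> &mut Self {
        self.range = Some((lo, hi));
        self
    }

    pub fn use_real_time(&mut self) -> &mut Self {
        self.use_real_time = true;
        self
    }
}

fn main() {
    let mut opts = BenchmarkOptions::default();
    opts.range_multiplier(10)
        .range(100_000, 100_000_000)
        .use_real_time();
    println!("{:?}", opts);
}
```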

Fortran Bindings

I'm guessing there will be less interest in this, but it may be useful to some.

I was having issues with constructing the command line arguments to pass to Google Benchmark directly from Fortran, so today the entry point is implemented in C++ and Fortran is expected to implement an entry point called RegisterBenchmarksMain:

module my_benchmarks
  use benchmark
  implicit none
contains
  subroutine RegisterBenchmarksMain() bind(C, name="RegisterBenchmarksMain")
    type(benchmark_t), pointer :: bench

    bench => benchmark_register("vector_add_idiomatic_omp", vector_add_idiomatic_omp)
    call bench%range_multiplier(10)
    call bench%range(100, 100000000)
    call bench%use_real_time
  end subroutine RegisterBenchmarksMain
end module my_benchmarks

We only provide a single benchmark_register routine. I haven't thought about how to make this more natural yet. I also don't love the use of pointer here - I was new to Fortran when I wrote this and would probably change it.

Rendered:

2019-03-05 14:16:59
Running ./ftn-bench/ftn-bench
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
------------------------------------------------------------------------------------
Benchmark                                             Time           CPU Iterations
------------------------------------------------------------------------------------
vector_add_idiomatic_omp/100/real_time            32977 ns      12489 ns      20413
vector_add_idiomatic_omp/1000/real_time           35360 ns      14234 ns      20329
vector_add_idiomatic_omp/10000/real_time          35635 ns      11754 ns      19610
vector_add_idiomatic_omp/100000/real_time        131073 ns      22574 ns       5454
vector_add_idiomatic_omp/1000000/real_time      1406534 ns      75481 ns        489
vector_add_idiomatic_omp/10000000/real_time    14943647 ns     704200 ns         45
vector_add_idiomatic_omp/100000000/real_time  837987651 ns      47000 ns          1

Overhead

There is currently some small overhead, as mentioned before, in the core loop due to the call to KeepRunning. I think this is fixable, at least for Rust, which supports cross-module inlining even without LTO. Here's what the overhead is on my machine (MacBook Pro 2017).

C++

Test

void baseline_keep_running(benchmark::State &state) {
    while (state.KeepRunning()) {
    }
}

BENCHMARK(baseline_keep_running);

void baseline_for(benchmark::State &state) {
    for (auto _ : state) {
    }
}

BENCHMARK(baseline_for);

Results

-------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-------------------------------------------------------------
baseline_keep_running          0 ns          0 ns 1000000000
baseline_for                   0 ns          0 ns 1000000000

Rust

Test

pub fn keep_running(mut state: State) {
    while state.keep_running() {}
}

Results

--------------------------------------------------------------
Benchmark                       Time           CPU Iterations
--------------------------------------------------------------
baseline::keep_running          2 ns          2 ns  303938170

Fortran

Test

  subroutine baseline_keep_running(state)
    type(benchmark_state_t), intent(inout) :: state

    do while (state%keep_running()); end do
  end subroutine baseline_keep_running

Results

-------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-------------------------------------------------------------
baseline_keep_running          3 ns          3 ns  229682906
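The Rust-side fix I have in mind would amortize the FFI cost by batching: keep a cached iteration counter on the Rust side and only cross the C boundary when a batch is exhausted, with #[inline] on the hot path so it compiles down to a decrement-and-branch in the caller even without LTO. A sketch under those assumptions (the C boundary is mocked here, and all names are illustrative):

```rust
// Sketch of amortizing per-iteration FFI overhead via batching.
pub struct State {
    cached_remaining: u64, // iterations left in the current batch
    batches_left: u64,
    batch_size: u64,
}

impl State {
    pub fn new(batches: u64, batch_size: u64) -> Self {
        State { cached_remaining: 0, batches_left: batches, batch_size }
    }

    // Cold path: stand-in for the call that would cross into C++ to
    // update timers and fetch the next batch of iterations.
    fn refill_batch(&mut self) -> bool {
        if self.batches_left == 0 {
            return false;
        }
        self.batches_left -= 1;
        self.cached_remaining = self.batch_size;
        true
    }

    #[inline]
    pub fn keep_running(&mut self) -> bool {
        // Hot path: pure Rust, no FFI, inlinable into the caller.
        if self.cached_remaining > 0 {
            self.cached_remaining -= 1;
            return true;
        }
        if self.refill_batch() {
            self.cached_remaining -= 1;
            return true;
        }
        false
    }
}

fn main() {
    let mut state = State::new(2, 100);
    let mut iterations = 0u64;
    while state.keep_running() {
        iterations += 1;
    }
    println!("{}", iterations); // prints 200: 2 batches of 100
}
```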

dmah42 commented 5 years ago

This is great. Restrictions and overhead aside, having a binding is better than not having one (and as you point out we can improve it over time if necessary).

nealepetrillo commented 4 years ago

I'd really like to see this mainlined to support our multi-language codes!

kc9jud commented 3 months ago

@AndrewGaspar Did this ever get open-sourced anywhere?

AndrewGaspar commented 3 months ago

No, sorry 😞