nextest-rs / nextest

A next-generation test runner for Rust.
https://nexte.st
Apache License 2.0

Support for running criterion benches as tests? #96

Closed jszwedko closed 1 year ago

jszwedko commented 2 years ago

Hi all!

I'm a big fan of what this project is doing.

I noticed while trying to integrate this into https://github.com/vectordotdev/vector that nextest fails to run test binaries built from criterion benchmarks, since those binaries don't support the same --format flag that normal test binaries do:

cargo nextest run --workspace --no-fail-fast --no-default-features --features "default metrics-benches codecs-benches language-benches remap-benches statistic-benches dnstap-benches benches"
    Finished test [unoptimized + debuginfo] target(s) in 1.31s
error: Found argument '--format' which wasn't expected, or isn't valid in this context

USAGE:
    limit-2c27c6bee8522ca1 --list

For more information try --help
Error:
   0: error building test list
   1: running ''/Users/jesse.szwedko/workspace/vector/target/debug/deps/limit-2c27c6bee8522ca1 --list --format terse'' failed
   2: command ["/Users/jesse.szwedko/workspace/vector/target/debug/deps/limit-2c27c6bee8522ca1", "--list", "--format", "terse"] exited with code 1

Backtrace omitted.
Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
make: *** [test] Error 1

Running --help on the test binary:

Criterion Benchmark 

USAGE:
    limit-2c27c6bee8522ca1 [FLAGS] [OPTIONS] [FILTER]

FLAGS:
    -h, --help       Prints help information
        --list       List all benchmarks
    -n, --noplot     Disable plot and HTML generation.
    -v, --verbose    Print additional statistical information.

OPTIONS:
    -b, --baseline <baseline>                        Compare to a named baseline.
    -c, --color <color>
            Configure coloring of output. always = always colorize output, never = never colorize output, auto =
            colorize output if output is a tty and compiled for unix. [default: auto]  [possible values: auto, always,
            never]
        --confidence-level <confidence-level>        Changes the default confidence level for this run. [default: 0.95]
        --load-baseline <load-baseline>              Load a previous baseline instead of sampling new data.
        --measurement-time <measurement-time>        Changes the default measurement time for this run. [default: 5]
        --noise-threshold <noise-threshold>          Changes the default noise threshold for this run. [default: 0.01]
        --nresamples <nresamples>
            Changes the default number of resamples for this run. [default: 100000]

        --output-format <output-format>
            Change the CLI output format. By default, Criterion.rs will use its own format. If output format is set to
            'bencher', Criterion.rs will print output in a format that resembles the 'bencher' crate. [default:
            criterion]  [possible values: criterion, bencher]
        --plotting-backend <plotting-backend>
            Set the plotting backend. By default, Criterion.rs will use the gnuplot backend if gnuplot is available, or
            the plotters backend if it isn't. [possible values: gnuplot, plotters]
        --profile-time <profile-time>
            Iterate each benchmark for approximately the given number of seconds, doing no analysis and without storing
            the results. Useful for running the benchmarks in a profiler.
        --sample-size <sample-size>                  Changes the default size of the sample for this run. [default: 100]
    -s, --save-baseline <save-baseline>              Save results under a named baseline. [default: base]
        --significance-level <significance-level>
            Changes the default significance level for this run. [default: 0.05]

        --warm-up-time <warm-up-time>                Changes the default warm up time for this run. [default: 3]

ARGS:
    <FILTER>    Skip benchmarks whose names do not contain FILTER.

This executable is a Criterion.rs benchmark.
See https://github.com/bheisler/criterion.rs for more details.

To enable debug output, define the environment variable CRITERION_DEBUG.
Criterion.rs will output more debug information and will save the gnuplot
scripts alongside the generated plots.

To test that the benchmarks work, run `cargo test --benches`

NOTE: If you see an 'unrecognized option' error using any of the options above, see:
https://bheisler.github.io/criterion.rs/book/faq.html

I was just curious to get your thoughts on handling this. Should I stick with plain `cargo test --benches` for that target for now and use nextest for the other targets?
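For context, criterion benchmarks opt out of the default libtest harness in Cargo.toml, which is why the binary above exposes criterion's own CLI instead of libtest's `--list`/`--format` flags. A sketch of the relevant declaration (target name illustrative):

```toml
# Cargo.toml: with harness = false, cargo does not link the libtest harness,
# so the produced bench binary only understands criterion's own flags.
[[bench]]
name = "limit"   # illustrative, matching the limit-* binary above
harness = false
```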

sunshowers commented 2 years ago

Thanks for reporting this! I think the easiest fix would be for criterion to support the `--format terse` option and output lines in the format nextest accepts: https://nexte.st/book/custom-test-harnesses.html

Note that following the standard Rust benchmark library, lines ending in ": bench" are ignored. We may want to add an option to run benchmarks as tests if that's a desired use case (is that what you want?)
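For reference, the protocol described there expects `--list --format terse` to print one entry per line with a `: test` suffix, along these lines (names are illustrative, not actual criterion output):

```
parse_small: test
parse_large: test
parse_huge: bench
```

Entries ending in `: bench`, like the last one, are the ones that are currently skipped.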

jszwedko commented 2 years ago

Thanks for the response! That makes sense; I missed that section in the book. I'll open an issue over on criterion to see if they'd be open to adding those flags.

> Note that following the standard Rust benchmark library, lines ending in ": bench" are ignored. We may want to add an option to run benchmarks as tests if that's a desired use case (is that what you want?)

Yeah, we've had cases where benches were accidentally broken, so we started running them as tests in CI. That seems to run just one benchmark iteration and lets us validate that no assertions fail (for criterion, at least). It would be great if nextest supported them, but I can just run `cargo test --benches -- --bench` separately for now.
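A sketch of that interim split, assuming the bench-only cargo features are left off the nextest invocation so the criterion binaries aren't selected:

```shell
# Regular test targets go through nextest (bench-only features disabled here
# so the criterion binaries are not picked up).
cargo nextest run --workspace --no-fail-fast
# Criterion benches are exercised separately so broken benches still fail CI.
cargo test --benches -- --bench
```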

sunshowers commented 2 years ago

There's also https://github.com/nextest-rs/nextest/issues/38 which might be useful.

sunshowers commented 2 years ago

I think we can support this with a --run-benches option, similar to --run-ignored today.

sunshowers commented 2 years ago

I looked at this a bit more, and criterion binaries built with cargo test don't appear to behave correctly with --list:

/home/rain/dev/cargo-guppy/target/debug/deps/package_graph-4455225dc7640272 --list
WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.

Gnuplot not found, using plotters backend
Testing make_package_graph
Success

Testing depends_on
Success

Testing depends_on_cache
Success

Testing into_ids
Success

Testing resolve_package_name
Success

Testing make_package_name_hashmap
Success

Testing make_cycles
Success

I think this will need to be fixed in criterion first -- marking this "help wanted" in case someone wants to pick up the work in criterion. See https://nexte.st/book/custom-test-harnesses for more documentation. (Note that we also accept lines ending with ": bench" and currently skip them, but we can choose not to do that in the future.)
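To make the gap concrete: instead of executing each benchmark, a nextest-compatible `--list --format terse` run of the binary above would only need to print name lines in the documented form, roughly:

```
make_package_graph: test
depends_on: test
depends_on_cache: test
into_ids: test
resolve_package_name: test
make_package_name_hashmap: test
make_cycles: test
```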

jszwedko commented 2 years ago

Thanks for following up, @sunshowers! I agree that https://github.com/bheisler/criterion.rs/issues/562 will need to be resolved first.

sunshowers commented 2 years ago

https://github.com/bheisler/criterion.rs/pull/602 addresses this.

sunshowers commented 1 year ago

Criterion 0.5, which came out a few days ago, supports nextest.
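A minimal sketch of picking this up, assuming the bench targets already declare `harness = false` and that your nextest version accepts cargo's usual `--benches` target-selection flag (check `cargo nextest run --help`):

```shell
# Bump criterion to 0.5+ in Cargo.toml (e.g. criterion = "0.5"), then include
# bench targets in the nextest run; per the discussion above, benchmarks are
# run as tests rather than as full measured benches.
cargo nextest run --benches
```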