microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
10.14k stars 1.5k forks source link

`<random>`: `normal_distribution` is slower than Boost #1003

Open statementreply opened 4 years ago

statementreply commented 4 years ago

Describe the bug

A benchmark by Alexander Neumann (original issue reporter) showed that std::normal_distribution with std::mt19937_64 was 4 times slower than boost::normal_distribution with std::mt19937_64.

Additional context

Part of the performance deficiency is due to #1000.

Also tracked by DevCom-86909 and Microsoft-internal VSO-486661 / AB#486661.

StephanTLavavej commented 3 months ago

What does the perf look like now that we've merged #4740?

StephanTLavavej commented 3 weeks ago

I overhauled the benchmark (dropping unnecessary code and especially the non-deterministic seeding) and got fresh numbers now that we've merged #4740.

Click to expand new benchmark: ```diff diff --git a/benchmarks/CMakeLists.txt b/benchmarks/CMakeLists.txt index 31572a96..c082ae50 100644 --- a/benchmarks/CMakeLists.txt +++ b/benchmarks/CMakeLists.txt @@ -114,6 +114,7 @@ add_benchmark(iota src/iota.cpp) add_benchmark(locale_classic src/locale_classic.cpp) add_benchmark(minmax_element src/minmax_element.cpp) add_benchmark(mismatch src/mismatch.cpp) +add_benchmark(normal_distribution src/normal_distribution.cpp) add_benchmark(path_lexically_normal src/path_lexically_normal.cpp) add_benchmark(priority_queue_push_range src/priority_queue_push_range.cpp) add_benchmark(random_integer_generation src/random_integer_generation.cpp) ``` `benchmarks/src/normal_distribution.cpp`: ```cpp #include #include #pragma warning(push) #pragma warning(disable : 4244) // conversion from 'meow' to 'woof', possible loss of data #include #include #include #pragma warning(pop) template void BM_Generator(benchmark::State& state) { RandomGenerator gen; gen.discard(1'000'000); while (state.KeepRunning()) { benchmark::DoNotOptimize(gen()); } } template void BM_Distribution(benchmark::State& state) { RandomGenerator gen; gen.discard(1'000'000); Distribution dist(0.0, 1.0); while (state.KeepRunning()) { benchmark::DoNotOptimize(dist(gen)); } } namespace b_r = boost::random; BENCHMARK(BM_Generator); BENCHMARK(BM_Generator); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK(BM_Distribution>); BENCHMARK_MAIN(); ```
Click to expand build/run incantations: ``` D:\GitHub\STL>set _CL_=/I C:\Users\stl\Downloads\boost_1_86_0 D:\GitHub\STL>"C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Auxiliary\Build\vcvarsall.bat" x64 ********************************************************************** ** Visual Studio 2022 Developer Command Prompt v17.12.0-pre.2.1 ** Copyright (c) 2022 Microsoft Corporation ********************************************************************** [vcvarsall.bat] Environment initialized for: 'x64' D:\GitHub\STL>cmake --preset x64 && cmake --build --preset x64 [...] [1021/1021] Linking CXX static library out\lib\amd64\libcpmtd0.lib D:\GitHub\STL>cmake -B out\bench -S benchmarks -G Ninja -DSTL_BINARY_DIR=out\x64 && cmake --build out\bench [...] [100/100] Linking CXX executable benchmark-priority_queue_push_range.exe D:\GitHub\STL>out\bench\benchmark-normal_distribution.exe 2024-10-04T15:11:50-07:00 Running out\bench\benchmark-normal_distribution.exe Run on (32 X 3394 MHz CPU s) CPU Caches: L1 Data 32 KiB (x16) L1 Instruction 32 KiB (x16) L2 Unified 512 KiB (x16) L3 Unified 32768 KiB (x2) ------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------------------------------------------------- BM_Generator 4.08 ns 4.05 ns 165925926 BM_Generator 2.65 ns 2.62 ns 280000000 BM_Distribution> 13.1 ns 12.8 ns 56000000 BM_Distribution> 9.19 ns 9.21 ns 74666667 BM_Distribution> 5.81 ns 5.94 ns 100000000 BM_Distribution> 9.53 ns 9.52 ns 64000000 BM_Distribution> 12.6 ns 12.8 ns 56000000 BM_Distribution> 8.41 ns 8.37 ns 89600000 BM_Distribution> 5.23 ns 5.16 ns 112000000 BM_Distribution> 8.79 ns 8.89 ns 89600000 ```

I used VS 2022 17.12 Preview 2 on my 5950X. Table:

Benchmark Time
BM_Generator<std::mt19937_64> 4.08 ns
BM_Generator<b_r::mt19937_64> 2.65 ns
BM_Distribution<std::mt19937_64, std::normal_distribution<double>> 13.1 ns
BM_Distribution<std::mt19937_64, b_r::normal_distribution<double>> 9.19 ns
BM_Distribution<std::mt19937_64, std::uniform_real_distribution<double>> 5.81 ns
BM_Distribution<std::mt19937_64, b_r::uniform_real_distribution<double>> 9.53 ns
BM_Distribution<b_r::mt19937_64, std::normal_distribution<double>> 12.6 ns
BM_Distribution<b_r::mt19937_64, b_r::normal_distribution<double>> 8.41 ns
BM_Distribution<b_r::mt19937_64, std::uniform_real_distribution<double>> 5.23 ns
BM_Distribution<b_r::mt19937_64, b_r::uniform_real_distribution<double>> 8.79 ns

With std::mt19937_64 as the generator, Boost's normal_distribution is only 1.43x faster than ours.

And now our uniform_real_distribution is 1.64x faster than Boost's, so the new generate_canonical is indeed awesome.

I conclude that our underlying algorithm for normal_distribution is still suboptimal, but the generate_canonical improvement has substantially narrowed the overall perf gap. If we improved normal_distribution, we would likely outperform Boost, as uniform_real_distribution already does.