martinus / nanobench

Simple, fast, accurate single-header microbenchmarking functionality for C++11/14/17/20
https://nanobench.ankerl.com
MIT License

Randomly unstable results on Alder Lake, Win 11 #71

Closed · Ok23 closed this 1 year ago

Ok23 commented 2 years ago

Results can differ hugely (by roughly a factor of 100) when running AVX2 code:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <immintrin.h>
#include <new>

using ankerl::nanobench::Bench;

int main(int argc, char ** argv)
{
    alignas(32) float res[8];
    // 32-byte-aligned buffers; contents are left uninitialized, as in the original report
    float * mem = static_cast<float *>(operator new(256, std::align_val_t(32)));
    float * mulmem = static_cast<float *>(operator new(256, std::align_val_t(32)));

    Bench().run("simd", [&]()
    {
        // load, multiply and store one 8-float AVX2 vector
        __m256 simdvec_ = _mm256_loadu_ps(mem);
        __m256 simdvecmul_ = _mm256_loadu_ps(mulmem);
        simdvec_ = _mm256_mul_ps(simdvec_, simdvecmul_);
        _mm256_storeu_ps(res, simdvec_);
    });

    operator delete(mem, std::align_val_t(32));
    operator delete(mulmem, std::align_val_t(32));
    return static_cast<int>(res[0]);
}

Sometimes when I rebuild it reports about 0.5 ns/op, and when I relaunch it reports about 29 ns/op. I think it is related to the Windows 11 thread scheduler and/or to my processor being an Alder Lake i5-12600K with E-cores.

Google Benchmark seems to give more consistent results, about 0.23 ns.

martinus commented 2 years ago

I'd up the minEpochTime, e.g. like so:

using namespace std::literals;
Bench().minEpochTime(1s).run(...)

see https://nanobench.ankerl.com/reference.html#classankerl_1_1nanobench_1_1Bench_1a15fc41385e77877d7568797fbadba5f9
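
For illustration, applied to the snippet above (assuming the same mem, mulmem and res buffers) it could look roughly like this:

#include <chrono>

using namespace std::literals;

// run each measurement epoch for at least one second, so short scheduler or
// frequency hiccups get averaged out instead of dominating a single epoch
Bench()
    .minEpochTime(1s)
    .run("simd", [&]()
    {
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(mem), _mm256_loadu_ps(mulmem));
        _mm256_storeu_ps(res, v);
    });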

Ok23 commented 2 years ago

Did not help. I measured which cores it runs on and did not find any relation between cores and results. Every time I execute the program the result is either 27 ns or 0.5 ns.

martinus commented 2 years ago

Maybe the compiler optimizes some of your code away? Try modifying the input in the loop, e.g. add a number each time, and make sure to keep each result, e.g. sum up the results and use doNotOptimizeAway.
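
A rough sketch of that idea (the buffer names and values here are illustrative, not the OP's exact code):

alignas(32) float in[8]    = {1, 2, 3, 4, 5, 6, 7, 8};
alignas(32) float scale[8] = {1, 1, 1, 1, 1, 1, 1, 1};
alignas(32) float out[8]   = {};

ankerl::nanobench::Bench().run("simd", [&]()
{
    in[0] += 1.0f;                                   // modify the input on every iteration
    __m256 a = _mm256_load_ps(in);
    __m256 b = _mm256_load_ps(scale);
    _mm256_store_ps(out, _mm256_mul_ps(a, b));
    ankerl::nanobench::doNotOptimizeAway(out[0]);    // keep the result observable
});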

Ok23 commented 2 years ago

It seems related to the AVX2 mul (_mm256_mul_ps) and add (_mm256_add_ps) instructions and not to compiler optimisations, because the results are completely different between runs of the same binary.

martinus commented 2 years ago

Just to be safe I'd do something like this: modify the input, make sure the output is not optimized away, and increase the minimum epoch time: https://godbolt.org/z/eEb78vP5n

Ok23 commented 2 years ago

> Just to be safe I'd do something like this: modify the input, make sure the output is not optimized away, and increase the minimum epoch time: godbolt.org/z/eEb78vP5n

No, it doesn't work, it still jumps between 5 ns and 30 ns. I already tried to block the optimizer.

Lectem commented 1 year ago

Most likely due to https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html. Google Benchmark probably runs for much longer, so the frequency change gets "hidden" in the accumulation of timings.
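
If that is the cause, one way to probe it (purely a sketch, reusing the mem, mulmem and res buffers from the first snippet; the warm-up length is an arbitrary guess) is to spin on AVX2 multiplies before measuring, so any frequency/license transition has already happened by the time nanobench starts timing:

// warm up the AVX units so a possible frequency transition occurs before timing
__m256 warm = _mm256_set1_ps(1.0f);
for (int i = 0; i < 10000000; ++i) {
    warm = _mm256_mul_ps(warm, _mm256_set1_ps(1.0000001f));
}
ankerl::nanobench::doNotOptimizeAway(_mm256_cvtss_f32(warm));

ankerl::nanobench::Bench()
    .minEpochTime(std::chrono::seconds{1})   // longer epochs also hide transient slowdowns
    .run("simd", [&]()
    {
        __m256 v = _mm256_mul_ps(_mm256_loadu_ps(mem), _mm256_loadu_ps(mulmem));
        _mm256_storeu_ps(res, v);
    });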