[BUG] DoNotOptimize unpredictable on ternary conditionals

Describe the bug

DoNotOptimize seems to behave unpredictably when its argument is a ternary conditional.

I'm trying to benchmark manually flushing denormals to zero, using a macro like this:

#include <cmath>
#include <cfloat>

#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))

This flushes any denormal float (i.e. one whose magnitude is below FLT_MIN, the smallest positive normal float) to 0, to avoid the performance penalty that denormal arithmetic incurs on x86.

This macro is used in the following benchmark:

static void
flush_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    benchmark::DoNotOptimize(mem = 0.999f * mem);
    benchmark::DoNotOptimize(FLUSHF(mem));
  }
}
BENCHMARK(flush_32);

When compiled with g++ at -O3, FLUSHF doesn't seem to take effect: the flush does not happen, and the benchmark runs much slower.

To see this in action, compile the following with g++ and clang++ at both -O2 and -O3 and compare the timings (you might need an x86 machine):

#include <benchmark/benchmark.h>
#include <cmath>
#include <cfloat>

#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))

static void
flush_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    benchmark::DoNotOptimize(mem = 0.999f * mem);
    benchmark::DoNotOptimize(FLUSHF(mem));
  }
}
BENCHMARK(flush_32);
BENCHMARK_MAIN();

$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         36.9 ns         36.9 ns     18957274
$ g++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         6.06 ns         6.05 ns    101055135
$ clang++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         5.30 ns         5.30 ns    102412118
$ clang++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_32         5.23 ns         5.23 ns    101888597

With g++ -O3 the benchmark runs roughly six times slower (36.9 ns versus 5 to 6 ns for the other builds), apparently because FLUSHF is partially optimized away.

Strangely, the flushing works fine with double. I also tried the original escape function from the video mentioned in the comment.

template <class Tp>
inline void escape(Tp& value)
{
  asm volatile("" : : "g"(value) : "memory");
}

and it works correctly. Try compiling the following to see the problem:

#include <benchmark/benchmark.h>
#include <cmath>
#include <cfloat>

#define FLUSH(x) ((x) = fabs(x)<DBL_MIN ? 0 : (x))
#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))

template <class Tp>
inline void escape(Tp& value)
{
  asm volatile("" : : "g"(value) : "memory");
}

static void
flush_64(benchmark::State& state)
{
  double mem = FLT_MIN;
  for (auto _ : state) {
    benchmark::DoNotOptimize(mem = 0.999 * mem);
    benchmark::DoNotOptimize(FLUSH(mem));
  }
}
BENCHMARK(flush_64);

static void
flush_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    benchmark::DoNotOptimize(mem = 0.999f * mem);
    benchmark::DoNotOptimize(FLUSHF(mem));
  }
}
BENCHMARK(flush_32);

static void
escape_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    benchmark::DoNotOptimize(mem = 0.999f * mem);
    escape(FLUSHF(mem));
  }
}
BENCHMARK(escape_32);

BENCHMARK_MAIN();

$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64        0.606 ns        0.605 ns    995985269
flush_32         37.5 ns         37.5 ns     18310441
escape_32       0.773 ns        0.772 ns    901474201
$ g++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.94 ns         5.94 ns    101705109
flush_32         6.12 ns         6.11 ns    115467957
escape_32        4.65 ns         4.64 ns    152872058
$ clang++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.24 ns         5.24 ns     99614101
flush_32         5.34 ns         5.34 ns    130798955
escape_32        5.35 ns         5.35 ns    129643413
$ clang++ -O2 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64         5.33 ns         5.24 ns     98349666
flush_32         5.38 ns         5.27 ns    132894039
escape_32        5.44 ns         5.44 ns    129066762

flush_32 under g++ -O3 is the clear outlier: 37.5 ns, against less than 1 ns for flush_64 and escape_32 in the same build.

Is there a way to fix this problem without manually introducing another non-portable function to escape the optimization?

System

Which OS, compiler, and compiler version are you using:

OS: Linux 5.12.12
Compiler and version: g++ 11.1.0, clang++ 12.0.0

Expected behavior

DoNotOptimize works predictably on ternary conditionals.

This is strange. After replacing every instance of DoNotOptimize with escape, flush_32 consistently runs about 0.3 ns slower than escape_32, no matter how many times I rerun it, even though the two functions are now exactly the same:

#include <benchmark/benchmark.h>
#include <cmath>
#include <cfloat>

#define FLUSH(x) ((x) = fabs(x)<DBL_MIN ? 0 : (x))
#define FLUSHF(x) ((x) = fabsf(x)<FLT_MIN ? 0 : (x))

template <class Tp>
inline void escape(Tp& value)
{
  asm volatile("" :: "g"(value) : "memory");
}

static void
flush_64(benchmark::State& state)
{
  double mem = FLT_MIN;
  for (auto _ : state) {
    escape(mem = 0.999 * mem);
    escape(FLUSH(mem));
  }
}
BENCHMARK(flush_64);

static void
flush_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    escape(mem = 0.999f * mem);
    escape(FLUSHF(mem));
  }
}
BENCHMARK(flush_32);

static void
escape_32(benchmark::State& state)
{
  float mem = FLT_MIN;
  for (auto _ : state) {
    escape(mem = 0.999f * mem);
    escape(FLUSHF(mem));
  }
}
BENCHMARK(escape_32);

BENCHMARK_MAIN();

$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64        0.693 ns        0.693 ns    858596482
flush_32         1.01 ns         1.01 ns    691820352
escape_32       0.689 ns        0.688 ns   1000000000

If we instead set the constraint to "r", which is what folly uses (this version also drops the "memory" clobber),

template <class Tp>
inline void escape(Tp& value)
{
  asm volatile("" :: "r"(value));
}

the execution times become the same:

$ g++ -O3 bug.cc -lbenchmark
$ ./a.out
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
flush_64        0.678 ns        0.677 ns    884857534
flush_32        0.678 ns        0.677 ns   1000000000
escape_32       0.678 ns        0.677 ns    994171725

google/benchmark issue #1188