Closed tobiasgrosser closed 2 years ago
I believe this should set the right ftz/daz flags since fa7cd549d604bfd8f9dce5d649a19720cbc39cca
Depends on the target, but crtfastmath.o should set both DAZ and FTZ. E.g. x86-64.
NoSignedZeros is orthogonal to DAZ and FTZ. Except maybe for AMDGPU (I think, not sure) and other targets that have special flushing modes.
Also, Andy Kaylor just suggested an LLVM specific way to set these flags at runtime through compiler_rt.
Should FTZ also be set, e.g. when -fno-signed-zeros is enabled (e.g. via -ffast-math)?
The pmmintrin.h header for SSE3 (included in clang) has the macro _MM_SET_DENORMALS_ZERO_MODE that sets DAZ. It doesn't require any SSE3-functionality to do so, only _mm_getcsr and _mm_setcsr, which are part of basic SSE. This might be an alternative when crtfastmath is not available (e.g. on Mac OS X).
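For reference, a minimal sketch of that alternative. It assumes an x86 target with SSE and a CPU that supports DAZ. The helper names `enable_flush_modes` and `flush_modes_enabled` are made up for illustration; the macros come from xmmintrin.h and pmmintrin.h.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr, _mm_setcsr */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: turn on FTZ and DAZ in MXCSR for the calling thread.
   Both macros expand to a read-modify-write via _mm_getcsr/_mm_setcsr. */
static void enable_flush_modes(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

/* Hypothetical helper: returns nonzero if both modes are currently on. */
static int flush_modes_enabled(void) {
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON
        && _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}
```

Calling `enable_flush_modes()` at the top of `main` would be the manual equivalent of linking a startup object. MXCSR is per-thread state, so this only affects the thread that makes the call.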
r165240 makes clang link crtfastmath.o if it's available (only on Linux for now).
This is a neat trick. GCC links in crtfastmath.o (part of libgcc) which sets the necessary bits. We should do the same in the clang driver and provide crtfastmath.o with compiler-rt.
This was resolved by r165240 (aka 058666a8d02f5cd348150862a3401c9c4bd0b4d0) back in 2012, not sure why it wasn't closed then.
Extended Description
$ gcc jacobi_1d.DenormalsAreZero.c -O3
$ time ./a.out
real 0m20.164s

$ gcc jacobi_1d.DenormalsAreZero.c -O3 -ffast-math
$ time ./a.out
real 0m0.357s

$ clang jacobi_1d.DenormalsAreZero.c -O3
$ time ./a.out
real 0m36.660s

$ clang jacobi_1d.DenormalsAreZero.c -O3 -ffast-math
$ time ./a.out
real 0m36.431s
As can be seen, the gcc-produced binary is a lot faster than the clang one with -ffast-math (besides being a little faster in general). This is not caused by better optimizations, but by gcc linking a small function into the resulting binary that sets the DAZ flag in the MXCSR register at startup.
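A rough sketch of what such a startup function could look like, assuming x86 with SSE and GCC/clang attribute syntax. This is not the libgcc source: the function name is made up, and the real implementation may additionally check whether the CPU supports DAZ before setting it.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* MXCSR bit masks on x86: FTZ is bit 15, DAZ is bit 6. */
#define MXCSR_FTZ 0x8000u
#define MXCSR_DAZ 0x0040u

/* Hypothetical stand-in for the linked-in function: the constructor
   attribute makes it run before main(), so every binary that links
   this object starts with flushing enabled. */
static void __attribute__((constructor)) set_fast_math_sketch(void) {
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
}
```

Because the bits are set by linked-in code rather than by the optimizer, the speedup does not depend on how well the loop itself was compiled.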
From [1]:
"DAZ tells the CPU to force all Denormals to zero. A Denormal is a number that is so small that FPU can't renormalize it due to limited exponent ranges. They're just like normal numbers, but they take considerably longer to process. Note that not all processors support DAZ."
The test case happens to calculate a lot of these close-to-zero values. Hence, setting DAZ has a big impact.
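To make the effect on the values themselves concrete, here is a small sketch, assuming x86 with SSE floating-point math (the default on x86-64). The helper `times8_with_daz` is hypothetical and not part of the test case. It multiplies a denormal input by 8 with DAZ forced on or off, then restores the previous MXCSR.

```c
#include <float.h>      /* FLT_MIN: smallest positive normal float */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ 0x0040u  /* bit 6 of MXCSR */

/* Hypothetical helper: compute x * 8 with DAZ on or off, then restore
   the caller's MXCSR. The volatile temporaries keep the compiler from
   folding the multiplication or moving it across the MXCSR writes. */
static float times8_with_daz(float x, int daz_on) {
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(daz_on ? (saved | MXCSR_DAZ) : (saved & ~MXCSR_DAZ));
    volatile float in = x;
    volatile float out = in * 8.0f;
    _mm_setcsr(saved);
    return out;
}
```

With `FLT_MIN / 4.0f` as input, the call with DAZ off yields `2.0f * FLT_MIN`, while the call with DAZ on treats the input as zero and yields `0.0f`. That change in results is the price paid for the speedup.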
[1] http://softpixel.com/~cwright/programming/simd/sse.php