illwieckz opened 1 day ago
@andykaylor, here you said:

> In clang, we do apply fast-math to the intrinsics. There are some problems with that. […]

Does clang also apply some implicit optimizations to generated intrinsics as well (like `-fno-trapping-math`), optimizations that would not be used when not generating intrinsics?

By generated intrinsics I mean clang deciding to use `divss` on some `xmm` register when doing a single division on standalone floats, like the user writing `a = b / c;` without the user using any explicit function from `immintrin.h`.
This is something I found when tracking down #118152 on my end by fiddling with compiler flags: I was hoping that using `feenableexcept()` would help me catch some float computation errors that Clang 19 produces but Clang 18 doesn't, in the hope of pinpointing some sample of code that reproduces #118152. But I found this instead.
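For reference, the kind of thing I was hoping to do looks roughly like this (a minimal sketch, not the actual code from #118152):

```c
/* Minimal sketch (not the #118152 reproducer) of how I intended to use
 * feenableexcept(): unmask FP exceptions so a bad computation turns into
 * SIGFPE and can be caught in a debugger. */
#define _GNU_SOURCE             /* feenableexcept() is a glibc extension */
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    volatile float b = 1.0f, c = 0.0f;
    float a = b / c;            /* division by zero: should raise SIGFPE here */

    printf("a = %f\n", a);      /* only reached if the trap never fires */
    return 0;
}
```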
So @slipher pointed out to me that `-fno-trapping-math` is the default, and indeed I found this:
> `-f[no-]trapping-math`
>
> Control floating point exception behavior. `-fno-trapping-math` allows optimizations that assume that floating point operations cannot generate traps such as divide-by-zero, overflow and underflow.
>
> - The option `-ftrapping-math` behaves identically to `-ffp-exception-behavior=strict`.
> - The option `-fno-trapping-math` behaves identically to `-ffp-exception-behavior=ignore`. This is the default.
The quote is from: https://clang.llvm.org/docs/UsersManual.html
It means the default behavior makes `feenableexcept()` unusable on architectures featuring SSE.

But since it's a glibc function, maybe clang cannot do anything about it…
This code doesn't have FENV_ACCESS on for most of the flags you demonstrated, so there is no guarantee that `feenableexcept` works or that exceptions are observed. We REALLY need to have a warning if any of these functions are used without it enabled.
> @andykaylor, here you said:
>
> > In clang, we do apply fast-math to the intrinsics. There are some problems with that. […]
>
> Does clang also apply some implicit optimizations to generated intrinsics as well (like `-fno-trapping-math`), optimizations that would not be used when not generating intrinsics? By generated intrinsics I mean clang deciding to use `divss` on some `xmm` register when doing a single division on standalone floats, like the user writing `a = b / c;` without the user using any explicit function from `immintrin.h`.
In that comment I was referring to the target-specific intrinsics declared in `immintrin.h` and the related headers that it includes. The use of `divss` is something else. That's just an implementation detail determined by the backend.
The problem you're running into here is that by default clang does not allow access to the floating-point environment. Unless you have done something to enable access to the floating-point environment (such as `#pragma STDC FENV_ACCESS ON`, or compiling with `-ffp-model=strict` or `-ftrapping-math`), the optimizer will assume the default floating-point environment, meaning floating-point exceptions are masked and the default rounding mode is used.
This isn't really specific to SSE. The same assumptions are made for any target, but it may be exposed in different ways on different targets.
> It means the default behavior makes `feenableexcept()` unusable on architectures featuring SSE. But since it's a glibc function, maybe clang cannot do anything about it…
You need to use `#pragma STDC FENV_ACCESS ON` (or equivalent command-line flags) to use that function. Even in standard C, being in a non-default floating-point environment without setting that pragma is undefined behavior, and all compilers that I'm aware of retain the sense that you need to opt in to observing the floating-point environment.
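A minimal sketch of that opt-in (assuming glibc for `feenableexcept()`):

```c
/* Minimal sketch, assuming glibc for feenableexcept(). The pragma (or
 * -ffp-model=strict / -ftrapping-math on the command line) tells the compiler
 * the floating-point environment may be observed, so it may no longer assume
 * that floating-point operations cannot trap. */
#define _GNU_SOURCE
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

float divide(float b, float c)
{
    return b / c;   /* may legitimately raise an unmasked FP exception now */
}

int main(void)
{
    feenableexcept(FE_DIVBYZERO);    /* unmask divide-by-zero */
    return (int)divide(2.0f, 1.0f);  /* non-zero divisor: no trap expected */
}
```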
> This code doesn't have FENV_ACCESS on for most of the flags you demonstrated, so there is no guarantee that `feenableexcept` works or that exceptions are observed. We REALLY need to have a warning if any of these functions are used without it enabled.
Oh! That's good to know! I have seen this pragma used in some macOS example, but most Linux/BSD documentation seems not to mention it (example, example, example).
A warning would help a lot indeed.
I guess I got all the answers I needed.
Hi, using this sample code:

By compiling it with `-O2`, the compiled code raises some `division by zero` exception when executed:

My guess is that the SSE optimized code raises division by zero exceptions from unused fields of the xmm registers, even though we discard those results.
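To illustrate the kind of pattern I suspect (a hypothetical sketch, not the actual sample code), a division that is only wanted for some elements can end up executed for all SIMD lanes:

```c
/* Hypothetical sketch of the suspected mechanism (not the actual sample code).
 * The division is only wanted where den[i] != 0, but if the optimizer turns
 * the loop into packed SSE divisions (divps), the lanes where den[i] == 0 can
 * still go through the hardware divider and set FE_DIVBYZERO, even though
 * their results are discarded by the select. */
void safe_divide(float *out, const float *num, const float *den, int count)
{
    for (int i = 0; i < count; ++i)
        out[i] = (den[i] != 0.0f) ? num[i] / den[i] : 0.0f;
}
```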
The requirement that `-O2` needs either `-ftrapping-math` or `-ffast-math` is curious, since `-ffast-math` sets `-fno-trapping-math`.

Also, I thought `-ftrapping-math` was the default and that `-fno-trapping-math` is part of `-ffast-math`, but just using `-O2` behaves as if `-fno-trapping-math` is used, so it behaves as if part of `-ffast-math` was enabled anyway… But maybe `-ffast-math` just modifies other behaviors that make `-ftrapping-math` or `-fno-trapping-math` irrelevant.

Using the Godbolt compiler explorer (here Clang 19.1.0), it only works with `-O2 -ffast-math` (or deprecated `-Ofast`):

- `-O0`
- `-Os`
- `-Os -fno-trapping-math`
- `-Os -ftrapping-math`
- `-O1`
- `-O2`
- `-O2 -fno-trapping-math`
- `-O2 -ftrapping-math`
- `-ffast-math`
- `-O2 -ffast-math`
- `-O2 -ffast-math -ftrapping-math`
- `-O2 -ffast-math -fno-trapping-math`
- `-Ofast`
See: https://godbolt.org/z/395c8nMef
And (with `-ftrapping-math` added): https://godbolt.org/z/czbGTjs6E
And (with `-fno-trapping-math` added): https://godbolt.org/z/zYr5Pba3W

With 32-bit i686 I reproduce the bug when using SSE but get no bug when not using SSE:

- `-m32 -msse`
- `-m32 -mno-sse`
- `-Os`
- `-O1`
- `-O2`
- `-ffast-math`
- `-O2 -ffast-math`

With 32-bit i686 with SSE I get the same failure: https://godbolt.org/z/199eqhzGh
And with a 32-bit i686 build without SSE I get no failure: https://godbolt.org/z/sfd9TsTj4
On my end on Ubuntu 24.04 with amd64, I get the same results with clang 19.1.5.
On Ubuntu 24.04 with amd64 and different versions of the clang compiler, I only get it working with clang 13 and 14, every later version breaks it:

- `-O0`
- `-O1`
- `-O2`
- `-O2 -ftrapping-math`
- `-ffast-math`
- `-O2 -ffast-math`
On GCC 14.2 I get none of those issues (no false positive `division by zero` error is raised whatever compiler flags are used): https://godbolt.org/z/rxMWM68Mc

Note: Disabling SSE on amd64 just produces garbage computation (`1.0/1.0` gives `0.0`), but I don't know if it makes sense to disable SSE on amd64 at all. GCC produces the same garbage (`1.0/1.0` giving `0.0`). Though I'm surprised not to get a warning if that's not a legitimate thing to do, and I'm also surprised the generated code runs at all if it isn't. See: https://godbolt.org/z/77PYGTc5b (Clang) and https://godbolt.org/z/cvW4Er3Kd (GCC).