Closed JackAKirk closed 1 year ago
@andykaylor, FYI.
I think that -fno-fast-math should undo the effects of -ffast-math. To the extent that it doesn't, I would say we have a bug. @zahiraam has done a lot of work trying to get this behavior correct in the clang driver, but it's complicated.
The denormal handling, in particular, is tricky. The -ffast-math flag was originally introduced by gcc, and we've attempted to match the gcc behavior. This is the first place that denormal handling gets complicated. When compiling host code for x86-based targets, if fast-math is enabled, gcc will implicitly link crtfastmath.o if it can be found. This object file contains a static initializer that sets the FTZ and DAZ flags in the MXCSR register. I don't know if gcc does something similar for other targets. We set those flags differently in icpx, but I believe clang follows the gcc behavior. That was fine until the "denormal-fp-math" function attribute was introduced. Now we have that attribute being potentially different for each function, but the linking of crtfastmath.o (or the icpx call in main to set FTZ/DAZ) has a global effect. Of course, this is specific to code executed as part of the host process.
For offload execution, I'm not sure how consistently this is being handled. Matt Arsenault has been driving the implementation of the "denormal-fp-math" attribute because he needs it for AMDGPU targets. He recently introduced "dynamic" as a possible setting for "denormal-fp-math" and soon afterwards, the x86-base OpenCL runtime adopted this for compiling their device libraries. My understanding of the "denormal-fp-math" attribute is that it is supposed to be descriptive. That is, it doesn't actually set the processor denormal handling state, but it describes what we believe it is. The intent is to allow the compiler to optimize code and generate instructions in a way that is consistent with the denormal mode assumed. The "dynamic" mode means that the compiler shouldn't make any assumption, allowing the processor's control flags to be changed at runtime.
I have no idea what SYCL devices are doing with this attribute or how else they might be determining denormal handling behavior.
OK thanks for the info.
One task is to introduce a means such that this macro is not set here when -fno-fast-math
:
https://github.com/intel/llvm/blob/64bd50820262ded6fbd32d63bd96d5fdbf6861ac/clang/lib/Frontend/InitPreprocessor.cpp#L942
I'm not sure of the best way to do this right now: LangOpts.FastMath corresponds to -ffast-math
but I don't see a LangOpts corresponding to -fno-fast-math
.
But even if you switch off the __FAST_MATH__
macro there are still the other issues that I listed above.
If -fno-fast-math
isn't preventing __FAST_MATH__
from being defined, we have a problem somewhere, probably in the driver. This may seem counter-intuitive, but the clang driver code intercepts the driver-level -ffast-math
option and only passes it through to the clang_cc1 process if all of the contributing components of fast-math are set after all the driver-level options are processed. That's handled here: https://github.com/intel/llvm/blob/a230a62f9b20bf690d2ae9f77e09587a1b7a655d/clang/lib/Driver/ToolChains/Clang.cpp#L3296
If even one fast-math semantic mode is turned off, the driver shouldn't be passing the -ffast-math option to the main compiler process. You can see that working for the host compiler here: https://godbolt.org/z/Mj39nborr
It may be that the SYCL invocation is going through a different code path and not getting the handling above. If so, that's a bug.
If
-fno-fast-math
isn't preventing__FAST_MATH__
from being defined, we have a problem somewhere, probably in the driver. This may seem counter-intuitive, but the clang driver code intercepts the driver-level-ffast-math
option and only passes it through to the clang_cc1 process if all of the contributing components of fast-math are set after all the driver-level options are processed. That's handled here:If even one fast-math semantic mode is turned off, the driver shouldn't be passing the -ffast-math option to the main compiler process. You can see that working for the host compiler here: https://godbolt.org/z/Mj39nborr
It may be that the SYCL invocation is going through a different code path and not getting the handling above. If so, that's a bug.
I checked that. Compiling with -fsycl will pass through the same control flow path. -ffast-math with any of the options listed at the mentioned line above will not set the -ffast-math flag; same for -fno-fast-math.
I've just checked again on sycl tip in intel/llvm, using your test:
> clang++ -fsycl -fno-fast-math -ffast-math test.cpp -o res
> ./res
__FAST_MATH__ is defined.
> clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fno-fast-math -ffast-math test.cpp -o res
> ./res
__FAST_MATH__ is defined.
Just want to check we are talking about the same thing? Are you instead seeing:
__FAST_MATH__ is not defined.
?
Thanks
Oh yes sorry I think I understand: I reproduce what you have here:
> clang++ -fsycl -ffast-math -fno-fast-math test.cpp -o res
> ./res
__FAST_MATH__ is not defined.
As I wrote above the order of the flags affects what happens. I think the issue is that with the release when -ffast-math
is set by default effectively the order is -fno-fast-math -ffast-math
I think right?
This ordering allowed me to reproduce the specific reported cuda backend icpx test failures with intel/llvm sycl branch.
I am seeing the same thing than you for this combination of options. What I tried is: -ffast-math + fmath-errno , the output is: FAST_MATH is not defined and -fno-fast-mat: the output is: FAST_MATH is not defined. I need to check that. I didn't realize that the release version of the compiler sets the -ffast-math by default. I don't see that with the debug version of the compiler ( icpx otoh set -ffast-math by default in the debug compiler).
So, the release version of intel/llvm compiler doesn't set the -ffast-math by default. These are the various outputs I get: $ ./clang t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ ./clang -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined.
$ ./clang -ffast-math -fmath-errno t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ ./clang -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ ./clang -ffast-math -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ ./clang -fno-fast-math -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined. $ This seems to be working as expected! No?
$ ./clang -fno-fast-math -ffast-math t1.cpp $ ./a.out FAST_MATH is defined. $
Yes intel/llvm (clang++) does not set fast-math by default but icpx does. I reproduced the exact test failures reported in icpx (e.g. for this test https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/DeviceLib/math_test_marray_vec.cpp that will compile with icpx -fno-fast-math ...
) with the above command using clang++.
Note that I did not explicitly check icpx -fno-fast-math ...
defines FAST_MATH
, but I think it does based on this behaviour. Would be good to confirm this though.
Point is that if this is the case then I think we need -fno-fast-math -ffast-math
to not define FAST_MATH because I think icpx will always set -ffast-math
first.
Not quite. $ icpx t1.cpp $ ./a.out __FAST_MATH__ is defined.
$ icpx -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined.
$ icpx -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ icpx -ffast-math -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.
$ icpx -fno-fast-math -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined. $
Not quite. $ icpx t1.cpp $ ./a.out FAST_MATH is defined.
$ icpx -ffast-math t1.cpp $ ./a.out FAST_MATH is defined.
$ icpx -fno-fast-math t1.cpp $ ./a.out FAST_MATH is not defined.
$ icpx -ffast-math -fno-fast-math t1.cpp $ ./a.out FAST_MATH is not defined.
$ icpx -fno-fast-math -ffast-math t1.cpp $ ./a.out FAST_MATH is defined. $
OK thanks. I'll go back and check icpx again.
Yes icpx also undefines __FAST_MATH__ for cuda backend with -fno-fast-math
. Also the XFAILS we had to set for these tests: https://github.com/intel/llvm/pull/9135 are now passing in the latest icpx. So whatever was the problem seems to now be fixed.
Note FYI I am still seeing that icpx is emitting slightly different compiler flags (compile with -###) with -fno-fast-math
compared to clang++ (with not flags). The difference is that the icpx -fno-fast-math
uses "-ffp-contract=fast" whereas clang++
uses "-ffp-contract=on" (whether or not compiling for cuda). This doesn't seem to be leading to any failures of the tests from #9135 in the cuda backend though (but users will perhaps not be expecting "-ffp-contract=fast" when compiling with -fno-fast-math
).
There are some tests where the fast-math flag only affects cuda and so were missed by #9135. I will open a PR to update these tests.
Yes icpx also undefines FAST_MATH for cuda backend with
-fno-fast-math
. Also the XFAILS we had to set for these tests: #9135 are now passing in the latest icpx. So whatever was the problem seems to now be fixed.Note FYI I am still seeing that icpx is emitting slightly different compiler flags (compile with -###) with
-fno-fast-math
compared to clang++ (with not flags). The difference is that theicpx -fno-fast-math
uses "-ffp-contract=fast" whereasclang++
uses "-ffp-contract=on" (whether or not compiling for cuda). This doesn't seem to be leading to any failures of the tests from #9135 in the cuda backend though (but users will perhaps not be expecting "-ffp-contract=fast" when compiling with-fno-fast-math
).There are some tests where the fast-math flag only affects cuda and so were missed by #9135. I will open a PR to update these tests.
Thanks. I will check on your note above.
I think there is a need for a means to provide users the ability to completely undo the effects of passing
-ffast-math
flag to the compiler if icpx is going to be using-ffast-math
by default. I'm not sure what the best solution to this is right now. I'm opening this issue to raise discussion.I think we need a solution to this for the following reasons.
First, note that recently a bunch of e2e tests were adjusted with
mathflags
which added the-fno-fast-math
flag (see this PR which undoes this for some tests: https://github.com/intel/llvm/pull/9723). I believe that this was to allow tests to pass when compiling with icpx which uses-ffast-math
always. A problem is that-fno-fast-math
does not undo-ffast-math
(see the relevant section here https://github.com/intel/llvm/blob/0267c1b237409fb5ffc28c0511a30153c29fe29f/clang/docs/UsersManual.rst#L1462) and this "fix" of adding-fno-fast-math
did not work in the nvptx64 backend for these tests (they still failed). I also find that there is at least one test that still fails for at least one opencl cpu device (scalar_math.cpp). I find that in both opencl cpu and nvptx64 backends, although you get different results when just passing-ffast-math
to the compiler (e.g. different fp contract behaviour) the effect of adding-fno-fast-math
in front of-ffast-math
to the output of-###
is you get an extra"-fdenormal-fp-math-f32=ieee,ieee"
. So-fno-fast-math
does not undo any of the math flags that are added by-ffast-math
.Since
-ffast-math
and-fno-fast-math
also don't commute, sometimes the precision sensitive tests that dont like-ffast-math
pass if I pass-fno-fast-math
before-ffast-math
, when they failed if the flags are passed in the opposite order.In either ordering case,In the-fno-fast-math -ffast-math
order, I find that theFAST_MATH
macro is defined, which means that-fno-fast-math
(when used in combination with-ffast-math
) results in sycl math functions calling sycl native math functions. I believe this means that icpx currently has no way of calling sycl math functions for the cases where corresponding sycl native math functions exist. I suppose this should probably be considered a bug, since technically this would mean icpx cannot implement this part of the sycl math function specification.