intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.23k stars 735 forks source link

[SYCL][CLANG] Request a means to completely undo `-ffast-math` when it is used by default #9759

Closed JackAKirk closed 1 year ago

JackAKirk commented 1 year ago

I think there is a need for a means to provide users the ability to completely undo the effects of passing -ffast-math flag to the compiler if icpx is going to be using -ffast-math by default. I'm not sure what the best solution to this is right now. I'm opening this issue to raise discussion.

I think we need a solution to this for the following reasons.

First, note that recently a bunch of e2e tests were adjusted with mathflags which added the -fno-fast-math flag (see this PR which undoes this for some tests: https://github.com/intel/llvm/pull/9723). I believe that this was to allow tests to pass when compiling with icpx which uses -ffast-math always. A problem is that -fno-fast-math does not undo -ffast-math (see the relevant section here https://github.com/intel/llvm/blob/0267c1b237409fb5ffc28c0511a30153c29fe29f/clang/docs/UsersManual.rst#L1462) and this "fix" of adding -fno-fast-math did not work in the nvptx64 backend for these tests (they still failed). I also find that there is at least one test that still fails for at least one opencl cpu device (scalar_math.cpp). I find that in both opencl cpu and nvptx64 backends, although you get different results when just passing -ffast-math to the compiler (e.g. different fp contract behaviour) the effect of adding -fno-fast-math in front of -ffast-math to the output of -### is you get an extra "-fdenormal-fp-math-f32=ieee,ieee". So -fno-fast-math does not undo any of the math flags that are added by -ffast-math.

Since -ffast-math and -fno-fast-math also don't commute, sometimes the precision sensitive tests that dont like -ffast-math pass if I pass -fno-fast-math before -ffast-math, when they failed if the flags are passed in the opposite order. In either ordering case, In the -fno-fast-math -ffast-math order, I find that the FAST_MATH macro is defined, which means that -fno-fast-math (when used in combination with -ffast-math) results in sycl math functions calling sycl native math functions. I believe this means that icpx currently has no way of calling sycl math functions for the cases where corresponding sycl native math functions exist. I suppose this should probably be considered a bug, since technically this would mean icpx cannot implement this part of the sycl math function specification.

bader commented 1 year ago

@andykaylor, FYI.

andykaylor commented 1 year ago

I think that -fno-fast-math should undo the effects of -ffast-math. To the extent that it doesn't, I would say we have a bug. @zahiraam has done a lot of work trying to get this behavior correct in the clang driver, but it's complicated.

The denormal handling, in particular, is tricky. The -ffast-math flag was originally introduced by gcc, and we've attempted to match the gcc behavior. This is the first place that denormal handling gets complicated. When compiling host code for x86-based targets, if fast-math is enabled, gcc will implicitly link crtfastmath.o if it can be found. This object file contains a static initializer that sets the FTZ and DAZ flags in the MXCSR register. I don't know if gcc does something similar for other targets. We set those flags differently in icpx, but I believe clang follows the gcc behavior. That was fine until the "denormal-fp-math" function attribute was introduced. Now we have that attribute being potentially different for each function, but the linking of crtfastmath.o (or the icpx call in main to set FTZ/DAZ) has a global effect. Of course, this is specific to code executed as part of the host process.

For offload execution, I'm not sure how consistently this is being handled. Matt Arsenault has been driving the implementation of the "denormal-fp-math" attribute because he needs it for AMDGPU targets. He recently introduced "dynamic" as a possible setting for "denormal-fp-math" and soon afterwards, the x86-base OpenCL runtime adopted this for compiling their device libraries. My understanding of the "denormal-fp-math" attribute is that it is supposed to be descriptive. That is, it doesn't actually set the processor denormal handling state, but it describes what we believe it is. The intent is to allow the compiler to optimize code and generate instructions in a way that is consistent with the denormal mode assumed. The "dynamic" mode means that the compiler shouldn't make any assumption, allowing the processor's control flags to be changed at runtime.

I have no idea what SYCL devices are doing with this attribute or how else they might be determining denormal handling behavior.

JackAKirk commented 1 year ago

OK thanks for the info. One task is to introduce a means such that this macro is not set here when -fno-fast-math: https://github.com/intel/llvm/blob/64bd50820262ded6fbd32d63bd96d5fdbf6861ac/clang/lib/Frontend/InitPreprocessor.cpp#L942

I'm not sure of the best way to do this right now: LangOpts.FastMath corresponds to -ffast-math but I don't see a LangOpts corresponding to -fno-fast-math.

But even if you switch off the __FAST_MATH__ macro there are still the other issues that I listed above.

andykaylor commented 1 year ago

If -fno-fast-math isn't preventing __FAST_MATH__ from being defined, we have a problem somewhere, probably in the driver. This may seem counter-intuitive, but the clang driver code intercepts the driver-level -ffast-math option and only passes it through to the clang_cc1 process if all of the contributing components of fast-math are set after all the driver-level options are processed. That's handled here: https://github.com/intel/llvm/blob/a230a62f9b20bf690d2ae9f77e09587a1b7a655d/clang/lib/Driver/ToolChains/Clang.cpp#L3296

If even one fast-math semantic mode is turned off, the driver shouldn't be passing the -ffast-math option to the main compiler process. You can see that working for the host compiler here: https://godbolt.org/z/Mj39nborr

It may be that the SYCL invocation is going through a different code path and not getting the handling above. If so, that's a bug.

zahiraam commented 1 year ago

If -fno-fast-math isn't preventing __FAST_MATH__ from being defined, we have a problem somewhere, probably in the driver. This may seem counter-intuitive, but the clang driver code intercepts the driver-level -ffast-math option and only passes it through to the clang_cc1 process if all of the contributing components of fast-math are set after all the driver-level options are processed. That's handled here:

https://github.com/intel/llvm/blob/a230a62f9b20bf690d2ae9f77e09587a1b7a655d/clang/lib/Driver/ToolChains/Clang.cpp#L3296

If even one fast-math semantic mode is turned off, the driver shouldn't be passing the -ffast-math option to the main compiler process. You can see that working for the host compiler here: https://godbolt.org/z/Mj39nborr

It may be that the SYCL invocation is going through a different code path and not getting the handling above. If so, that's a bug.

I checked that. Compiling with -fsycl will pass through the same control flow path. -ffast-math with any of the options listed at the mentioned line above will not set the -ffast-math flag; same for -fno-fast-math.

JackAKirk commented 1 year ago

I've just checked again on sycl tip in intel/llvm, using your test:

> clang++ -fsycl -fno-fast-math -ffast-math test.cpp -o res
> ./res
__FAST_MATH__ is defined.
> clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fno-fast-math -ffast-math test.cpp -o res
> ./res
__FAST_MATH__ is defined.

Just want to check we are talking about the same thing? Are you instead seeing:

__FAST_MATH__ is not defined.

?

Thanks

JackAKirk commented 1 year ago

Oh yes sorry I think I understand: I reproduce what you have here:

> clang++ -fsycl -ffast-math -fno-fast-math test.cpp -o res
> ./res
__FAST_MATH__ is not defined.

As I wrote above the order of the flags affects what happens. I think the issue is that with the release when -ffast-math is set by default effectively the order is -fno-fast-math -ffast-math I think right? This ordering allowed me to reproduce the specific reported cuda backend icpx test failures with intel/llvm sycl branch.

zahiraam commented 1 year ago

I am seeing the same thing than you for this combination of options. What I tried is: -ffast-math + fmath-errno , the output is: FAST_MATH is not defined and -fno-fast-mat: the output is: FAST_MATH is not defined. I need to check that. I didn't realize that the release version of the compiler sets the -ffast-math by default. I don't see that with the debug version of the compiler ( icpx otoh set -ffast-math by default in the debug compiler).

zahiraam commented 1 year ago

So, the release version of intel/llvm compiler doesn't set the -ffast-math by default. These are the various outputs I get: $ ./clang t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ ./clang -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined.

$ ./clang -ffast-math -fmath-errno t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ ./clang -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ ./clang -ffast-math -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ ./clang -fno-fast-math -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined. $ This seems to be working as expected! No?

JackAKirk commented 1 year ago

$ ./clang -fno-fast-math -ffast-math t1.cpp $ ./a.out FAST_MATH is defined. $

Yes intel/llvm (clang++) does not set fast-math by default but icpx does. I reproduced the exact test failures reported in icpx (e.g. for this test https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/DeviceLib/math_test_marray_vec.cpp that will compile with icpx -fno-fast-math ...) with the above command using clang++.

Note that I did not explicitly check icpx -fno-fast-math ... defines FAST_MATH, but I think it does based on this behaviour. Would be good to confirm this though.

Point is that if this is the case then I think we need -fno-fast-math -ffast-math to not define FAST_MATH because I think icpx will always set -ffast-math first.

zahiraam commented 1 year ago

Not quite. $ icpx t1.cpp $ ./a.out __FAST_MATH__ is defined.

$ icpx -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined.

$ icpx -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ icpx -ffast-math -fno-fast-math t1.cpp $ ./a.out __FAST_MATH__ is not defined.

$ icpx -fno-fast-math -ffast-math t1.cpp $ ./a.out __FAST_MATH__ is defined. $

JackAKirk commented 1 year ago

Not quite. $ icpx t1.cpp $ ./a.out FAST_MATH is defined.

$ icpx -ffast-math t1.cpp $ ./a.out FAST_MATH is defined.

$ icpx -fno-fast-math t1.cpp $ ./a.out FAST_MATH is not defined.

$ icpx -ffast-math -fno-fast-math t1.cpp $ ./a.out FAST_MATH is not defined.

$ icpx -fno-fast-math -ffast-math t1.cpp $ ./a.out FAST_MATH is defined. $

OK thanks. I'll go back and check icpx again.

zahiraam commented 1 year ago

http://cmplrexplorer.intel.com/z/jMGa8M

JackAKirk commented 1 year ago

Yes icpx also undefines __FAST_MATH__ for cuda backend with -fno-fast-math. Also the XFAILS we had to set for these tests: https://github.com/intel/llvm/pull/9135 are now passing in the latest icpx. So whatever was the problem seems to now be fixed.

Note FYI I am still seeing that icpx is emitting slightly different compiler flags (compile with -###) with -fno-fast-math compared to clang++ (with not flags). The difference is that the icpx -fno-fast-math uses "-ffp-contract=fast" whereas clang++ uses "-ffp-contract=on" (whether or not compiling for cuda). This doesn't seem to be leading to any failures of the tests from #9135 in the cuda backend though (but users will perhaps not be expecting "-ffp-contract=fast" when compiling with -fno-fast-math).

There are some tests where the fast-math flag only affects cuda and so were missed by #9135. I will open a PR to update these tests.

zahiraam commented 1 year ago

Yes icpx also undefines FAST_MATH for cuda backend with -fno-fast-math. Also the XFAILS we had to set for these tests: #9135 are now passing in the latest icpx. So whatever was the problem seems to now be fixed.

Note FYI I am still seeing that icpx is emitting slightly different compiler flags (compile with -###) with -fno-fast-math compared to clang++ (with not flags). The difference is that the icpx -fno-fast-math uses "-ffp-contract=fast" whereas clang++ uses "-ffp-contract=on" (whether or not compiling for cuda). This doesn't seem to be leading to any failures of the tests from #9135 in the cuda backend though (but users will perhaps not be expecting "-ffp-contract=fast" when compiling with -fno-fast-math).

There are some tests where the fast-math flag only affects cuda and so were missed by #9135. I will open a PR to update these tests.

Thanks. I will check on your note above.