Open boomshroom opened 10 months ago
I believe the way the constrained intrinsics are currently defined, the rounding mode passed to the constrained intrinsic must match the current global rounding mode. The intrinsics don't force the compiler to never read the global rounding mode.
Adding @andykaylor to confirm my understanding.
Why pray that they're the same when you have the option to force them to be the same? They're specified as UB if they aren't executed with the specified mode, so when given the option between a chance to not be UB and a guarantee to not be UB with no extra instructions, in what situation would the former be preferable?
Using a specific non-dynamic rounding mode which does not match the actual rounding mode at runtime results in undefined behavior.
It doesn't say the environment's rounding mode, and it's applied on the instruction rather than as a function attribute, so I'd assume it only cares about the operation's rounding mode.
If I recall correctly AVX512 only has static rounding for 512 bit vectors and not for 256 and 128 bit vectors. The compiler would need to turn all 128 and 256 operations into 512 bit in order to guarantee static rounding would always be used with AVX512. This could have performance implications since you're wasting ALU resources not doing useful work on the padded elements.
If we don't widen the 128 and 256 operations to 512 bits and guarantee to always use the static mode, then we have an inconsistent programming model. The programmer would be required to keep the global rounding mode accurate and you wouldn't save any instructions from using the static rounding mode. Or the programmer would have to know exactly how the compiler will implement any operation using the constrained intrinsics to know for sure the global mode wouldn't be used.
I think static rounding also prevents using the reg-mem form of arithmetic instructions for AVX512.
The rounding mode argument is intended to allow additional optimization when the compiler can prove the rounding mode, such as if it has just seen a call to fesetround() or if the STDC FENV_ROUND pragma has been used. The rounding mode can also be set to the default rounding mode if we're inlining a function that wasn't compiled with strictfp constraints, or if we're only using the constrained intrinsics to limit exception behavior.
In general, if the rounding mode argument to the constrained intrinsic is something other than dynamic and it doesn't match the dynamic rounding mode, it means either the user's code changed the rounding mode when it wasn't allowed to or the compiler did something wrong.
The main reason the constrained intrinsic rounding mode isn't defined as controlling the rounding mode is that it frees us from having to insert explicit instructions to set the rounding mode all over the place.
I think it's OK to generate instructions with a static rounding mode encoded if the constrained intrinsic isn't using dynamic rounding mode. We just shouldn't be required to do so.
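To make the "compiler can prove the rounding mode" case concrete, here is a minimal sketch of how a rounding argument could be inferred over a straight-line region. It is not LLVM's implementation: the event model, the type names, and `annotate_rounding` are all hypothetical, and it assumes that any opaque call may change the rounding mode. Only the metadata strings returned by `round_metadata` are the ones the constrained intrinsics accept.

```c
#include <assert.h>
#include <stddef.h>

/* Rounding argument a constrained intrinsic could carry (hypothetical enum). */
typedef enum {
    ARG_DYNAMIC, ARG_TONEAREST, ARG_UPWARD, ARG_DOWNWARD, ARG_TOWARDZERO
} RoundArg;

/* Hypothetical events in a straight-line strictfp region. */
typedef enum {
    EV_SET_ROUND,   /* fesetround() with a constant argument */
    EV_OPAQUE_CALL, /* call that might change the rounding mode */
    EV_FP_OP        /* floating-point operation to annotate */
} EventKind;

typedef struct {
    EventKind kind;
    RoundArg mode;  /* only meaningful for EV_SET_ROUND */
} Event;

/* Map a rounding argument to the metadata string used by the intrinsics. */
const char *round_metadata(RoundArg arg)
{
    switch (arg) {
    case ARG_TONEAREST:  return "round.tonearest";
    case ARG_UPWARD:     return "round.upward";
    case ARG_DOWNWARD:   return "round.downward";
    case ARG_TOWARDZERO: return "round.towardzero";
    default:             return "round.dynamic";
    }
}

/* out[i] receives the rounding mode that is provable right after event i.
 * For an EV_FP_OP entry, that is the argument the operation may carry. */
void annotate_rounding(const Event *ev, size_t n, RoundArg *out)
{
    assert(n == 0 || (ev != NULL && out != NULL));
    RoundArg known = ARG_DYNAMIC;       /* nothing is proven at region entry */
    for (size_t i = 0; i < n; i++) {
        if (ev[i].kind == EV_SET_ROUND)
            known = ev[i].mode;         /* just saw the mode being set */
        else if (ev[i].kind == EV_OPAQUE_CALL)
            known = ARG_DYNAMIC;        /* assumption: callee may change it */
        out[i] = known;
    }
}
```

In this model an operation after a constant `fesetround()` gets a specific mode, and everything else stays dynamic. The argument only records what is already true of the environment; nothing here sets the mode.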
For vector arguments, yes, static rounding modes are only supported for 512-bit vectors. However, static rounding modes are also supported for scalar operations. It's possible to generate scalar float operations with static rounding modes on x86 using intrinsics like _mm_div_round_ss. On RISC-V though, no such intrinsics exist for scalar operations with static rounding mode and the only option is inline assembly.
Personally, I think static rounding modes should've been the default from the start, but it's hard to change history like that. There is also the option of representing the instructions as having an explicit dependency on the environment and constant folding sets and resets in, but I'm not sure how complex that would be to implement.
I do plan on tackling this next year. It took far longer than I expected to get scalar bitmanip integer intrinsics approved through RISC-V International. That finally happened a couple weeks ago. I couldn't devote the time to having 2 different intrinsic proposals in flight at once so I put off scalar floating point intrinsics.
Oh! I'll be interested to see that. Good luck!
@llvm/issue-subscribers-backend-risc-v
Author: Angelo Bulfone (boomshroom)
@llvm/issue-subscribers-backend-x86
I think it depends heavily on the user scenario. It only benefits users who don't care about exceptions and who change the rounding mode frequently in their code. In that case, the compiler may save them the instructions that set the rounding mode. But in general it's hard to tell whether it's beneficial, even considering only scalar instructions, because we always prefer AVX2 instructions for their shorter encoding. Static rounding control is only supported in AVX512 instructions and may result in larger code size in the end.
Having static rounding mode intrinsics available is useful to support those architectures which have them (which I think is AVX-512, RISC-V, and pretty much every GPU architecture). For architectures which don't, it would be useful to have a pass which can lower to setting the control word and then try to minimize control word changes.
Given that C23 adds FENV_ROUND, and hardware trends seem to be moving towards supporting static rounding mode, I expect to see it become more common over the next decade or two, and it makes sense to support it. I'd certainly design a frontend to support static rounding modes in 2023 over something fesetround-like.
I was reading the C23 FENV_ROUND description, and it seems to be saying that the compiler should generate instructions to change the rounding mode as needed when the pragma applies, including switching back to the ambient setting for calls where it does not apply. Am I reading that correctly?
That would make having a robust pass to minimize rounding mode changes very important.
I'm speaking from the X86 backend's perspective: on architectures that have AVX-512, we still prefer AVX2 to AVX512 for instruction encoding length, see https://godbolt.org/z/cM11vxo5r. Imagine we have 100 FP intrinsics with the same rounding mode: no doubt it's better to generate one control instruction rather than 100 static rounding instructions. That said, code size must be considered for architectures that have static rounding modes. But that is not easy for a middle-end pass, especially since the EVEX to VEX transformation is unknown before RA.
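The code-size trade-off can be written down as simple arithmetic. The sketch below is only a back-of-the-envelope model, not anything the backend does: the function names are hypothetical, and the instruction lengths and the size of the mode-switching sequence are parameters supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for n_ops operations that each encode a static rounding mode. */
size_t static_rounding_bytes(size_t n_ops, size_t evex_len)
{
    return n_ops * evex_len;
}

/* Bytes for n_ops shorter operations plus one sequence that saves,
 * sets, and restores the rounding mode around them. */
size_t dynamic_rounding_bytes(size_t n_ops, size_t vex_len, size_t switch_bytes)
{
    return n_ops * vex_len + switch_bytes;
}

/* Smallest number of operations for which the static-rounding version
 * is strictly larger than the mode-switching version. */
size_t break_even_ops(size_t vex_len, size_t evex_len, size_t switch_bytes)
{
    assert(evex_len > vex_len);
    return switch_bytes / (evex_len - vex_len) + 1;
}
```

With illustrative values of 4 bytes per VEX instruction, 6 bytes per EVEX instruction, and a 20-byte switching sequence (all assumed, not measured), 100 operations cost 600 bytes with static rounding against 420 bytes with one mode switch, and static rounding starts losing at 11 operations. The picture changes when the mode differs between neighbouring operations, since each change then needs its own switching sequence.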
Yes, FENV_ROUND is supposed to restore the rounding mode for all calls except a select list of rounding mode-aware functions (mostly <math.h>, but also includes things like printf).
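As a rough illustration of why a pass that minimizes rounding mode changes matters here, the sketch below counts control-register writes for a straight-line sequence of FP operations and calls under two lowerings. It is a toy model with hypothetical names (`Item`, `naive_mode_writes`, `minimized_mode_writes`), and it assumes every call must see the ambient mode, ignoring the rounding mode-aware functions that are exempt.

```c
#include <assert.h>
#include <stddef.h>

/* RM_AMBIENT stands for the saved ambient mode. Its value is unknown at
 * compile time, so it is treated as different from every constant mode. */
typedef enum {
    RM_AMBIENT, RM_TONEAREST, RM_UPWARD, RM_DOWNWARD, RM_TOWARDZERO
} RoundMode;

typedef struct {
    int is_call;     /* nonzero: a call that must run in the ambient mode */
    RoundMode mode;  /* mode required by an FP op; ignored for calls */
} Item;

/* Naive lowering: set the mode before every FP op, restore it afterwards. */
size_t naive_mode_writes(const Item *items, size_t n)
{
    assert(n == 0 || items != NULL);
    size_t writes = 0;
    for (size_t i = 0; i < n; i++)
        if (!items[i].is_call)
            writes += 2;
    return writes;
}

/* Minimized lowering: track the mode currently in effect and write the
 * control register only when the next item needs a different one. */
size_t minimized_mode_writes(const Item *items, size_t n)
{
    assert(n == 0 || items != NULL);
    RoundMode current = RM_AMBIENT;
    size_t writes = 0;
    for (size_t i = 0; i < n; i++) {
        RoundMode needed = items[i].is_call ? RM_AMBIENT : items[i].mode;
        if (needed != current) {
            current = needed;
            writes++;
        }
    }
    if (current != RM_AMBIENT)
        writes++;    /* restore the ambient mode at region exit */
    return writes;
}
```

For the sequence upward op, upward op, call, upward op, downward op, the naive lowering needs 8 writes and the minimized one 5. A run of operations that all use the same mode and contains no calls drops to 2 writes regardless of its length.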
IEEE 754 section 4.1 describes floating point attributes with
LLVM's constrained floating point intrinsics are the only thing in LLVM to provide the constant attribute behavior without constant-folding from the environment manipulation intrinsics. Some modern ISA extensions also allow specifying the rounding mode statically within the instruction, both better matching the desired semantics of the constrained intrinsics and saving instructions setting and restoring the global floating point environment.
However, LLVM does not currently use them without target-specific intrinsics completely detached from the target-independent intrinsics which take the same information. For RISC-V, these intrinsics are paradoxically only available for vector operations, which don't have the option of a static rounding mode in the instructions, but there aren't any intrinsics for scalar operations which do have that option. X86 on the other hand does have intrinsics for both vector and scalar operations, but that seems to be because they're provided by the same extensions.
Currently, constrained intrinsics are lowered to strict opcodes with the rounding mode stripped completely before reaching the target-specific code generation, preventing the desired behavior.
Source (note that static rounding mode in AVX-512 only supports suppressing exceptions, hence specifying !"fpexcept.ignore"):
Output with LLVM 17.0.1:
On X86 with AVX-512F:
On RISC-V with F and hard-float ABI:
Expected output
On X86 with AVX-512F:
On RISC-V with F and hard-float ABI:
Godbolt link: https://godbolt.org/z/7fPoEoxs8