[Codegen][RISCV][X86] Emit float instructions with static rounding mode for constrained intrinsics.

boomshroom commented 10 months ago

IEEE 754 section 4.1 describes floating-point attributes as follows:

> An attribute is logically associated with a program block to modify its numerical and exception semantics. A user can specify a constant value for an attribute parameter.

LLVM's constrained floating point intrinsics are the only thing in LLVM to provide the constant attribute behavior without constant-folding from the environment manipulation intrinsics. Some modern ISA extensions also allow specifying the rounding mode statically within the instruction, both better matching the desired semantics of the constrained intrinsics and saving instructions setting and restoring the global floating point environment.

However, LLVM does not currently use them without target-specific intrinsics completely detached from the target-independent intrinsics which take the same information. For RISC-V, these intrinsics are paradoxically only available for vector operations, which don't have the option of a static rounding mode in the instructions, but there aren't any intrinsics for scalar operations which do have that option. X86 on the other hand does have intrinsics for both vector and scalar operations, but that seems to be because they're provided by the same extensions.

Currently, constrained intrinsics are lowered to strict opcodes with the rounding mode stripped completely before reaching the target-specific code generation, preventing the desired behavior.

Source (note that static rounding in AVX-512 always implies suppress-all-exceptions (SAE), hence specifying !"fpexcept.ignore"):

define float @div(float %x, float %y) {
entry:
  %div = call float @llvm.experimental.constrained.fdiv.f32(float %x, float %y, metadata !"round.downward", metadata !"fpexcept.ignore")
  ret float %div
}

declare float @llvm.experimental.constrained.fdiv.f32(float, float, metadata, metadata)

Output with LLVM 17.0.1:

On X86 with AVX-512F:

div:                                    # @div
        vdivss  xmm0, xmm0, xmm1
        ret

On RISC-V with F and hard-float ABI:

div:                                    # @div
        fdiv.s  fa0, fa0, fa1
        ret

Expected output:

On X86 with AVX-512F:

div:                                    # @div
        vdivss  xmm0, xmm0, xmm1, {rd-sae}
        ret

On RISC-V with F and hard-float ABI:

div:                                    # @div
        fdiv.s  fa0, fa0, fa1, rdn
        ret

Godbolt link: https://godbolt.org/z/7fPoEoxs8

topperc commented 10 months ago

I believe the way the constrained intrinsics are currently defined, the rounding mode passed to the constrained intrinsic must match the current global rounding mode. The intrinsics don't force the compiler to never read the global rounding mode.

Adding @andykaylor to confirm my understanding.

boomshroom commented 10 months ago

> I believe the way the constrained intrinsics are currently defined, the rounding mode passed to the constrained intrinsic must match the current global rounding mode. The intrinsics don't force the compiler to never read the global rounding mode.
>
> Adding @andykaylor to confirm my understanding.

Why pray that they're the same when you have the option to force them to be the same? They're specified as UB if they aren't executed with the specified mode, so when given the option between a chance to not be UB and a guarantee to not be UB with no extra instructions, in what situation would the former be preferable?

> Using a specific non-dynamic rounding mode which does not match the actual rounding mode at runtime results in undefined behavior.

It doesn't say the environment's rounding mode, and it's applied on the instruction rather than as a function attribute, so I'd assume it only cares about the operation's rounding mode.

topperc commented 10 months ago

> Why pray that they're the same when you have the option to force them to be the same? They're specified as UB if they aren't executed with the specified mode, so when given the option between a chance to not be UB and a guarantee to not be UB with no extra instructions, in what situation would the former be preferable?

If I recall correctly AVX512 only has static rounding for 512 bit vectors and not for 256 and 128 bit vectors. The compiler would need to turn all 128 and 256 operations into 512 bit in order to guarantee static rounding would always be used with AVX512. This could have performance implications since you're wasting ALU resources not doing useful work on the padded elements.

If we don't widen the 128 and 256 operations to 512 bits and guarantee to always use the static mode, then we have an inconsistent programming model. The programmer would be required to keep the global rounding mode accurate and you wouldn't save any instructions from using the static rounding mode. Or the programmer would have to know exactly how the compiler will implement any operation using the constrained intrinsics to know for sure the global mode wouldn't be used.

I think static rounding also prevents using the reg-mem form of arithmetic instructions for AVX512.

andykaylor commented 10 months ago

> I believe the way the constrained intrinsics are currently defined, the rounding mode passed to the constrained intrinsic must match the current global rounding mode. The intrinsics don't force the compiler to never read the global rounding mode.
>
> Adding @andykaylor to confirm my understanding.

The rounding mode argument is intended to allow additional optimization when the compiler can prove the rounding mode, such as if it has just seen a call to fesetround() or if the STDC FENV_ROUND pragma has been used. The rounding mode can also be set to the default rounding mode if we're inlining a function that wasn't compiled with strictfp constraints, or if we're only using the constrained intrinsics to limit exception behavior.

In general, if the rounding mode argument to the constrained intrinsic is something other than dynamic and it doesn't match the dynamic rounding mode, it means either the user's code changed the rounding mode when it wasn't allowed to or the compiler did something wrong.

The main reason the constrained intrinsic rounding mode isn't defined as controlling the rounding mode is that it frees us from having to insert explicit instructions to set the rounding mode all over the place.

I think it's OK to generate instructions with a static rounding mode encoded if the constrained intrinsic isn't using dynamic rounding mode. We just shouldn't be required to do so.
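As an illustration of that contract, here is a toy model in Python (not LLVM code). It assumes a simplified float32 that ignores overflow, subnormals, NaN and infinities; the names `round_to_f32`, `constrained_fdiv` and `use_static_encoding` are hypothetical and exist only for this sketch. The rounding-mode argument is treated as an assertion about the environment: a mismatch is modelled as an error standing in for undefined behavior, and when the assertion holds, lowering through the environment and lowering through a static encoding give the same result.

```python
from fractions import Fraction
import math


def round_to_f32(q: Fraction, mode: str) -> Fraction:
    """Round an exact rational to float32 precision (24 significant bits).

    Toy model: ignores overflow, subnormals, NaN and infinities.
    """
    if q == 0:
        return q
    a = abs(q)
    # Find e such that 2**e <= |q| < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    ulp = Fraction(2) ** (e - 23)  # spacing of float32 values in this binade
    scaled = q / ulp
    if mode == "downward":
        n = math.floor(scaled)
    elif mode == "upward":
        n = math.ceil(scaled)
    elif mode == "towardzero":
        n = math.trunc(scaled)
    elif mode == "tonearest":
        n = round(scaled)  # Fraction rounds ties to even
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return n * ulp


def constrained_fdiv(x, y, static_mode, env_mode, use_static_encoding):
    """Model of a constrained fdiv: static_mode asserts the mode, it does not set it."""
    if static_mode != "dynamic" and static_mode != env_mode:
        # Stand-in for undefined behavior in the real IR semantics.
        raise RuntimeError("rounding-mode argument does not match the environment")
    if use_static_encoding and static_mode != "dynamic":
        mode = static_mode  # e.g. an instruction with the mode encoded in it
    else:
        mode = env_mode     # e.g. an instruction that reads the control register
    return round_to_f32(Fraction(x) / Fraction(y), mode)


# Both lowerings agree whenever the assertion holds.
via_static = constrained_fdiv(1, 3, "downward", "downward", use_static_encoding=True)
via_env = constrained_fdiv(1, 3, "downward", "downward", use_static_encoding=False)
```

Under this model a backend is free to pick either lowering, which is the "allowed but not required" position described above.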

boomshroom commented 10 months ago

> If I recall correctly AVX512 only has static rounding for 512 bit vectors and not for 256 and 128 bit vectors. The compiler would need to turn all 128 and 256 operations into 512 bit in order to guarantee static rounding would always be used with AVX512. This could have performance implications since you're wasting ALU resources not doing useful work on the padded elements.

For vector arguments, yes, static rounding modes are only supported for 512-bit vectors. However, static rounding modes are also supported for scalar operations. It's possible to generate scalar float operations with static rounding modes on x86 using intrinsics like _mm_div_round_ss. On RISC-V though, no such intrinsics exist for scalar operations with static rounding mode and the only option is inline assembly.

Personally, I think static rounding modes should've been the default from the start, but it's hard to change history like that. There is also the option of representing the instructions as having an explicit dependency on the environment and constant folding sets and resets in, but I'm not sure how complex that would be to implement.

topperc commented 10 months ago

> On RISC-V though, no such intrinsics exist for scalar operations with static rounding mode and the only option is inline assembly.

I do plan on tackling this next year. It took far longer than I expected to get scalar bitmanip integer intrinsics approved through RISC-V International. That finally happened a couple weeks ago. I couldn't devote the time to having 2 different intrinsic proposals in flight at once so I put off scalar floating point intrinsics.

boomshroom commented 10 months ago

> I do plan on tackling this next year. It took far longer than I expected to get scalar bitmanip integer intrinsics approved through RISC-V International. That finally happened a couple weeks ago. I couldn't devote the time to having 2 different intrinsic proposals in flight at once so I put off scalar floating point intrinsics.

Oh! I'll be interested to see that. Good luck!

llvmbot commented 10 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Angelo Bulfone (boomshroom)


llvmbot commented 10 months ago

@llvm/issue-subscribers-backend-x86

Author: Angelo Bulfone (boomshroom)


phoebewang commented 10 months ago

I think this depends heavily on the user scenario. It is only beneficial to users who don't care about exceptions and who frequently change the rounding mode in their code. In that case, the compiler may save them the instructions that set the rounding mode. But in general it's hard to tell whether it's beneficial, even considering only scalar instructions, because we always prefer AVX2 instructions for their shorter encoding. Static rounding control is only supported in AVX-512 instructions and may ultimately result in larger code size.

jcranmer-intel commented 10 months ago

Having static rounding mode intrinsics available is useful to support those architectures which have them (which I think are AVX-512, RISC-V, and pretty much every GPU architecture). For architectures which don't, it would be useful to have a pass which can lower to setting the control word and then try to minimize control word changes.

Given that C23 adds FENV_ROUND, and hardware trends seem to be moving towards supporting static rounding mode, I expect to see it become more common over the next decade or two, and it makes sense to support it. I'd certainly design a frontend to support static rounding modes in 2023 over something fesetround-like.

andykaylor commented 10 months ago

I was reading the C23 FENV_ROUND description, and it seems to be saying that the compiler should generate instructions to change the rounding mode as needed when the pragma applies, including switching back to the ambient setting for calls where it does not apply. Am I reading that correctly?

That would make having a robust pass to minimize rounding mode changes very important.

phoebewang commented 10 months ago

I'm speaking from the X86 backend's perspective: even on architectures that have AVX-512, we still prefer AVX2 over AVX-512 for instruction encoding length; see https://godbolt.org/z/cM11vxo5r. Imagine we have 100 FP intrinsics with the same rounding mode: it is no doubt better to generate one control instruction than 100 static rounding instructions. That is to say, code size must be considered on architectures that have static rounding modes. But that is not easy for a middle-end pass, especially since the EVEX-to-VEX transformation is unknown before RA.

jcranmer-intel commented 10 months ago

> I was reading the C23 FENV_ROUND description, and it seems to be saying that the compiler should generate instructions to change the rounding mode as needed when the pragma applies, including switching back to the ambient setting for calls where it does not apply. Am I reading that correctly?

Yes, FENV_ROUND is supposed to restore the rounding mode for all calls except a select list of rounding mode-aware functions (mostly <math.h>, but also includes things like printf).
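A pass of the kind mentioned earlier in the thread could, very roughly, look like the following Python sketch. It is a toy over a straight-line list of operations, not an LLVM pass: `Op`, `insert_mode_switches` and the `set_rounding` pseudo-instruction are hypothetical names, and "ambient" stands in for whatever mode was saved on entry. Operations with `mode=None` model calls that must see the ambient mode; a real pass would also need control flow and the list of rounding-mode-aware functions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    name: str
    # Rounding mode the operation requires; None means it must run under
    # the ambient mode (for example, a call the pragma does not cover).
    mode: Optional[str] = None


def insert_mode_switches(ops: List[Op], ambient: str = "ambient") -> List[str]:
    """Emit a mode switch only where the required mode actually changes."""
    out: List[str] = []
    current = ambient
    for op in ops:
        wanted = ambient if op.mode is None else op.mode
        if wanted != current:
            out.append(f"set_rounding {wanted}")
            current = wanted
        out.append(op.name)
    if current != ambient:
        # Leave the block with the ambient mode restored.
        out.append(f"set_rounding {ambient}")
    return out


program = [
    Op("fdiv", "downward"),
    Op("fadd", "downward"),
    Op("call foo"),          # hypothetical callee outside the pragma's list
    Op("fmul", "downward"),
]
lowered = insert_mode_switches(program)
```

Here the two adjacent downward operations share one switch, while the call forces a restore and a re-set around it. On a target with static rounding encodings those switches could in principle be dropped, at the code-size cost discussed above.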

[Codegen][RISCV][X86] Emit float instructions with static rounding mode for constrained intrinsics. #75543