Interlocked on floats (at least Max/Min)

Nielsbishere commented 1 year ago

Is your feature request related to a problem? Please describe. With compute shaders, atomics have become a big part of shader code, and the lack of float support can be quite annoying. I understand that atomic add on a float might be hard because it depends on hardware support, but min/max could be supported in software too. Atomic min/max is essentially possible by doing the following conversion before the min/max: u = asuint(f); u = (u >> 31) ? ~u : (u | (1u << 31)). Then do the atomics on u and reverse the conversion when reading back. Even though this is possible manually, it'd be great if the language supported this without floating-point trickery.
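For readers who want to check the conversion on the CPU, here is a minimal C++ sketch of it, not a definitive implementation. The names floatToOrdered, orderedToFloat and atomicMaxFloat are hypothetical; std::memcpy stands in for HLSL's asuint/asfloat, and a compare-exchange loop stands in for InterlockedMax on a uint. NaN inputs are assumed not to occur.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float to a uint32_t whose unsigned ordering matches the float ordering.
// Negative values: flip every bit. Non-negative values: set the sign bit.
inline uint32_t floatToOrdered(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);  // CPU stand-in for HLSL asuint(f)
    return (u >> 31) ? ~u : (u | (1u << 31));
}

// Inverse mapping, applied when reading the result back.
inline float orderedToFloat(uint32_t u) {
    // Top bit set: value was non-negative, so clear the bit again.
    // Top bit clear: value was negative, so flip every bit back.
    u = (u >> 31) ? (u & ~(1u << 31)) : ~u;
    float f;
    std::memcpy(&f, &u, sizeof f);  // CPU stand-in for HLSL asfloat(u)
    return f;
}

// Atomic max on the encoded value (hypothetical helper).
// In a shader this loop would be a single InterlockedMax on a uint.
inline void atomicMaxFloat(std::atomic<uint32_t>& slot, float value) {
    assert(!std::isnan(value));  // assumption: callers never pass NaN
    const uint32_t desired = floatToOrdered(value);
    uint32_t current = slot.load(std::memory_order_relaxed);
    // Retry while our value is still larger and the exchange fails;
    // a failed exchange reloads 'current' with the latest stored value.
    while (current < desired &&
           !slot.compare_exchange_weak(current, desired,
                                       std::memory_order_relaxed)) {
    }
}
```

The slot should start at the encoding of the identity for the operation, for example floatToOrdered(-INFINITY) for a max, and be decoded with orderedToFloat once all writers are done.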

Describe the solution you'd like Support for interlocked floats, most importantly min/max. Add would be nice too, but understandable if it can't be supported. Float atomics have existed in OpenGL GLSL with an NV specific extension.

Describe alternatives you've considered Doing it manually, but that might be an obstacle for new programmers.

Additional context N.A.

devshgraphicsprogramming commented 1 year ago

btw you can emit the SPIR-V Opcode you need directly via Inline SPIR-V already present in DXC

Nielsbishere commented 1 year ago

@devshgraphicsprogramming Yes, if you're using Vulkan you can enable VK_EXT_shader_atomic_float if supported (float min/max specifically comes with VK_EXT_shader_atomic_float2). DirectX12 to my knowledge doesn't support it (because DXIL doesn't). (OpenGL also supports it with an NV extension.)

As a sidenote: the solution I provided doesn't work with NaNs, because NaNs always return false when compared, while my hack treats NaNs as real numbers, so they'd get turned into something bigger than inf. So just don't throw in a NaN and it should be fine :). As for correct IEEE754 behavior for NaN: min is a = a < b ? a : b. With a NaN as 'a' that always returns b, and the same goes for max. If the NaN is passed as b instead, b (the NaN) is still what gets returned. So imo ignoring the NaN in the min/max is the IEEE754 conformant-ish way to do it (except if the min/max is only performed on NaNs). I saw @jeremyong liked this issue and also wrote a good blog post explaining this in further detail at https://www.jeremyong.com/graphics/2023/09/05/f32-interlocked-min-max-hlsl/. Just to note, this solution still works on halfs and doubles as well, but the sign check (shift) and mask should be changed to 15 or 63 respectively, of course. In fact it works with any IEEE754 compliant format that puts the sign bit, then the exponent and then the mantissa in that order. N = mantissa + exponent bits (15 = half, 31 = float, 63 = double). (Though uint64_t atomics would be needed for doubles, and the 1 in 1 << N should be cast to uint64_t before shifting too.)
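The generalisation to other widths can be sketched in C++ as follows. This is an illustration under assumptions: doubleToOrdered, orderedToDouble and halfBitsToOrdered are hypothetical names, and the half variant works on raw 16-bit patterns because standard C++ has no portable half type. The half case also makes the NaN caveat easy to see: a NaN with the sign bit clear encodes above +inf, and one with the sign bit set encodes below -inf.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// 64-bit variant: the sign bit sits at N = 63.
// The 1 is widened to uint64_t before the shift, as noted above.
inline uint64_t doubleToOrdered(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return (u >> 63) ? ~u : (u | (uint64_t{1} << 63));
}

// Inverse of doubleToOrdered.
inline double orderedToDouble(uint64_t u) {
    u = (u >> 63) ? (u & ~(uint64_t{1} << 63)) : ~u;
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// 16-bit variant on raw IEEE754 half bit patterns: the sign bit sits at N = 15.
// The casts drop the upper bits introduced by integer promotion.
inline uint16_t halfBitsToOrdered(uint16_t h) {
    return (h >> 15) ? static_cast<uint16_t>(~h)
                     : static_cast<uint16_t>(h | (1u << 15));
}
```

For example, the half patterns 0x7C00 (+inf) and 0x7E00 (a quiet NaN) encode to 0xFC00 and 0xFE00, so a max over encoded values would let that NaN win over +inf.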

devshgraphicsprogramming commented 1 year ago

just to clarify I'm not trivializing or against this being implemented in HLSL/DXC and added to DXIL, just giving pointers if you want to achieve this "today" in SPIR-V env.

devshgraphicsprogramming commented 1 year ago

> As a sidenote: the solution I provided doesn't work with NaNs, because NaNs always return false when compared, while my hack treats NaNs as real numbers, so they'd get turned into something bigger than inf. So just don't throw in a NaN and it should be fine :). As for correct IEEE754 behavior for NaN: min is a = a < b ? a : b. With a NaN as 'a' that always returns b, and the same goes for max. If the NaN is passed as b instead, b (the NaN) is still what gets returned. So imo ignoring the NaN in the min/max is the IEEE754 conformant-ish way to do it (except if the min/max is only performed on NaNs). I saw @jeremyong liked this issue and also wrote a good blog post explaining this in further detail at https://www.jeremyong.com/graphics/2023/09/05/f32-interlocked-min-max-hlsl/. Just to note, this solution still works on halfs and doubles as well, but the sign check (shift) and mask should be changed to 15 or 63 respectively, of course. In fact it works with any IEEE754 compliant format that puts the sign bit, then the exponent and then the mantissa in that order. N = mantissa + exponent bits (15 = half, 31 = float, 63 = double). (Though uint64_t atomics would be needed for doubles, and the 1 in 1 << N should be cast to uint64_t before shifting too.)

This is obviously a really cool technique.

As a sidenote to a sidenote, we can do all sorts of "atomic" operations with CAS loops, but for this and for the trick above it gets really nasty to maintain the code that does this without having any sort of reference type. Right now you can abuse the bug that #5377 will fix soon and have

template<typename T>
T myEsotericAtomic(inout T rval, in T operand);

and do whatever you like inside (a CAS loop, a call to an inline SPIR-V intrinsic), plus have this work with both groupshared and RWStructuredBuffer, and, if the new BDA ships before https://github.com/microsoft/DirectXShaderCompiler/issues/5377, even with the result of vk::BufferPointer::Get().

If #5377 ships before we're given true T& references or some sort of crutch, the only way to implement this will be with a macro, even if you use an accessor pseudo-lambda/functor-struct, simply because you'll have to copy and paste the code into the method definition a million times.
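To make the CAS-loop idea concrete, here is a rough C++ sketch of one, offered as an illustration rather than as what DXC or any driver does. casLoopFloat and loadFloat are hypothetical names; std::atomic's compare_exchange_weak plays the role that InterlockedCompareExchange on the float's bit pattern would play in HLSL, and a C++ reference plays the role of the missing T& discussed above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Read the float currently stored (as raw bits) in the slot.
inline float loadFloat(const std::atomic<uint32_t>& slot) {
    const uint32_t bits = slot.load(std::memory_order_relaxed);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Generic "atomic" read-modify-write on a float stored as raw bits.
// 'op' is any binary function of (old value, operand); the previous
// value is returned, like the original-value output of the Interlocked* family.
template <typename Op>
float casLoopFloat(std::atomic<uint32_t>& slot, float operand, Op op) {
    uint32_t oldBits = slot.load(std::memory_order_relaxed);
    for (;;) {
        float oldValue;
        std::memcpy(&oldValue, &oldBits, sizeof oldValue);
        const float newValue = op(oldValue, operand);
        uint32_t newBits;
        std::memcpy(&newBits, &newValue, sizeof newBits);
        // The exchange compares bit patterns, not float values, so it
        // succeeds only if nobody changed the slot since we read it.
        // On failure 'oldBits' is refreshed and the loop retries.
        if (slot.compare_exchange_weak(oldBits, newBits,
                                       std::memory_order_relaxed)) {
            return oldValue;
        }
    }
}
```

Passing a lambda such as [](float a, float b) { return a > b ? a : b; } gives a float max, and swapping in a + b gives a float add, which is the kind of reuse the templated myEsotericAtomic signature above is after.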

microsoft / hlsl-specs

Interlocked on floats (at least Max/Min) #29