shader-slang / slang

Making it easier to work with shaders
http://shader-slang.com
MIT License
2.16k stars 186 forks source link

Slang emits `OpUConvert` before `OpConvertUToF` when casting uchar4 to float4 #4978

Closed natevm closed 2 months ago

natevm commented 2 months ago

In general, converting from integers to floats is expensive, and requires using the SFUs in an SM. But with NVIDIA HW, it's very fast to convert from a uchar4 to a float4 directly. We use this hot-path for fast ray-AABB slab testing where slabs are represented in a quantized 1-byte format.

Despite using a direct cast from uint8_t4's here to a float4: image

I'm observing an excessive bottleneck of this traversal code written in Slang. The red bar is indicating excessive stalls in the special function unit pipe (XU) due to many full 32-bit uint to float conversions:

image

When using uint8_t4 in Slang, I would have thought that would leverage NVIDIA's hot path and go straight from a uint8_t4 to a float4 like it does in CUDA. I do see some direct uchar4 to float4 conversion happening here from Slang to SPIR-V, but it seems slang emits these for uchar4's on an element-by-element basis:

image

From there, I think this must be an NVIDIA driver bug in SPIR-V to SASS or something... But all uchar4's are promoted to full uint32_t4's, and then those fall into the expensive uint->float path:

image

Usually BVH traversal is bottlenecked by memory bandwidth, but because of this issue, I'm actually bottlenecked by the SFU here...

natevm commented 2 months ago

For some context, here's the text I'm referencing: https://users.aalto.fi/~laines9/publications/ylitie2017hpg_paper.pdf

And then the relevant section is here: image

So I'm trying to use that byte to FP32 path.

I recognize this isn't exactly a slang bug. But perhaps if slang were to emit a true-to-form uchar4 to float4 conversion, then the driver would pick that up correctly?

csyonghe commented 2 months ago

Your code seems to be doing float3(u8,u8,u8) instead of float3(uint8_t3(u8, u8, u8))?

csyonghe commented 2 months ago

In this case, you are calling float3(f32, f32, f32) with u8 scalars, and that is going to lead to individual casts.

natevm commented 2 months ago

I see. I’ll try out using the cast to a uint8_t3.

Do you know why the final disassembly first casts the uint8’s back into full 32-bit uints before doing the cast to float?

csyonghe commented 2 months ago

I do not. I guess it is likely because there isn't an opcode in nvidia's hardware to convert directly from u8 to float.

natevm commented 2 months ago

There indeed is an instruction to directly convert from u8 to float. The paper written by Ylitie, who is at nvidia, explicitly points this out.

csyonghe commented 2 months ago

Then it is a driver compiler bug.

natevm commented 2 months ago

In the past, I remember working with Ashwin Lele to solve some of these sorts of bugs. Does he have a GH handle? Perhaps he’d be a good first PoC

natevm commented 2 months ago

Just to confirm your suspicion, if I first cast to uint8_t3, I do correctly see the cast from u8v3 to float3 now in Slang's SPIR-V: image

So then the crux of the issue now is determining why the driver does u8->u32->float instead of u8->float: image

If I read the above right, the OpUConvert in the disassembly goes from u8->u32, and then this u32 is passed into the OpConvertUToF .

In the SPIR-V spec, iiuc, the input only needs to be an integer type of the same vector component width, but not necessarily the same byte-length: image

natevm commented 2 months ago

Hm, there might still be some slang bug...

In slang, if I do float4 qlox4 = float4(uint8_t4(node.lox[i] >> byteShift));

Then the slang compiler seems to also incorrectly emit a u8v4->u32v4 conversion. image

I’ll try to make some simpler reproducers

natevm commented 2 months ago

Some updates... I was able to use Slang's inline SPIRV to force a u8v4 to fp32v4 cast:

float4 OpConvertUToF(uint8_t4 u) {
    return spirv_asm {
        result : $$float4 = OpConvertUToF $u;
    };
}

With this, I also see the same direct cast occur in the disassembly. However, the performance is still not so good. Perhaps hardware has changed since this paper, and these days it's faster to do a uint32 -> float cast...

For now, closing, since it seems this isn't a priority.