Closed natevm closed 2 months ago
For some context, here's the text I'm referencing: https://users.aalto.fi/~laines9/publications/ylitie2017hpg_paper.pdf
And then the relevant section is here:
So I'm trying to use that byte-to-FP32 path.
I recognize this isn't exactly a Slang bug. But perhaps if Slang were to emit a true-to-form uchar4-to-float4 conversion, the driver would pick that up correctly?
Your code seems to be doing float3(u8,u8,u8) instead of float3(uint8_t3(u8, u8, u8))?
In this case, you are calling float3(f32, f32, f32) with u8 scalars, and that leads to individual scalar casts.
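A minimal sketch of the difference (the scalar names here are hypothetical):

```slang
// Three u8 scalars; names are illustrative.
uint8_t r, g, b;

// Scalar constructor: each component is converted individually,
// i.e. three separate u8 -> f32 casts.
float3 perElement = float3(r, g, b);

// Vector cast: build the uint8_t3 first, then convert the whole
// vector, which can map to a single vector conversion.
float3 vectorCast = float3(uint8_t3(r, g, b));
```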
I see. I’ll try casting to a uint8_t3 first.
Do you know why the final disassembly first casts the uint8s back into full 32-bit uints before converting to float?
I do not. I'd guess it's because there isn't an opcode in NVIDIA's hardware to convert directly from u8 to float.
There is indeed an instruction to convert directly from u8 to float. The paper by Ylitie, who is at NVIDIA, explicitly points this out.
Then it is a driver compiler bug.
In the past, I remember working with Ashwin Lele to solve these sorts of bugs. Does he have a GitHub handle? Perhaps he’d be a good first point of contact.
Just to confirm your suspicion, if I first cast to uint8_t3, I do correctly see the cast from u8v3 to float3 now in Slang's SPIR-V:
So then the crux of the issue now is determining why the driver does u8->u32->float instead of u8->float:
If I read the above right, the OpUConvert in the disassembly goes from u8 to u32, and this u32 is then passed into OpConvertUToF.
In the SPIR-V spec, if I understand correctly, the input to OpConvertUToF only needs to be an integer type with the same number of components as the result type; it does not need to be 32 bits wide:
Hm, there might still be some slang bug...
In Slang, if I do float4 qlox4 = float4(uint8_t4(node.lox[i] >> byteShift));
then the Slang compiler also seems to incorrectly emit a u8v4-to-u32v4 conversion.
I’ll try to make some simpler reproducers.
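A reproducer could be as small as a compute entry point like this (buffer names are hypothetical; the uint8_t4 constructor mirrors the truncating splat-style cast above):

```slang
RWStructuredBuffer<uint> packedBytes;   // hypothetical input buffer
RWStructuredBuffer<float4> converted;   // hypothetical output buffer

[shader("compute")]
[numthreads(64, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID)
{
    // Truncating splat to uint8_t4, then a single vector cast to float4.
    // Expected SPIR-V: one OpConvertUToF on a v4 8-bit operand, with no
    // intervening widening OpUConvert to 32-bit.
    uint8_t4 q = uint8_t4(packedBytes[tid.x]);
    converted[tid.x] = float4(q);
}
```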
Some updates... I was able to use Slang's inline SPIR-V to force a direct u8v4-to-fp32v4 cast:
float4 OpConvertUToF(uint8_t4 u) {
    return spirv_asm {
        result : $$float4 = OpConvertUToF $u;
    };
}
With this, I also see the same direct cast in the disassembly. However, the performance is still not great. Perhaps the hardware has changed since this paper was published, and these days it's faster to do a uint32 -> float cast...
For now, closing, since it seems this isn't a priority.
In general, converting from integers to floats is expensive and requires the SFUs in an SM. But on NVIDIA hardware, it's very fast to convert from a uchar4 to a float4 directly. We use this hot path for fast ray-AABB slab testing, where slabs are stored in a quantized 1-byte format.
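As a sketch of what that decode step looks like (the function and parameter names are hypothetical, loosely following the paper's quantized node layout):

```slang
// One packed slab plane: four children's 1-byte lower bounds along
// one axis, already widened to a uint8_t4.
float4 decodeLowerPlane(uint8_t4 qlo, float nodeOrigin, float nodeExtent)
{
    // The uint8_t4 -> float4 cast below is the step that should map to
    // a single u8x4 -> f32x4 hardware conversion on NVIDIA GPUs.
    return nodeOrigin + float4(qlo) * nodeExtent;
}
```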
Despite using a direct cast from uint8_t4s to a float4 here:
I'm observing that this traversal code, written in Slang, is heavily bottlenecked. The red bar indicates excessive stalls in the special function unit pipe (XU) due to many full 32-bit uint-to-float conversions:
When using uint8_t4 in Slang, I would have thought it would leverage NVIDIA's hot path and go straight from a uint8_t4 to a float4, like it does in CUDA. I do see some direct uchar4-to-float4 conversion happening here from Slang to SPIR-V, but it seems Slang emits these conversions for uchar4s on an element-by-element basis:
From there, I think this must be an NVIDIA driver bug in the SPIR-V-to-SASS translation or something... all uchar4s are promoted to full uint32_t4s, and those then fall into the expensive uint-to-float path:
Usually BVH traversal is bottlenecked by memory bandwidth, but because of this issue, I'm actually bottlenecked by the SFU here...