shader-slang / slang

Making it easier to work with shaders
MIT License
2.1k stars 179 forks source link

WGSL: Missing capability for ByteAddressBuffer::Load #5221

Closed aleino-nv closed 21 hours ago

aleino-nv commented 1 week ago

This issue was split from https://github.com/shader-slang/slang/issues/5172.

I think this should be easy to support for WGSL.

Affected tests under tests/compute (currently disabled):

Output from slang-test:

...
tests/compute/buffer-type-splitting.slang(21): error 36107: entrypoint 'computeMain' does not support compilation target 'wgsl' with stage 'compute'
void computeMain(int3 dispatchThreadID : SV_DispatchThreadID)
     ^~~~~~~~~~~
(0): note: see using of '__byteAddressBufferLoad'
hlsl.meta.slang(4281): note: see definition of '__byteAddressBufferLoad'
T __byteAddressBufferLoad<T>(RWByteAddressBuffer buffer, int offset, int alignment);
  ^~~~~~~~~~~~~~~~~~~~~~~
hlsl.meta.slang(4280): note: see declaration of 'require'
[require(cpp_cuda_glsl_hlsl_metal_spirv, byteaddressbuffer_rw)]
 ^~~~~~~
...

Tasks:

aleino-nv commented 1 week ago

After adding the capability, this turns into a problem with emitting WGSL:

This is for buffer-type-splitting.

Generated WGSL:

@binding(1) @group(0) var s_a_0 : array<, i32(2)>;

@binding(2) @group(0) var s_b_0 : array<, i32(2)>;

@binding(0) @group(0) var<storage, read_write> outputBuffer_0 : array<i32>;

@compute
@workgroup_size(4, 1, 1)
fn computeMain(@builtin(global_invocation_id) dispatchThreadID_0 : vec3<u32>)
{
    var i_0 : i32 = vec3<i32>(dispatchThreadID_0).x;
    var _S1 : i32 = i_0 * i32(4);
    var _S2 : u32 = (s_a_0[i32(0)]).Load<u32 >(_S1);
    var _S3 : u32 = (s_a_0[i32(1)]).Load<u32 >(_S1);
    var _S4 : u32 = _S2 + _S3 * u32(16);
    var _S5 : u32 = (s_b_0[i32(0)]).Load<u32 >(_S1);
    var _S6 : u32 = _S4 + _S5 * u32(256);
    var _S7 : u32 = (s_b_0[i32(1)]).Load<u32 >(_S1);
    outputBuffer_0[i_0] = i32(_S6 + _S7 * u32(4096));
    return;
}

slang-test errors:

WGPU error: Error while parsing WGSL: :1:41 error: expected expression for type template argument list
@binding(1) @group(0) var s_a_0 : array<, i32(2)>;
                                        ^

:3:41 error: expected expression for type template argument list
@binding(2) @group(0) var s_b_0 : array<, i32(2)>;
                                        ^

:14:41 error: expected ';' for variable declaration
    var _S2 : u32 = (s_a_0[i32(0)]).Load<u32 >(_S1);
                                        ^

:15:41 error: expected ';' for variable declaration
    var _S3 : u32 = (s_a_0[i32(1)]).Load<u32 >(_S1);
                                        ^

:17:41 error: expected ';' for variable declaration
    var _S5 : u32 = (s_b_0[i32(0)]).Load<u32 >(_S1);
                                        ^

:19:41 error: expected ';' for variable declaration
    var _S7 : u32 = (s_b_0[i32(1)]).Load<u32 >(_S1);
                                        ^

 - While validating [ShaderModuleDescriptor]
 - While calling [Device].CreateShaderModule([ShaderModuleDescriptor]).
aleino-nv commented 1 week ago

Ok it may not be easy to support this after all.

There is no counterpart to RWByteAddressBuffer in WGSL, and there is no uint8 either, so array<u8> wouldn't work.

@csyonghe I guess we just leave this unsupported for WGSL?

csyonghe commented 1 week ago

Just use array.

HLSL doesn't have uint8 either. ByteAddressBuffer in HLSL is a uint buffer, just so that the index used to access it is always divided by 4.

aleino-nv commented 1 day ago

Just use array.

HLSL doesn't have uint8 either. ByteAddressBuffer in HLSL is a uint buffer, just so that the index used to access it is always divided by 4.

I'm looking at https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-byteaddressbuffer and I can't see any restriction saying that the index used to load from a ByteAddressBuffer should be divisible by 4. I mean if you have a byte address buffer with contents [0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff] and you load from index 2, don't you expect to get back 0xffff0000 (assuming little-endian), rather than 0x00000000?

csyonghe commented 1 day ago

See https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-byteaddressbuffer-load#parameters.

aleino-nv commented 1 day ago

See https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-object-byteaddressbuffer-load#parameters.

Wow, ok! That's weird. (Then why isn't ByteAddressBuffer just array?!)

Thanks, I can proceed now.

csyonghe commented 1 day ago

HLSL never had uint8 type and they need a way to represent something like a untyped buffer, so this is what you get...

aleino-nv commented 1 day ago

After some fixes buffer-type-splitting now generates

@binding(1) @group(0) var s_a_0 : array<array<u32>, i32(2)>;

@binding(2) @group(0) var s_b_0 : array<array<u32>, i32(2)>;

@binding(0) @group(0) var<storage, read_write> outputBuffer_0 : array<i32>;

@compute
@workgroup_size(4, 1, 1)
fn computeMain(@builtin(global_invocation_id) dispatchThreadID_0 : vec3<u32>)
{
    var i_0 : i32 = vec3<i32>(dispatchThreadID_0).x;
    var _S1 : i32 = i_0 * i32(4);
    var _S2 : u32 = s_a_0[i32(0)][(_S1)/4];
    var _S3 : u32 = s_a_0[i32(1)][(_S1)/4];
    var _S4 : u32 = _S2 + _S3 * u32(16);
    var _S5 : u32 = s_b_0[i32(0)][(_S1)/4];
    var _S6 : u32 = _S4 + _S5 * u32(256);
    var _S7 : u32 = s_b_0[i32(1)][(_S1)/4];
    outputBuffer_0[i_0] = i32(_S6 + _S7 * u32(4096));
    return;
}

Which still fails because of issue https://github.com/shader-slang/slang/issues/4987:

WGPU error: Error while parsing WGSL: :1:41 error: an array element type cannot contain a runtime-sized array
@binding(1) @group(0) var s_a_0 : array<array<u32>, i32(2)>;
                                        ^^^^^^^^^^
aleino-nv commented 1 day ago

byte-address-buffer succeeds though, so that can be enabled now

aleino-nv commented 17 hours ago

After some fixes buffer-type-splitting now generates

@binding(1) @group(0) var s_a_0 : array<array<u32>, i32(2)>;

@binding(2) @group(0) var s_b_0 : array<array<u32>, i32(2)>;

@binding(0) @group(0) var<storage, read_write> outputBuffer_0 : array<i32>;

@compute
@workgroup_size(4, 1, 1)
fn computeMain(@builtin(global_invocation_id) dispatchThreadID_0 : vec3<u32>)
{
    var i_0 : i32 = vec3<i32>(dispatchThreadID_0).x;
    var _S1 : i32 = i_0 * i32(4);
    var _S2 : u32 = s_a_0[i32(0)][(_S1)/4];
    var _S3 : u32 = s_a_0[i32(1)][(_S1)/4];
    var _S4 : u32 = _S2 + _S3 * u32(16);
    var _S5 : u32 = s_b_0[i32(0)][(_S1)/4];
    var _S6 : u32 = _S4 + _S5 * u32(256);
    var _S7 : u32 = s_b_0[i32(1)][(_S1)/4];
    outputBuffer_0[i_0] = i32(_S6 + _S7 * u32(4096));
    return;
}

Which still fails because of issue #4987:

WGPU error: Error while parsing WGSL: :1:41 error: an array element type cannot contain a runtime-sized array
@binding(1) @group(0) var s_a_0 : array<array<u32>, i32(2)>;
                                        ^^^^^^^^^^

@csyonghe Do we have some legalization pass that would flatten this case out to a set of two runtime-sized arrays?