libsdl-org / SDL_shader_tools

Shader compiler and tools for SDLSL (Simple DirectMedia Layer Shader Language)
https://libsdl.org/
zlib License

Proposal: Consume Shaders in DXBC Format #17

Closed TheSpydog closed 1 month ago

TheSpydog commented 10 months ago

As someone who is very interested in SDL_gpu, both as a user and a contributor, I am concerned that writing an entirely custom shader language spec, compiler, bytecode format, and translation layer is more work than can be accomplished in a reasonable amount of time. Let alone developing the ecosystem around it — IDE plugins, RenderDoc support, tutorials, etc.

Additionally, the whole WebGPU/WGSL debacle pretty clearly demonstrated that most developers do not want Yet Another Shader Language, no matter how good it may be. This seems to be the main point of contention for many developers in regards to the SDL_gpu project.

In light of these problems, I have a somewhat drastic proposal that still keeps the “ship one shader binary, run everywhere” / “no big ugly c++ dependencies” spirit of SDL_gpu intact, while heavily reducing the maintenance workload and allowing us to take advantage of existing infrastructure.

I propose that instead of using a custom shader stack, we do the following:

  1. Use HLSL as the source language
  2. Compile to DXBC shader model 5/5.1
  3. At runtime, parse the DXBC and translate it to the target shader language
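
To make step 3 concrete, here is a purely illustrative sketch of the runtime flow; every name in it is hypothetical, since no such SDL_gpu API exists today:

// Hypothetical sketch only -- none of these names are real SDL_gpu API.
SDL_GpuShader *SDL_GpuCreateShader(SDL_GpuDevice *device,
                                   const void *dxbc, size_t dxbc_len)
{
    switch (SDL_GpuGetBackend(device)) {
    case SDL_GPU_BACKEND_D3D11:
    case SDL_GPU_BACKEND_D3D12:
        // DXBC is D3D's native format; pass it straight through.
        return CreateShaderFromDXBC(device, dxbc, dxbc_len);
    case SDL_GPU_BACKEND_VULKAN:
        // Parse the DXBC tokens and emit SPIR-V.
        return CreateShaderFromSPIRV(device, TranslateDXBCToSPIRV(dxbc, dxbc_len));
    case SDL_GPU_BACKEND_METAL:
        // Parse the DXBC tokens and emit MSL source.
        return CreateShaderFromMSL(device, TranslateDXBCToMSL(dxbc, dxbc_len));
    }
    return NULL;
}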

I’m sure you have a bunch of questions, so I’ve prepped some answers here:

Would we have to use FXC to compile our shaders? That’s not great for non-Windows users, and we can’t bundle that with SDL for dynamic source code compilation.

Not necessarily! As part of the VKD3D project, the Wine folks have written a FOSS d3dcompiler that we can use instead. The project is still relatively young, but it’s at least good enough for FNA’s purposes.

EDIT: Since writing this post, I learned that Clang is adding HLSL support with the intention of adding a DXBC backend in the future!

Why DXBC and not DXIL/SPIRV?

Unlike those intermediate formats, the DXBC spec is not ungodly huge, and it’s no longer changing. If we can translate the finite set of opcodes, we’re good to go.

In fact, there’s already a library that does that exact sort of thing — MojoShader! We can use it as a foundation (or at least, an inspiration) and build on its ideas rather than building something from first principles.

DXBC is officially deprecated. Is that going to be a problem?

Newer HLSL shader models (6.0+) are locked behind DXIL, but that’s totally fine for our purposes. SM5 contains everything we would realistically need (including compute shaders!). Unless we decide we need mesh shaders, or raytracing, or wave intrinsics, I don’t see anything we’d be missing out on.

The tooling for DXBC is definitely not going away either. Even though DX12 has been around for almost 10 years at this point, pretty much every PC game still ships with DX11. Especially since we have VKD3D to protect us from the threat of FXC bitrot, I think we will be in good shape for the foreseeable future.

Does DXBC provide any other advantages beyond reducing the development cost for SDL_gpu, allowing developers to use a familiar shader language, and leveraging the existing HLSL shader ecosystem?

Why yes, I’m glad you asked! There’s actually another huge advantage: DXBC is a real shader format that D3D11 and D3D12 can actually ingest! Meaning, we have a definitive ground truth to test from as we develop new graphics backends! If we’re ever in doubt about whether some shader behavior on Metal/Vulkan/whatever is a bug, we can check against the D3D11/12 implementation and verify. (If we want to be 100% sure we’re not witnessing driver bugs, we can check the software WARP implementation!)
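
For the WARP route, creating the software device is a single standard D3D11 call (sketch; error handling omitted):

#include <d3d11.h>

// Create a WARP (software rasterizer) device so shader behavior can be
// checked against Microsoft's reference-quality implementation.
ID3D11Device *device = nullptr;
ID3D11DeviceContext *context = nullptr;
HRESULT hr = D3D11CreateDevice(
    nullptr,                   // default adapter
    D3D_DRIVER_TYPE_WARP,      // software implementation, no driver bugs
    nullptr, 0,                // no software module, no creation flags
    nullptr, 0,                // default feature levels
    D3D11_SDK_VERSION,
    &device, nullptr, &context);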

Additionally, this means we could just not translate shaders on Windows! SDL can consume the shader binary and pass it through directly to D3D. Niiice! Of course, there’s nothing stopping us from translating back to HLSL/DXIL if we need to.
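
And the pass-through really is direct: the compiled DXBC blob is exactly what the standard D3D11 shader-creation calls expect (sketch, assuming the device and the shipped blob are already in hand):

// 'dxbc'/'dxbcSize' is the blob the developer shipped; no translation step.
ID3D11PixelShader *ps = nullptr;
HRESULT hr = device->CreatePixelShader(dxbc, dxbcSize, nullptr, &ps);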

The Shader Model 5.0 ISA contains 203 instructions. That’s still awfully complex, isn’t it?

It’s nothing to sneeze at for sure, but a lot of the instructions are variants of each other, or only used by hull/domain/geometry shaders, which I highly doubt we are going to support. I think it’s totally reasonable to start with the SM4 instructions (of which there are 102) since those are more broadly applicable, and then add SM5 instructions as needed.

We also have this parser from VKD3D we can use as a reference if needed.

What happens if a developer tries to use a shader with opcodes that we don’t currently translate?

To ensure that developers write and ship shaders that are compatible with the subset of the ISA that we support, we can easily write up a runtime validation checker for DEBUG builds, which would scan shader bytecode input for any unrecognized opcodes and spit out error information.
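
A minimal sketch of such a checker, assuming we have already located the tokenized shader body (the SHDR/SHEX chunk) inside the DXBC container. Per Microsoft's published d3d11TokenizedProgramFormat.hpp, each opcode token stores the opcode type in bits [10:0] and the instruction length in DWORDs in bits [30:24]; IsSupportedOpcode is a placeholder for our translation table:

#include <stdint.h>
#include <stdio.h>

// Scan a tokenized shader body for opcodes outside our supported set.
// 'tokens' points just past the 2-DWORD version/length header of the
// SHDR/SHEX chunk; 'count' is the number of DWORDs that follow it.
// NOTE: real code must also special-case CUSTOMDATA blocks (their length
// lives in the following DWORD) and extended opcode tokens (bit 31).
static bool ValidateOpcodes(const uint32_t *tokens, uint32_t count)
{
    bool ok = true;
    uint32_t i = 0;
    while (i < count) {
        const uint32_t tok = tokens[i];
        const uint32_t opcode = tok & 0x7FF;        // opcode type, bits [10:0]
        const uint32_t length = (tok >> 24) & 0x7F; // length in DWORDs, bits [30:24]
        if (!IsSupportedOpcode(opcode)) {           // our opcode whitelist
            fprintf(stderr, "unsupported DXBC opcode %u at token %u\n", opcode, i);
            ok = false;
        }
        if (length == 0)
            break; // malformed stream; stop scanning
        i += length;
    }
    return ok;
}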

Are there legal issues afoot with using a proprietary bytecode?

I sure hope not, because we’ve been shipping DXBC translators in games for many years!

So are you saying DXBC is perfect?

Nope! There are some clear drawbacks with this approach:

  1. We don’t control the spec, so we are at the mercy of however DXBC happens to work. If the bytecode does something inefficient/awkward/painful, there’s nothing we can do about it.
  2. FXC is a famously temperamental beast, and VKD3D’s HLSL frontend is still pretty immature. It’s more than likely we’d need to contribute patches to the compiler (which isn’t necessarily a negative, given that it would help improve the general gaming ecosystem :)).
  3. It’s a non-trivial spec to work with, with a couple hundred opcodes. Miles better than SPIRV and friends, but still…
  4. It is a dead end for future shader features. Imagine D3D13 comes out with a brand new kind of shader that everyone wants to use. We will never be able to support it with vanilla DXBC. (Theoretically we could fork VKD3D and add in our own custom bytecode, but that’s probably not a good idea.)

However, despite these issues, I still think DXBC is the best existing option we have, and it's worth considering before we dive full-force into writing our own entire stack.

darksylinc commented 3 months ago

I will voice against this for a few reasons:

FXC

FXC is a dead slow compiler due to bugs. Without workarounds it can take a minute to compile what other compilers finish in milliseconds. Of course, this can be fixed with the FOSS d3dcompiler, but it's worth considering.

"Pointers"

Barring some exceptions, most GPU data could be manipulated with pointers. Instead, HLSL/GLSL have StructuredBuffer/RWStructuredBuffer/imageStore/etc. Josh Barczak talks about this in Let's close the Buffer Zoo.

Sebastian Aaltonen wrote a test to compare the quadrillion different ways of loading data.

Metal fixed this by introducing C-like pointers with annotations:

device MyStruct *myStructSSBO,
constant MyStruct &myStructUBO,
constant MyStruct *myStructArrayUBO2

The keywords device and constant are new. In C++ they can be empty defines to make it build.
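
For example, a shared header might start with something along these lines (a sketch; __METAL_VERSION__ is the macro the Metal compiler predefines):

// When the same header is compiled as plain C++, the Metal address
// space qualifiers simply vanish.
#ifndef __METAL_VERSION__
#define device
#define constant
#endif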

It's not possible to "fully" fix the problem because at the low level the HW does have different ways of loading/storing data to memory. However, in many cases the language does not need to expose those details, especially when it comes to type conversion. HLSL makes it near impossible to load a structure from RAM that contains uint16/uint8.

For example, on GCN/RDNA HW there are only two ways of loading memory: typed and untyped. Definitely not ten.

DXBC would make this extremely difficult to support because it is built around SM 4.0 (not even 5.0), where the different data models were physically separate units in the HW.

From a language design point data loading/writing can be classified as:

  1. Data is loaded raw from memory. e.g. a uint8 gets loaded as a uint8, a float as a float.
  2. Data is cast at compilation time. e.g. we load a uint8 but the shader expects to receive a float in range [0; 1].
  3. Data is cast at runtime. e.g. we load data of unknown size by specifying an offset in "elements" and the shader expects to receive a float or uint.

About them:

  1. The first one is regular C++.
  2. The second one can be implemented with C++ plus [[annotations]] like float4 data [[uint8x4_unorm]] or uint8x4 data [[as(fp32)]], or with custom datatypes, e.g. sdl::uint8x4_unorm (see the sketch after this list).
  3. The third one is a texture load call and must be done the old way (i.e. loading by hand).
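
A sketch of how the three cases might read in a hypothetical C++-like shader language; none of this syntax exists today, and the annotation names and the loadAsFloat helper are invented purely for illustration:

// (1) Raw load: the type in memory is the type the shader sees.
device const uint8_t *flags;                    // a uint8 loads as a uint8

// (2) Compile-time cast: the storage type is annotated on the declaration,
//     and the shader receives the converted value.
device const float4 *colours [[uint8x4_unorm]]; // stored as 4x uint8, read as float4 in [0; 1]

// (3) Runtime cast: layout unknown at compile time, so we load by element
//     offset the old way, like a texture/buffer load call.
float f = loadAsFloat(rawBuffer, elementIndex); // invented helper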

Mixing shader & C/C++

Metal got this right. In macOS / iOS you can do this:

#ifndef MyFoo_defined
#define MyFoo_defined

struct MyFoo
{
   uint8_t mode;
   uint8x4_t colour;
   float4 colourFP32;
   // packed_float3 has a size of 12 bytes, float3 has a size & alignment of 16
   simd::packed_float3 pos;
};

#endif

And you can write #include "MyFoo.h" in both, and it will just work. This makes passing data between CPU & GPU extremely easy.
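
For instance (a sketch; buffer creation and binding omitted), the same header feeds both sides:

// C++ side (host): fill the struct and copy it into a GPU-visible buffer.
#include "MyFoo.h"
MyFoo foo = {};
foo.colourFP32 = {1, 0, 0, 1};
memcpy(bufferContents, &foo, sizeof(MyFoo)); // bufferContents from MTLBuffer.contents

// Metal side (shader): the exact same struct, no hand-mirrored layout.
#include "MyFoo.h"
fragment float4 psMain(constant MyFoo &foo [[buffer(0)]])
{
    return foo.colourFP32;
}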

HLSL can do that too... as long as you forbid lots of datatypes (basically anything that isn't int, uint and float).

Also, HLSL doesn't provide out-of-the-box classes so that C++ compilers immediately understand float4 / packed_float3 / uint32x4_t.

DXBC doesn't translate HW well

DXBC was developed assuming GPUs were VLIW. They no longer are.

This means DXBC prefers packing everything into a float4 "for efficiency", where in the HW it's usually the opposite, because modern HW only has uint32 and float datatypes (it may also have uint8/16/f16/f64, but let's not dwell on that for now).

It does not really have uint32x4 registers (all GPU registers are 32 bits nowadays), but it does have instructions to load 4 uint32 values into 4 registers in one go (which is why alignment still matters).

No lane support

DXBC would have to be extended to support cross-lane operations. Lane ops are an advanced feature and might be out of scope for SDL_GPU; however, there are a few basic ones (like ballot) that give a huge bang for the buck.

Cross lane operations are what allow Single Pass Downsampler to work. It can generate mipmaps for textures up to 4096x4096 in a single Compute Dispatch (that usually would need N synchronized dispatches, where N is the number of mips).

Other

> Additionally, the whole WebGPU/WGSL debacle pretty clearly demonstrated that most developers do not want Yet Another Shader Language, no matter how good it may be. This seems to be the main point of contention for many developers in regards to the SDL_gpu project.

No, the whole point of the WebGPU/WGSL debacle is that you don't make a source-level language the main (and only) way to feed shaders. OpenGL + GLSL went that way and it was a disaster.

Of course, nobody wants to learn yet another language; so if SDL can take existing (i.e. already written) shader code, or has something like Metal, which had roughly 90% compatibility with HLSL syntax and near-100% syntax compatibility with C++, that's great.

> As part of the VKD3D project, the Wine folks have written a FOSS d3dcompiler that we can use instead.

On that same note, the VKD3D project has code to translate SPIRV to DXIL via NIR (i.e. this is what Godot is doing). By making SPIRV native instead, you get all the modern tooling around SPIRV, and can target Vulkan natively (and sometimes OpenGL, when the SPIRV extension is present), and D3D12 at little cost (SPIRV & DXIL are very similar); but indeed D3D11 gets harder (especially if you use a feature that is not available in D3D11).

Additionally, SPIRV is quite easy to parse (the module is a stream of 4-byte words, and each instruction declares its opcode and word count in its first word), and it doesn't actually change: SPIRV has many versions, and once published, versions 1.0, 1.1, etc. are frozen.
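
A sketch of what that parse looks like, following the layout in the published SPIRV specification (a 5-word header, then instructions whose first word packs the word count in the high 16 bits and the opcode in the low 16 bits):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

// Walk every instruction in a SPIRV module. 'words' is the whole module
// as 32-bit words; 'count' is its length in words.
static void WalkSpirv(const uint32_t *words, size_t count)
{
    if (count < 5 || words[0] != 0x07230203) // SPIRV magic number
        return;
    size_t i = 5;                            // skip the 5-word header
    while (i < count) {
        const uint32_t first = words[i];
        const uint32_t opcode = first & 0xFFFFu;
        const uint32_t wordCount = first >> 16;
        printf("op %u (%u words)\n", opcode, wordCount);
        if (wordCount == 0)
            break;                           // malformed module; stop
        i += wordCount;
    }
}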

darksylinc commented 3 months ago

I forgot to mention something: It's not that I'm strongly opposed to DXBC.

It's that DXBC has a lot of flaws and they need to be exposed to make a good informed decision.

Ultimately it boils down to what SDL GPU is supposed to cover.

TheSpydog pointed out the good stuff. I'm pointing out the bad stuff.

flibitijibibo commented 1 month ago

We ended up solving the shader input issue in such a way that we're not terribly opinionated about the format anymore, so this doesn't need to be done.

(Also since then MS announced that Shader Model 7 is using SPIR-V, so maybe we won't have to think about this too much long-term!)