I'm going to argue against this proposal for a few reasons:
FXC is a dead slow compiler due to bugs. It needs workarounds, otherwise it can take a minute to compile what other compilers handle in milliseconds. Of course, this can be fixed with a FOSS d3dcompiler, but it's worth considering.
Barring some exceptions, most GPU data could be manipulated with pointers. HLSL/GLSL instead have RWStructuredBuffer/StructuredBuffer/imageStore/etc. Josh Barczak talks about this in "Let's Close the Buffer Zoo".
Sebastian Aaltonen wrote a test to compare the quadrillion different ways of loading data.
Metal fixed this by introducing C-like pointers with annotations:
```cpp
device MyStruct *myStructSSBO,
constant MyStruct &myStructUBO,
constant MyStruct *myStructArrayUBO2
```
The keywords `device` and `constant` are new. In C++ they can be empty defines to make it build.
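For example, here is a minimal sketch of that empty-defines trick, assuming the shared header is compiled either as Metal (which predefines `__METAL_VERSION__`) or as plain C++; the struct and function names are just illustrative:

```cpp
#ifndef __METAL_VERSION__
    // Plain C++: strip the Metal address-space keywords.
    #define device
    #define constant
#endif

struct MyStruct { float value; };

// The same declaration now parses as valid Metal and as valid C++.
void process(device MyStruct *myStructSSBO, constant MyStruct &myStructUBO);
```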
It's not possible to "fully" fix the problem because at the low level the HW does have different ways of loading/storing data to memory. However, in many cases the language does not need to expose those details, especially when it comes to type conversion. HLSL makes it near impossible to load a structure from RAM that contains uint16/uint8.
For example, on GCN/RDNA HW there are only two ways of loading memory: typed and untyped. There are definitely not 10.
DXBC would make this extremely difficult to support because it is built around SM 4.0 (not even 5.0), where the different data models were physically separate units in the HW.
From a language design point of view, data loading/writing can be classified in a few ways. The conversions can be expressed:
- with `[[annotations]]`, like `float4 data [[uint8x4_unorm]]` or `uint8x4 data [[as(fp32)]]`,
- or with custom datatypes, e.g. `sdl::uint8x4_unorm`.
Metal got this right. In macOS / iOS you can do this:
```cpp
#ifndef MyFoo_defined
#define MyFoo_defined

#include <simd/simd.h> // shared vector types (e.g. simd::packed_float3), usable from both C++/ObjC and Metal

struct MyFoo
{
    uint8_t mode;
    uint8x4_t colour;
    float4 colourFP32;
    // packed_float3 has a size of 12 bytes, float3 has a size & alignment of 16
    simd::packed_float3 pos;
};
#endif
```
And you can write `#include "MyFoo.h"` in both, and it will just work. This makes passing data around between CPU & GPU extremely easy.
HLSL can do that too... as long as you forbid lots of datatypes (basically anything that isn't int, uint and float).
Also, HLSL doesn't provide out-of-the-box classes so that C++ compilers immediately understand `float4` / `packed_float3` / `uint32x4_t`.
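To make that concrete, here is a minimal sketch of the kind of compatibility shim you end up writing by hand so a header shared with HLSL also compiles as C++. The type definitions and the `MyPerDrawData` struct are purely illustrative, and the sketch ignores cbuffer packing rules, which is exactly where hand-rolled shims tend to go wrong:

```cpp
#ifdef __cplusplus
    #include <cstdint>
    typedef uint32_t uint;                    // HLSL's uint
    struct float4 { float x, y, z, w; };      // HLSL's float4
    struct uint4  { uint32_t x, y, z, w; };   // HLSL's uint4
#endif

// With the aliases above, this struct can be #included from both sides:
struct MyPerDrawData
{
    float4 tint;
    uint4  indices;
};
```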
DXBC was developed assuming GPUs were VLIW. They no longer are.
This means DXBC prefers packing up everything into a float4 "for efficiency", where in the HW it's usually the opposite, because modern HW only has uint32 and float datatypes (they may also have uint8/16/f16/f64, but let's not dwell on that for now).
It does not really have uint32x4 registers (all GPU registers are 32 bits nowadays), but it does have instructions to load 4 uint32 into 4 registers in one go (which is why alignment still matters).
DXBC would have to be extended to support cross-lane operations. Lane ops are an advanced feature and might be out of the scope of SDL_GPU; however, there are a few basic ones (like ballot) that give a huge bang for the buck.
Cross lane operations are what allow Single Pass Downsampler to work. It can generate mipmaps for textures up to 4096x4096 in a single Compute Dispatch (that usually would need N synchronized dispatches, where N is the number of mips).
> Additionally, the whole WebGPU/WGSL debacle pretty clearly demonstrated that most developers do not want Yet Another Shader Language, no matter how good it may be. This seems to be the main point of contention for many developers in regards to the SDL_gpu project.
No, the whole point of the WebGPU/WGSL debacle is that you don't make a source-level language the main (and only) way to feed shaders. OpenGL + GLSL went that way and it was a disaster.
Of course, nobody wants to learn yet another language; so if SDL can take existing (i.e. already written) shader code, or offers something like Metal, which has roughly 90% syntax compatibility with HLSL and near 100% syntax compatibility with C++, that's great.
> As part of the VKD3D project, the Wine folks have written a FOSS d3dcompiler that we can use instead.
On that same note, Mesa has code to translate SPIRV to DXIL via NIR (this is what Godot is doing). By making SPIRV native instead, you get all the modern tooling around SPIRV, and can target Vulkan (and sometimes OpenGL, if it has the SPIRV extension) natively, and D3D12 with little cost (SPIRV & DXIL are very similar); but indeed D3D11 gets harder (especially if you use a feature that is not available in D3D11).
Additionally, SPIRV is quite easy to parse (everything is a stream of 32-bit words, and each instruction declares its own word count) and it doesn't actually change: SPIRV has many versions, and versions 1.0, 1.1, etc. are frozen once published.
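To illustrate how simple that encoding is, here is a small sketch (not a full parser) that walks a SPIR-V module and prints every opcode; the function and variable names are just illustrative:

```cpp
#include <cstdint>
#include <cstdio>

void DumpSpirvOpcodes(const uint32_t *words, size_t numWords)
{
    if (numWords < 5 || words[0] != 0x07230203u) // SPIR-V magic number
        return;

    size_t i = 5; // skip the header: magic, version, generator, bound, schema
    while (i < numWords) {
        const uint32_t wordCount = words[i] >> 16;     // instruction size in 32-bit words
        const uint32_t opcode    = words[i] & 0xFFFFu; // SpvOp enumerant
        if (wordCount == 0)
            break; // malformed module
        printf("opcode %u (%u words)\n", opcode, wordCount);
        i += wordCount;
    }
}
```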
I forgot to mention something: It's not that I'm strongly opposed to DXBC.
It's that DXBC has a lot of flaws and they need to be exposed to make a good informed decision.
Ultimately it boils down to what SDL GPU is supposed to cover.
TheSpydog pointed out the good stuff. I'm pointing out the bad stuff.
We ended up solving the shader input issue in such a way that we're not terribly opinionated about the format anymore, so this doesn't need to be done.
(Also since then MS announced that Shader Model 7 is using SPIR-V, so maybe we won't have to think about this too much long-term!)
As someone who is very interested in SDL_gpu, both as a user and a contributor, I am concerned that writing an entirely custom shader language spec, compiler, bytecode format, and translation layer is more work than can be accomplished in a reasonable amount of time. Let alone developing the ecosystem around it — IDE plugins, RenderDoc support, tutorials, etc.
Additionally, the whole WebGPU/WGSL debacle pretty clearly demonstrated that most developers do not want Yet Another Shader Language, no matter how good it may be. This seems to be the main point of contention for many developers in regards to the SDL_gpu project.
In light of these problems, I have a somewhat drastic proposal that still keeps the “ship one shader binary, run everywhere” / “no big ugly c++ dependencies” spirit of SDL_gpu intact, while heavily reducing the maintenance workload and allowing us to take advantage of existing infrastructure.
I propose that instead of using a custom shader stack, we standardize on DXBC (compiled offline from HLSL) as the portable shader binary format and translate it for the non-D3D backends.
I’m sure you have a bunch of questions, so I’ve prepped some answers here:
Would we have to use FXC to compile our shaders? That’s not great for non-Windows users, and we can’t bundle that with SDL for dynamic source code compilation.
Not necessarily! As part of the VKD3D project, the Wine folks have written a FOSS d3dcompiler that we can use instead. The project is still relatively young, but it’s at least good enough for FNA’s purposes.
EDIT: Since writing this post, I learned that Clang is adding HLSL support with the intention of adding a DXBC backend in the future!
Why DXBC and not DXIL/SPIRV?
Unlike those intermediate formats, the DXBC spec is not ungodly huge, and it’s no longer changing. If we can translate the finite set of opcodes, we’re good to go.
In fact, there’s already a library that does that exact sort of thing — MojoShader! We can use it as a foundation (or at least, an inspiration) and build on its ideas rather than building something from first principles.
DXBC is officially deprecated. Is that going to be a problem?
Newer HLSL shader models (6.0+) are locked behind DXIL, but that’s totally fine for our purposes. SM5 contains everything we would realistically need (including compute shaders!). Unless we decide we need mesh shaders, or raytracing, or wave intrinsics, I don’t see anything we’d be missing out on.
The tooling for DXBC is definitely not going away either. Even though DX12 has been around for almost 10 years at this point, pretty much every PC game still ships with DX11. Especially since we have VKD3D to protect us from the threat of FXC bitrot, I think we will be in good shape for the foreseeable future.
Does DXBC provide any other advantages beyond reducing the development cost for SDL_gpu, allowing developers to use a familiar shader language, and leveraging the existing HLSL shader ecosystem?
Why yes, I’m glad you asked! There’s actually another huge advantage: DXBC is a real shader format that D3D11 and D3D12 can actually ingest! Meaning, we have a definitive ground truth to test from as we develop new graphics backends! If we’re ever in doubt about whether some shader behavior on Metal/Vulkan/whatever is a bug, we can check against the D3D11/12 implementation and verify. (If we want to be 100% sure we’re not witnessing driver bugs, we can check the software WARP implementation!)
Additionally, this means we could just not translate shaders on Windows! SDL can consume the shader binary and pass it through directly to D3D. Niiice! Of course, there’s nothing stopping us from translating back to HLSL/DXIL if we need to.
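For what it's worth, here is a rough sketch of that pass-through path (and of spinning up the WARP device mentioned above for ground-truth checks). The function name and the lack of lifetime/error handling are purely illustrative; this is not SDL API:

```cpp
#include <d3d11.h>

// Feed the developer-supplied DXBC blob straight to D3D11; no translation step.
// Pass useWarp = true to get the software rasterizer as a reference implementation.
static ID3D11PixelShader *CreatePixelShaderFromDXBC(const void *dxbc, size_t dxbcSize,
                                                    bool useWarp)
{
    ID3D11Device *device = nullptr;
    ID3D11DeviceContext *context = nullptr;
    const D3D_DRIVER_TYPE driver = useWarp ? D3D_DRIVER_TYPE_WARP : D3D_DRIVER_TYPE_HARDWARE;
    if (FAILED(D3D11CreateDevice(nullptr, driver, nullptr, 0, nullptr, 0,
                                 D3D11_SDK_VERSION, &device, nullptr, &context))) {
        return nullptr;
    }

    ID3D11PixelShader *shader = nullptr;
    if (FAILED(device->CreatePixelShader(dxbc, dxbcSize, nullptr, &shader))) {
        shader = nullptr;
    }
    return shader; // device/context intentionally not released in this sketch
}
```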
The Shader Model 5.0 ISA contains 203 instructions. That’s still awfully complex, isn’t it?
It’s nothing to sneeze at for sure, but a lot of the instructions are variants of each other, or only used by hull/domain/geometry shaders, which I highly doubt we are going to support. I think it’s totally reasonable to start with the SM4 instructions (of which there are 102) since those are more broadly applicable, and then add SM5 instructions as needed.
We also have this parser from VKD3D we can use as a reference if needed.
What happens if a developer tries to use a shader with opcodes that we don’t currently translate?
To ensure that developers write and ship shaders that are compatible with the subset of the ISA that we support, we can easily write up a runtime validation checker for DEBUG builds, which would scan shader bytecode input for any unrecognized opcodes and spit out error information.
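Here is a rough sketch of what that DEBUG-build scan could look like, assuming the SM4/SM5 token stream (the SHDR/SHEX chunk body) has already been pulled out of the DXBC container. `IsOpcodeSupported` is a hypothetical lookup into whatever subset the translator implements, and custom-data blocks (e.g. immediate constant buffers) would need special-casing that is omitted here:

```cpp
#include <cstdint>
#include <cstdio>

static bool IsOpcodeSupported(uint32_t opcode)
{
    // Placeholder: a real checker would consult a table of the opcodes
    // the translator actually implements.
    return opcode != 0xFFFFu;
}

bool ValidateShaderOpcodes(const uint32_t *tokens, uint32_t numTokens)
{
    bool ok = true;
    uint32_t i = 2; // skip the version token and the length token
    while (i < numTokens) {
        const uint32_t opcode = tokens[i] & 0x7FFu;        // opcode type, bits [10:0]
        const uint32_t length = (tokens[i] >> 24) & 0x7Fu; // instruction length in DWORDs, bits [30:24]
        if (length == 0)
            break; // malformed stream; stop rather than loop forever
        if (!IsOpcodeSupported(opcode)) {
            fprintf(stderr, "Unsupported DXBC opcode %u at token %u\n", opcode, i);
            ok = false;
        }
        i += length;
    }
    return ok;
}
```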
Are there legal issues afoot with using a proprietary bytecode?
I sure hope not, because we’ve been shipping DXBC translators in games for many years!
So are you saying DXBC is perfect?
Nope! There are some clear drawbacks with this approach:
However, despite these issues, I still think DXBC is the best existing option we have, and it's worth considering before we dive full-force into writing our own entire stack.