microsoft / hlsl-specs

HLSL Specifications

[Feature] Compile VS/PS simultaneously to support interpolant stripping #331

Open jeremyong opened 2 years ago

jeremyong commented 2 years ago

Currently, shaders of different frequencies are fed to the compiler one at a time. This is true not just for vertex and pixel shaders, but other shaders that feed data between stages. The issue is that when compiled in isolation, the compiler has no knowledge of what inter-stage data will or will not be used when the shader module is ultimately linked in the final PSO. To manage this currently, ISVs need to juggle the VS output/PS input data in code-gen or with preprocessor magic.

VS and PS being the common case, it would be great to have an option of compiling both simultaneously, thereby giving the compiler knowledge about what inter-stage data is needed so it can strip unused entries. The interface to DXC becomes a bit more complicated because you conceivably need to namespace all DXC CLI flags/options by the frequency in question. I suggest we have a "separator" option to group flags intended for only one stage vs another. For example:

dxc.exe -I common/include/path \
    -BeginVS -I vs/only -DVSDefinition -EVSMain -Tvs_6_6 VS.hlsl -EndVS \
    -BeginPS -EPSMain -Tps_6_6 PS.hlsl -EndPS

or something to this effect. The mechanism for retrieving outputs would also need to change since the objects (DXC_OUT, pdb, reflection, etc.) are now different for each frequency in one atomic compilation. I would suggest a GetOutput1 method on an IDxcResult1 which accepts an index for the shader to retrieve data from.
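To make the requested stripping concrete, here is a minimal C++ sketch of the matching step, assuming the compiler already knows which inputs the pixel shader reads after its own dead-code elimination. This is not DXC code: `Interpolant`, `StripUnusedOutputs`, and the rule that system values are always kept are hypothetical names and assumptions used only for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one inter-stage value, keyed by its HLSL semantic.
struct Interpolant {
    std::string semantic;  // e.g. "TANGENT", "TEXCOORD0"
    bool system_value;     // assumption: SV_* outputs (SV_Position) are never stripped
};

// Returns the VS outputs that survive stripping: system values plus anything
// the pixel shader actually reads. Declaration order is preserved.
std::vector<Interpolant> StripUnusedOutputs(
    const std::vector<Interpolant>& vs_outputs,
    const std::set<std::string>& ps_reads) {
    std::vector<Interpolant> kept;
    std::copy_if(vs_outputs.begin(), vs_outputs.end(), std::back_inserter(kept),
                 [&](const Interpolant& v) {
                     return v.system_value || ps_reads.count(v.semantic) != 0;
                 });
    return kept;
}
```

In this model, a VS that writes `SV_Position`, `TEXCOORD0`, and `TANGENT`, paired with a PS that only reads `TEXCOORD0`, keeps the first two and drops `TANGENT`. A real implementation would also have to remove the VS code that computes the dropped values and keep the two signatures compatible, which is why it needs both stages in view at once.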

TheRealMJP commented 2 years ago

I wanted to +1 this, since I think it could potentially be really useful. The case that Jeremy mentioned (stripping out VS outputs that are unused by the PS) tends to be very awkward right now with existing tools, which is really just the preprocessor or some other form of code-generation. As an example, say the VS can load a tangent frame from the vertex data and pass that to the PS. In order to create a preprocessor macro that can be used with #if/#ifdef, you need to know what high-level features are present in the pixel shader that will end up using that tangent frame. Therefore you might end up with several macros such as ENABLE_NORMAL_MAPS, ENABLE_PARALLAX_MAPS, ENABLE_ANISOTROPIC_SPECULAR, etc. which all need to get used to conditionally include that tangent frame data in the VS output struct. If these macros fall out of step with what the actual code is doing, then the tangent frame can end up getting included even though in the pixel shader the optimizer dead-strips all code that reads it. It seems like it would perhaps be more ideal to lean on the compiler's ability to determine which attributes are used or unused in this case.
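The macro juggling described above looks roughly like the following sketch. It is written in C++ rather than HLSL so that it compiles on its own, but the preprocessor logic would be the same in a shader. The `float3`/`float4` stand-ins, the `VSOutput` layout, and the `VS_OUTPUT_TANGENT_FRAME` helper macro are hypothetical; only the three `ENABLE_*` names come from the comment.

```cpp
#include <cstddef>

// Stand-ins for the HLSL vector types so the snippet is self-contained.
struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

// Feature switches normally supplied per permutation with -D on the command line.
#define ENABLE_NORMAL_MAPS 1

// Every PS feature that reads the tangent frame has to be listed here by hand.
// If a new feature is added and this list is not updated (or a listed feature
// stops reading the tangent frame), the VS output no longer matches actual use.
#if defined(ENABLE_NORMAL_MAPS) || defined(ENABLE_PARALLAX_MAPS) || \
    defined(ENABLE_ANISOTROPIC_SPECULAR)
#define VS_OUTPUT_TANGENT_FRAME 1
#else
#define VS_OUTPUT_TANGENT_FRAME 0
#endif

struct VSOutput {
    float4 position;   // SV_Position in the real shader
#if VS_OUTPUT_TANGENT_FRAME
    float3 tangent;    // interpolated even if the PS optimizer later drops every read
    float3 bitangent;
#endif
};
```

The `#if` condition is a hand-maintained copy of information the compiler can already derive from the pixel shader, which is the duplication this feature request is meant to remove.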

In theory this sort of functionality could be done at PSO creation time in the driver's back-end compiler, but my understanding is that this is not always (or typically) done because there is pressure to minimize PSO creation time. The choice to permute the shader instead of sharing shaders between PSOs is also complex, and may not be an obvious win. Optimizations done at this stage also can't be visible to the calling application, so it can't affect reflection data or things of that nature.

As for the interface, being able to specify both the VS and PS and receive output from both seems like it would probably be ideal in terms of minimizing compilation time. An alternative might be to still compile the VS and PS separately in two different compiler invocations, but allow passing the file and parameters for the matching stage.

jeremyong commented 2 years ago

> As for the interface, being able to specify both the VS and PS and receive output from both seems like it would probably be ideal in terms of minimizing compilation time. An alternative might be to still compile the VS and PS separately in two different compiler invocations, but allow passing the file and parameters for the matching stage.

This is a great alternative interface in that while we might lose a bit of compiler throughput, it may actually be preferable in a lot of workflows where you have, e.g. 1 vertex shader that needs to be paired with N pixel shaders (or the less likely inverse case).

TomHammersley commented 1 year ago

We'd also be really interested in this. Currently going through the pain of introducing lots of preprocessor magic to address this problem.

Realistically we'll never solve it 100% this way, and we'll always be leaving some perf on the table as a result.