shader-slang / slang

Making it easier to work with shaders
http://shader-slang.com
Other
2.67k stars 204 forks source link

Poor performance when reading float4x4 uniform with dynamic array index #890

Closed csyonghe closed 4 years ago

csyonghe commented 5 years ago

This may or may not be an issue of Slang, but I have observed a strange performance issue in my shader that is probably worth noting here.

On Vulkan, I defined a uniform float4x4 array in my shader code (in this case, the lightMatrix for cascaded shadow maps), and as long as I am accessing the array in a loop, the performance drops dramatically (from ~200fps to 20fps), even after I changed loop to do nothing else but accessing the data.

The performance issue is solved by manually unrolling the loop, and make sure the shader is only using constants as indices into the array. I also tried to just define float4 arrays, and construct a float4x4 by manually loading 4 float4s, but that doesn't seem to work either. However, the same issue does not exist for float4x3 arrays.

Note that it does not matter whether the code is actually executed or not, the existence of such code causes the slow down of entire pixel shader execution, even if the access is in a branch that is never executed.

I feel that this may have something to do with row_major, and if that is the case, then we definitely should generate code to handle all input fetching ourselves (as we discussed last time).

csyonghe commented 5 years ago

To provide more details, here is the code snippets I am working on:

// shader parameter definition
struct LightingEnvironment
{
    float4 zPlanes[2];
    float4x4 lightMatrix[8];
    float4 sunLightColor;
    float3 sunLightDir;
    int sunLightEnabled;
    int shadowMapId;
    int numCascades;
    int lightCount;
    int lightProbeCount;
    float4 ambient;
    StructuredBuffer<Light> lights;
    StructuredBuffer<LightProbe> lightProbes;
    SamplerState envMapSampler;
    Texture2DArray shadowMapArray;
    SamplerComparisonState shadowMapSampler;
    TextureCubeArray envMap;
    Texture2DArray lightmapArrays[12];
};
ParameterBlock<LightingEnvironment> gLighting;
// ...
// shader code (bad performance)
for (int i = 0; i < 8; i++)
{
    float4 lightSpacePosT;
    lightSpacePosT = lightEnv.lightMatrix[i][0];
    shadow *= lightSpacePosT.x;
}
// shader code (good performance)
for (int i = 0; i < 8; i++)
{
    float4 lightSpacePosT;
    switch (i)
    {
    case 0:
        lightSpacePosT = lightEnv.lightMatrix[0][0];
        break;
    case 1:
        lightSpacePosT = lightEnv.lightMatrix[1][0];
        break;
    case 2:
        lightSpacePosT = lightEnv.lightMatrix[2][0];
        break;
    case 3:
        lightSpacePosT = lightEnv.lightMatrix[3][0];
        break;
    case 4:
        lightSpacePosT = lightEnv.lightMatrix[4][0];
        break;
    case 5:
        lightSpacePosT = lightEnv.lightMatrix[5][0];
        break;
    case 6:
        lightSpacePosT = lightEnv.lightMatrix[6][0];
        break;
    case 7:
        lightSpacePosT = lightEnv.lightMatrix[7][0];
        break;
    }
    shadow *= lightSpacePosT.x;
}
tangent-vector commented 5 years ago

Can you share anything about the graphics API and/or hardware you are using to look at performance?

We should look at the output HLSL/GLSL code and confirm Slang isn’t doing anything silly like generating a local variable that is a copy of the entire array, and then indexing into that.

csyonghe commented 5 years ago

This is Vulkan / 980ti with latest driver.

Looking at the generated GLSL, it first creates a local copy of the uniform fields of the parameter block, then use it to call a function that consumes it:

struct LightingEnvironment_0
{
    vec4  zPlanes_0[2];
    mat4x4  lightMatrix_0[8];
    vec4 sunLightColor_0;
    vec3 sunLightDir_0;
    int sunLightEnabled_0;
    int shadowMapId_0;
    int numCascades_0;
    int lightCount_0;
    int lightProbeCount_0;
    vec4 ambient_0;
};

layout(binding = 0, set = 1)
layout(std140) uniform _S16
{
    LightingEnvironment_0 _data;
} gLighting_0;

void main()
{
    ...
    LightingEnvironment_0 _S80 = gLighting_0._data;
    ...
    vec4 _S85 = PbrMaterialPattern_computeForwardLighting_0(_S78, _S80, gLighting_envMapSampler_0, gLighting_shadowMapArray_0, gLighting_shadowMapSampler_0, gLighting_envMap_0, gLighting_lightmapArrays_0, _S81, _S82, gView_textureSampler_0, _S83, _S84);
    _S20 = _S85;
}

It seems that the reason could be that when accessing the array with a dynamic index, the downstream compiler can no longer simplify the copy, leading to the slowdown.

csyonghe commented 5 years ago

In my shader, I have another parameter block that also defines a large array as uniform input (float4 skinningQuaternions[160]), so these two things combined together might have resulted spilling to global memory.

csyonghe commented 5 years ago

I also noted that despite the hacky fix that brings performance to the normal level, the shaders in my engine still run slower than that before I migrated to Slang(was Spire). So this is probably still a serious issue that needs to be addressed.

tangent-vector commented 5 years ago

Yes, I agree that investigating our overall performance is important. We could try to do directed comparisons of Slang-generated and Spire-generated code to look for major discrepancies.

In this particular case, it would be good to confirm if eliminating the temporary copy of the buffer contents is enough to solve the performance problem, or if we also need to, e.g., inline or specialize the caller function.

We already have a pass to specialize callee functions based on resources that get passed in (this is a correctness rather than optimization pass). We could try to further optimize by specializing functions when one of the arguments is a load from a constant buffer, and specialize the callee based on the loaded-from buffer. We’d need a further pass to turn, e.g., a “getField” from a loaded value into a load from a “getFieldPtr” (and similarly for array references). The hardest part with that stuff would be dealing with any possible aliasing, but we can ignore that for a constant buffer since it isn’t writable.

There’s a lot of room for us to optimize this in our IR, but we need to figure out where the problems in the generated GLSL are coming from.

csyonghe commented 5 years ago

You are right. Verifying this is currently a bit tricky in my engine code since there is no way to bypass Slang and directly load a compiled SPIRV for a shader variant. To confirm this, I will need to implement a shader cache first, then manually replace the entry in the shader cache to a manually provided SPIRV code.

tangent-vector commented 5 years ago

Something I was wondering about along these lines is whether we could hijack the “dump intermediates” option in Slang to support loading from the path of a previously-dumped output instead, if one exists with an appropriate hash or similar to identify if. That would let us swap out hand-written GLSL in place of the generated code in a controlled way. It would be a backup feature, but maybe you could do it just enough to investigate.

The other option is to just try implementing the relevant optimizations and see if if helps. :)

The temporary could probably be eliminated by identifying loads from constant buffers (or pointers derived from constant buffers), and then replacing all uses of those loads with dedicated loads right before each user. The logic in emit.cpp should then be able to not emit the load to a temporary, and just fold it in as part of the function arguments on a call. If that helps, then great. If if doesn’t, then I suspect the issue is needing to inline or specialize callee functions.

I’ve been fighting really hard fo avoid having Slang do the “inline everything” style that is so common in shader compilers, but that could be costing us performance on Vulkan. Doing agressive specialization is enough to give us most of the benefits of inlining with a bit less of an explosion on the amount of generated code.

csyonghe commented 5 years ago

There are enough uncertainties here that it is wise to understand the reason. It could be as simple as that temporary right before the call, or it can be calling the function using the entire struct instead of directly referencing the constant buffer in the callee.

I agree that specialize the callee functions is a cleaner approach given how we currently handle other things.

I'll keep this updated once I figure out what causes the problem.

csyonghe commented 5 years ago

OK, I implemented a shader cache in my engine, and confirmed that specializing functions is neccessary.

Detailed performance results: Original Code (dynamic loop indexing, Slang generated code): 60ms Remove temporary: 20ms Specialize callee: 6ms Loop unroll and use literal constant indexing: 4ms

Btw, here is a screenshot of the engine: image

In this simple scene, I have a skeletal animated character (transform quaternions are uniform arrays), and cascaded shadow map (light matrices are also arrays).

tangent-vector commented 5 years ago

Hello to the couch!

Okay, it sounds like we should devise a pass to specialize callee functions when one or more of the arguments is a load of a pointer derived from a constant buffer.

I think we will also need some general-purpose simplifications to replace value-level operations (extracting fields/elements) with pointer-level operations (extracting pointers fo fields/elements).

csyonghe commented 5 years ago

While in this case specializing helps, in other cases it might be beneficial to load the uniform buffer field once and reuse it over different places.

Instead of specializing, an alternative is to make "loading" from an array explicit, by providing another resource-like type UniformArray<T> that get passed around and specialized. The shader author will need to explicitly call UniformArray.operator[] to load its content. This way, simple fields can still be loaded to local registers, while access to big arrays become an explicit operation.

tangent-vector commented 5 years ago

We could implement something like UniformArray<T> or even something like ConstRef<T> with semantics like C++ T const&, if it turns out that giving users more direct control is desirable.

My guess is that the right default is to always do this specialization when we can (especially for any types that contain arrays that might be dynamically indexed), and rely on the driver compilers to do whatever kind of CSE or redundancy elimination is appropriate for the target.