microsoft / DirectXShaderCompiler

This repo hosts the source for the DirectX Shader Compiler which is based on LLVM/Clang.
Other
3.11k stars 692 forks source link

dxc moves wave intrisics out of a waterfall loop #6109

Open dmpots opened 11 months ago

dmpots commented 11 months ago

Description The output of a wave intrinsic depends on the current state of which threads are active. When a wave intrinsic occurs in a loop it needs to execute in the loop body since the state of active threads changes over time (threads can exit the loop at different times). The shader in this bug shows a case where dxc is incorrectly moving wave intrinsics and their dependent operations out of a loop.

In this shader the InterlockedOr operation is only done once outside the loop, but it should be done inside the loop for each unique value of arrayIndex.

The bug only repros with lifetime markers enabled. If I compile with them off (either by default in the shader model or explicitly with a flag) the code looks correct. My guess is maybe jump-threading is interfering with the dx.break calls working as expected. Note that we also get correct code with partial lifetime markers -opt-enable partial-lifetime-markers.

Steps to Reproduce

// dxc /Tps_6_6 bug.hlsl
RWBuffer<uint> rwOutput : register(u0);

struct PixelInput {
        uint m_InstanceID       : PARAM0;
};

[earlydepthstencil]
void main(PixelInput input)
{
        uint arrayIndex = input.m_InstanceID >> 5;
        uint bitIndex = (input.m_InstanceID & 31);

        bool isThisLaneNotDone = true;

        do {
                if (isThisLaneNotDone) {
                        uint firstLaneArrayIndex = WaveReadLaneFirst(arrayIndex);

                        if (firstLaneArrayIndex == arrayIndex) {
                                uint mask = WaveActiveBitOr(1u << bitIndex);

                                if (WaveIsFirstLane()) {
                                        InterlockedOr(rwOutput[firstLaneArrayIndex], mask);
                                }

                                isThisLaneNotDone = false;
                        }
                }
        } while (isThisLaneNotDone);

}

Lifetime markers on

dxc /Tps_6_6 bug.hlsl dxc /Tps_6_5 -enable-lifetime-markers bug.hlsl image

Lifetime markers off

dxc /Tps_6_5 bug.hlsl dxc /Tps_6_6 -disable-lifetime-markers bug.hlsl image

Actual Behavior The CFG with lifetime markers on shows the atomic operation is moved out of the loop.

Environment

simondeschenes commented 11 months ago

I feel like this is a problem also related to https://github.com/microsoft/DirectXShaderCompiler/issues/5302#issuecomment-1595879233

The workaround we use for the vertex shaders, we also had to keep it for the compute shaders because we have hit this exact problem even if there was a dx.break added.

The dx.break code emission seems to be based on pattern recognition and you don't need to do much changes to have things moved out of the loop.

Just switching from while(true) ... break; to while(!isDone) ... isDone = true: is enough to prevent the dx.break workaround from being emitted.

I was waiting to see progress on that other issue and the new year before commenting more on what I discovered on the wave intrinsics issues.

I think, the flaw is fundamentally that the compiler doesn't know there is more than one thread executing at a time in lockstep with those intrinsics. The returned value from the wave intrinsics should be considered volatile.

If WaveReadLaneFirst(...) and WaveIsFirstLane() were volatile, the problems would probably go away. I was still trying to wrap my head around the DXC code to try it when I went on vacation.

It would actually be the correct semantic because the compiler can't assume, like it seems to currently do, that the same call to the same wave intrinsic with the same parameter(s) will give the same result.

The problem with that approach is that the compiler won't be able to correctly identify dead code using wave intrinsics. I'd still prefer that issue to the current one where things get moved out of the loop or the loop gets completely removed.

I hope my posts helps in identifying the problem.