microsoft / DirectXShaderCompiler

This repo hosts the source for the DirectX Shader Compiler which is based on LLVM/Clang.
Other
3.05k stars 677 forks source link

Incorrect code for waterfall loop in VS shader #5302

Open dmpots opened 1 year ago

dmpots commented 1 year ago

If I write a waterfall loop to implement my own version of NonUniformResourceIndex

// dxc /Tps_6_0 .\break.hlsl /DOUTPUT=SV_Target
// dxc /Tvs_6_0 .\break.hlsl /DOUTPUT=Z
StructuredBuffer<int> mainBuf[]: register(t2, space0);

[RootSignature("DescriptorTable(SRV(t2, numDescriptors=UNBOUNDED))")]
int main(int a : A, int b : B, int c : C) : OUTPUT
{
  int res = 0;

  for (;;) {
      int u = WaveReadLaneFirst(a);
      if (a == u) {
          res += mainBuf[u][b];
          break;
        }
    }
  return res;
}

It works fine if I compile the shader as a PS (dxc /Tps_6_0 .\break.hlsl /DOUTPUT=SV_Target). The buffer load is kept in the body of the loop using the dx.break mechanism.

  %1 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @dx.break.cond, i32 0, i32 0)
  %2 = icmp eq i32 %1, 0
  %3 = call i32 @dx.op.loadInput.i32(i32 4, i32 1, i32 0, i8 0, i32 undef)  ; LoadInput(inputSigId,rowIndex,colIndex,gsVertexAxis)
  %4 = call i32 @dx.op.loadInput.i32(i32 4, i32 0, i32 0, i8 0, i32 undef)  ; LoadInput(inputSigId,rowIndex,colIndex,gsVertexAxis)
  br label %5

; <label>:5                                       ; preds = %15, %0
  %6 = phi i32 [ 0, %0 ], [ %16, %15 ]
  %7 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %4)  ; WaveReadLaneFirst(value)
  %8 = icmp eq i32 %4, %7
  br i1 %8, label %9, label %15

; <label>:9                                       ; preds = %5
  %10 = add i32 %7, 2
  %11 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 %10, i1 false)  ; CreateHandle(resourceClass,rangeId,index,nonUniformIndex)
  %12 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %11, i32 %3, i32 0)  ; BufferLoad(srv,index,wot)
  %13 = extractvalue %dx.types.ResRet.i32 %12, 0
  %14 = add nsw i32 %13, %6
  br i1 %2, label %17, label %15

; <label>:15                                      ; preds = %9, %5
  %16 = phi i32 [ %14, %9 ], [ %6, %5 ]
  br label %5

; <label>:17                                      ; preds = %9
  call void @dx.op.storeOutput.i32(i32 5, i32 0, i32 0, i8 0, i32 %14)  ; StoreOutput(outputSigId,rowIndex,colIndex,value)
  ret void

However, if I compile the shader as a VS (dxc /Tvs_6_0 .\break.hlsl /DOUTPUT=Z) the buffer load is moved outside the loop body.

  %1 = call i32 @dx.op.loadInput.i32(i32 4, i32 0, i32 0, i8 0, i32 undef)  ; LoadInput(inputSigId,rowIndex,colIndex,gsVertexAxis)
  br label %2

; <label>:2                                       ; preds = %2, %0
  %3 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %1)  ; WaveReadLaneFirst(value)
  %4 = icmp eq i32 %1, %3
  br i1 %4, label %5, label %2

; <label>:5                                       ; preds = %2
  %6 = call i32 @dx.op.loadInput.i32(i32 4, i32 1, i32 0, i8 0, i32 undef)  ; LoadInput(inputSigId,rowIndex,colIndex,gsVertexAxis)
  %7 = add i32 %3, 2
  %8 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 %7, i1 false)  ; CreateHandle(resourceClass,rangeId,index,nonUniformIndex)
  %9 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %8, i32 %6, i32 0)  ; BufferLoad(srv,index,wot)
  %10 = extractvalue %dx.types.ResRet.i32 %9, 0
  call void @dx.op.storeOutput.i32(i32 5, i32 0, i32 0, i8 0, i32 %10)  ; StoreOutput(outputSigId,rowIndex,colIndex,value)
  ret void

I don't see any reason why the VS codegen should be different here.

The dx.break mechanism was added in #2795 to keep wave-sensitive values in a loop.

simondeschenes commented 1 year ago

This is a bug we experienced since 2019. Our workaround is to have a second conditional in the loop with a second WaveReadLaneFirst around the break.

Here is something that should work and unblock you in the meantime.

while (!IsHelperLane()) // support for any shader type
{
    if (a == WaveReadLaneFirst(a))
    {
        // it is necessary to read from the first lane 
        // many times repeatedly like this to prevent 
        // the compiler from combining the 2 conditionals 
        // which leads to the same problematic result
        int u = WaveReadLaneFirst(a);
        ... your code ...
    }

    if (a == WaveReadLaneFirst(a))
    {
        break;
    }
}

On the net the problem has been mentionned multiple times.

Here under the "The classic subgroup breakage – maximal convergence gone awry" section: https://themaister.net/blog/2022/04/24/my-personal-hell-of-translating-dxil-to-spir-v

Or here where we see the need for a second branch: http://shader-playground.timjones.io/2fc43f77f626c4883e1b1ed23fd8b2e1

The problem has also been reported here many times, but somehow I can find the issues with a simple search right now.

I would also like to see some movement on this because this bug has been a problem for 4 years already.

StarsX commented 1 year ago

Try adding a barrier for workaround?

dmpots commented 1 year ago

This is a bug we experienced since 2019. Our workaround is to have a second conditional in the loop with a second WaveReadLaneFirst around the break.

@simondeschenes I believe this case should be fixed with the introduction of the dx.break in #2795 (and improved with later PRs). If I try putting the break in the loop for your example, it does keep the break in the body of the loop and guards the exit with the additional dx.break comparison.

https://shader-playground.timjones.io/54c25778c86a04519b8cb2f89e2f59bb

If you have cases where things are incorrectly moved out of a loop it would be great to file a bug. I think the old bugs are considered closed by the previously mentioned PR.

The bug I am reporting here is that the same dx.break mechanism does not work for VS. The fix seems to currently be restricted to PS and CS.

simondeschenes commented 1 year ago

Ok, I see dx.break.cond code in the shader playground which should be correct. My failing cases also seem to all be in the vertex shaders now. It is nice to know there is a working fix for the compute and pixel shaders passes. I just assumed that if our vertex shader was still broken it was still broken everywhere.

The workaround I posted should still allow you to have something that works in the Vertex Shader until the vertex shader is also fixed.

simondeschenes commented 1 year ago

I don't have the time/rights/patience to go through PR and creating new tests to fix the bug. However, I think I found it and someone else could fix it and make a pull request.

In tools/clang/lib/CodeGen/CGHLSLMS.cpp, in function EmitHLSLCondBreak, around line 4660:

  if (!m_pHLModule->GetShaderModel()->IsPS() && !m_pHLModule->GetShaderModel()->IsCS() &&
      !m_pHLModule->GetShaderModel()->IsLib()) {
    return CGF.Builder.CreateBr(DestBB);
  }

I don't know why it is limited to LIB, PS and CS. However, this is what seems to prevent the fix from applying to vertex, geometry, hull, domain, ray, mesh and amplification shaders.