Strange result when using DPAS with local memory

Hi, I'm continuing to investigate the strange correctness issue I described in https://github.com/intel/intel-graphics-compiler/issues/275 (one of two DPAS calls is returning a zero matrix, i.e. the values of C matrix in A * B + C). I found that the issue arises when all of the followings are true:

(1) DPAS is used (2) local memory is used (3) Multiple DPAS are executed by a single workgroup, either via a single subgroup doing multiple DPAS or multiple subgroups (say, in a workgroup with 2 x 8 threads) doing one DPAS each.

Any combination of the two conditions above works fine, so for example if I don't use local memory or DPAS at all, everything is fine. Or using DPAS + local memory to compute a single fp16 m8n8k16 matmul is also fine.

I wrote a SYCL program that demonstrates what I'm trying to do with a custom DSL + SPIRV code generator (TVM). https://gist.github.com/masahi/92f45c2f9ab456259fbb8903c864b2d3

Interestingly, the SYCL program above works correctly. So I tried to compare the difference in the generated SPIRV between clang and the custom code generator in TVM, and also the final ISA generated for the two cases.

Here are the two IGC dumps: dump_sycl.zip (result is correct) dump_tvm.zip (result is incorrect)

The two SPIRV look very similar, the only big difference is how the local memory is declared. While the SYCL one passes local memory as a argument to the kernel, the TVM one declares it as a global variable with a static allocation size. So for the latter case, LLVM IR begins with:

@main_kernel0.X_shared = addrspace(3) global [256 x float] undef, section "localSLM", align 4
@main_kernel0.W_shared = addrspace(3) global [128 x float] undef, section "localSLM", align 4
...
define spir_kernel void @main_kernel0(float addrspace(1)* noalias %X, float addrspace(1)* noalias %W, float addrspace(1)* noalias %compute, <8 x i32> %r0, <8 x i32> %payloadHeader, i16 %localIdX, i16 %localIdY, i16 %localIdZ, i32 %bufferOffset, i32 %bufferOffset1, i32 %bufferOffset2) #0 {
entry:
..

And comparing the two ISA, I found that the SYCL one has sync.allrd instruction before DPAS:

        sync.nop                             null                             {Compacted,A@1}        // $202
        sync.allrd                           ($0,$1)                                                 // $202
        dpas.8x8 (8|M0)          r31:f         null:f            r23:hf            r63.0:hf         {$8} // $202

while the TVM doesn't:

        sync.nop                             null                             {Compacted,F@2}        // $74
        dpas.8x8 (8|M0)          r77:f         null:f            r69:hf            r45.0:hf         {$8} // $74

If I remove the use of local memory from the SYCL program, the sync.allrd is not generated. So I'm assuming that the presence of sync.allrd is unique to the usage of local memory, and if the absence of a necessary sync.allrd can cause a trouble for DPAS, that would explain the correctness issue I'm observing. I spent a lot of time debugging this issue, and I'm increasingly leaning toward thinking that this is another IGC miscompilation bug, similar to https://github.com/intel/intel-graphics-compiler/issues/275.

So here are my questions:

Does the difference in how local memory is declared (global variable vs passed as a parameter from host) matter at all?
Can an absence of sync.allrd instruction completely break the use of DPAS?
Why sync.allrd is not generated by IGC from the SPIRV produced by TVM? (see dump_tvm.zip above)

cc @mbelicki @mnaczk

Does the difference in how local memory is declared (global variable vs passed as a parameter from host) matter at all? -- The way how local mem is declared is decided by Frontends. Both ways are supported in IGC.
Can an absence of sync.allrd instruction completely break the use of DPAS? Why sync.allrd is not generated by IGC from the SPIRV produced by TVM? (see dump_tvm.zip above) -- No. Having sync.allrd or not is orthogonal to dpas. Those are swsb insertion behavior according to what instructions this dpas has dependency to. In SYCL case it has WAR dependency to previous store instructions, so sync.allrd is inserted to ensure the order. In TVM case, the dpas only has dependency to some previous mov instructions (none ld/st instructions) so sync.allrd is not required.

In SYCL case it has WAR dependency to previous store instructions, so sync.allrd is inserted to ensure the order. In TVM case, the dpas only has dependency to some previous mov instructions (none ld/st instructions) so sync.allrd is not required.

Interesting. The two programs, produced by SYCL and TVM, are supposed to do exactly the same thing. So if one program has WAR dep while the other doesn't, something must be off. Note that all of those mov before dpas in TVM is moving from the result of load.slm.d32.a32, which in turn should be loading from the result of store.slm.d32x4.a32 (I hope). Also note that, of the output matrix of shape 8x16 or 16x8, one 8x8 half of the output is always computed correctly. So the TVM program is not completely wrong.

If I compare the two LLVM IRs, they look pretty much the same to me. And the LLVM IR for TVM (shown below) clearly shows that matrix mad is operating on the result of two loads from SLM (X_shared and W_shared). So, can you tell me why IGC has decided that there is no WAR dep for TVM?

  %42 = getelementptr inbounds [256 x half], [256 x half] addrspace(3)* @X_shared, i32 %41
  %43 = addrspacecast [256 x half] addrspace(3)* %42 to half addrspace(4)*
  %44 = bitcast half addrspace(4)* %43 to i8 addrspace(4)*
  %45 = call %intel.joint_matrix_acc_8x16_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELacc_8x16_f16_p4i8_i32_i32(i8 addrspace(4)* %44, i32 16, i32 0)
  %46 = getelementptr inbounds [128 x half], [128 x half] addrspace(3)* @W_shared, i32 0
  %47 = addrspacecast [128 x half] addrspace(3)* %46 to half addrspace(4)*
  %48 = bitcast half addrspace(4)* %47 to i8 addrspace(4)*
  %49 = call %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELpackedB_16x8_f16_p4i8_i32_i32(i8 addrspace(4)* %48, i32 16, i32 3)
  %50 = call %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* @__builtin_spirv_OpJointMatrixMadINTEL_i64_i64_i64(%intel.joint_matrix_acc_8x16_f16_t addrspace(1)* %45, %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* %49, %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* %5)

So, can you tell me why IGC has decided that there is no WAR dep for TVM?

The different inputs (like you found, slm is passed in to the kernel differently) could've dramatically affected the final generated code. SWSB is work upon the final generated code instead of llvm-IR. There is no one-one mapping from llvm-IR to check the correctness of SWSB. In the case when I said "In SYCL case it has WAR dependency to previous store instructions", the store instructions mentioned here has nothing to do with the contents used by dpas (or at least, not directly used). The dependency is set only because the register picked for dpas and that store is the same (it does NOT imply the result for the store is used by the dpas).

I have no idea what's wrong with the TVM program yet, but from the discussion above, the absence of sync.allrd doesn't seem to be the issue. I'm closing this for now and I'll do more investigation.

intel / intel-graphics-compiler

Strange result when using DPAS with local memory #279