intel / intel-graphics-compiler

Other
596 stars 155 forks source link

Strange result when using DPAS with local memory #279

Closed masahi closed 1 year ago

masahi commented 1 year ago

Hi, I'm continuing to investigate the strange correctness issue I described in https://github.com/intel/intel-graphics-compiler/issues/275 (one of two DPAS calls is returning a zero matrix, i.e. the values of C matrix in A * B + C). I found that the issue arises when all of the followings are true:

(1) DPAS is used (2) local memory is used (3) Multiple DPAS are executed by a single workgroup, either via a single subgroup doing multiple DPAS or multiple subgroups (say, in a workgroup with 2 x 8 threads) doing one DPAS each.

Any combination of the two conditions above works fine, so for example if I don't use local memory or DPAS at all, everything is fine. Or using DPAS + local memory to compute a single fp16 m8n8k16 matmul is also fine.

I wrote a SYCL program that demonstrates what I'm trying to do with a custom DSL + SPIRV code generator (TVM). https://gist.github.com/masahi/92f45c2f9ab456259fbb8903c864b2d3

Interestingly, the SYCL program above works correctly. So I tried to compare the difference in the generated SPIRV between clang and the custom code generator in TVM, and also the final ISA generated for the two cases.

Here are the two IGC dumps: dump_sycl.zip (result is correct) dump_tvm.zip (result is incorrect)

The two SPIRV look very similar, the only big difference is how the local memory is declared. While the SYCL one passes local memory as a argument to the kernel, the TVM one declares it as a global variable with a static allocation size. So for the latter case, LLVM IR begins with:

@main_kernel0.X_shared = addrspace(3) global [256 x float] undef, section "localSLM", align 4
@main_kernel0.W_shared = addrspace(3) global [128 x float] undef, section "localSLM", align 4
...
define spir_kernel void @main_kernel0(float addrspace(1)* noalias %X, float addrspace(1)* noalias %W, float addrspace(1)* noalias %compute, <8 x i32> %r0, <8 x i32> %payloadHeader, i16 %localIdX, i16 %localIdY, i16 %localIdZ, i32 %bufferOffset, i32 %bufferOffset1, i32 %bufferOffset2) #0 {
entry:
..

And comparing the two ISA, I found that the SYCL one has sync.allrd instruction before DPAS:

        sync.nop                             null                             {Compacted,A@1}        // $202
        sync.allrd                           ($0,$1)                                                 // $202
        dpas.8x8 (8|M0)          r31:f         null:f            r23:hf            r63.0:hf         {$8} // $202

while the TVM doesn't:

        sync.nop                             null                             {Compacted,F@2}        // $74
        dpas.8x8 (8|M0)          r77:f         null:f            r69:hf            r45.0:hf         {$8} // $74

If I remove the use of local memory from the SYCL program, the sync.allrd is not generated. So I'm assuming that the presence of sync.allrd is unique to the usage of local memory, and if the absence of a necessary sync.allrd can cause a trouble for DPAS, that would explain the correctness issue I'm observing. I spent a lot of time debugging this issue, and I'm increasingly leaning toward thinking that this is another IGC miscompilation bug, similar to https://github.com/intel/intel-graphics-compiler/issues/275.

So here are my questions:

cc @mbelicki @mnaczk

DianaChen commented 1 year ago
masahi commented 1 year ago

In SYCL case it has WAR dependency to previous store instructions, so sync.allrd is inserted to ensure the order. In TVM case, the dpas only has dependency to some previous mov instructions (none ld/st instructions) so sync.allrd is not required.

Interesting. The two programs, produced by SYCL and TVM, are supposed to do exactly the same thing. So if one program has WAR dep while the other doesn't, something must be off. Note that all of those mov before dpas in TVM is moving from the result of load.slm.d32.a32, which in turn should be loading from the result of store.slm.d32x4.a32 (I hope). Also note that, of the output matrix of shape 8x16 or 16x8, one 8x8 half of the output is always computed correctly. So the TVM program is not completely wrong.

If I compare the two LLVM IRs, they look pretty much the same to me. And the LLVM IR for TVM (shown below) clearly shows that matrix mad is operating on the result of two loads from SLM (X_shared and W_shared). So, can you tell me why IGC has decided that there is no WAR dep for TVM?

  %42 = getelementptr inbounds [256 x half], [256 x half] addrspace(3)* @X_shared, i32 %41
  %43 = addrspacecast [256 x half] addrspace(3)* %42 to half addrspace(4)*
  %44 = bitcast half addrspace(4)* %43 to i8 addrspace(4)*
  %45 = call %intel.joint_matrix_acc_8x16_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELacc_8x16_f16_p4i8_i32_i32(i8 addrspace(4)* %44, i32 16, i32 0)
  %46 = getelementptr inbounds [128 x half], [128 x half] addrspace(3)* @W_shared, i32 0
  %47 = addrspacecast [128 x half] addrspace(3)* %46 to half addrspace(4)*
  %48 = bitcast half addrspace(4)* %47 to i8 addrspace(4)*
  %49 = call %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELpackedB_16x8_f16_p4i8_i32_i32(i8 addrspace(4)* %48, i32 16, i32 3)
  %50 = call %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* @__builtin_spirv_OpJointMatrixMadINTEL_i64_i64_i64(%intel.joint_matrix_acc_8x16_f16_t addrspace(1)* %45, %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* %49, %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* %5)
DianaChen commented 1 year ago

So, can you tell me why IGC has decided that there is no WAR dep for TVM?

The different inputs (like you found, slm is passed in to the kernel differently) could've dramatically affected the final generated code. SWSB is work upon the final generated code instead of llvm-IR. There is no one-one mapping from llvm-IR to check the correctness of SWSB. In the case when I said "In SYCL case it has WAR dependency to previous store instructions", the store instructions mentioned here has nothing to do with the contents used by dpas (or at least, not directly used). The dependency is set only because the register picked for dpas and that store is the same (it does NOT imply the result for the store is used by the dpas).

masahi commented 1 year ago

I have no idea what's wrong with the TVM program yet, but from the discussion above, the absence of sync.allrd doesn't seem to be the issue. I'm closing this for now and I'll do more investigation.