Closed masahi closed 1 year ago
In SYCL case it has WAR dependency to previous store instructions, so sync.allrd is inserted to ensure the order. In TVM case, the dpas only has dependency to some previous mov instructions (none ld/st instructions) so sync.allrd is not required.
Interesting. The two programs, produced by SYCL and TVM, are supposed to do exactly the same thing. So if one program has WAR dep while the other doesn't, something must be off. Note that all of those mov
before dpas in TVM is moving from the result of load.slm.d32.a32
, which in turn should be loading from the result of store.slm.d32x4.a32
(I hope). Also note that, of the output matrix of shape 8x16 or 16x8, one 8x8 half of the output is always computed correctly. So the TVM program is not completely wrong.
If I compare the two LLVM IRs, they look pretty much the same to me. And the LLVM IR for TVM (shown below) clearly shows that matrix mad is operating on the result of two loads from SLM (X_shared
and W_shared
). So, can you tell me why IGC has decided that there is no WAR dep for TVM?
%42 = getelementptr inbounds [256 x half], [256 x half] addrspace(3)* @X_shared, i32 %41
%43 = addrspacecast [256 x half] addrspace(3)* %42 to half addrspace(4)*
%44 = bitcast half addrspace(4)* %43 to i8 addrspace(4)*
%45 = call %intel.joint_matrix_acc_8x16_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELacc_8x16_f16_p4i8_i32_i32(i8 addrspace(4)* %44, i32 16, i32 0)
%46 = getelementptr inbounds [128 x half], [128 x half] addrspace(3)* @W_shared, i32 0
%47 = addrspacecast [128 x half] addrspace(3)* %46 to half addrspace(4)*
%48 = bitcast half addrspace(4)* %47 to i8 addrspace(4)*
%49 = call %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* @__builtin_spirv_OpJointMatrixLoadINTELpackedB_16x8_f16_p4i8_i32_i32(i8 addrspace(4)* %48, i32 16, i32 3)
%50 = call %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* @__builtin_spirv_OpJointMatrixMadINTEL_i64_i64_i64(%intel.joint_matrix_acc_8x16_f16_t addrspace(1)* %45, %intel.joint_matrix_packedB_16x8_f16_t addrspace(1)* %49, %intel.joint_matrix_acc_8x8_f32_t addrspace(1)* %5)
So, can you tell me why IGC has decided that there is no WAR dep for TVM?
The different inputs (like you found, slm is passed in to the kernel differently) could've dramatically affected the final generated code. SWSB is work upon the final generated code instead of llvm-IR. There is no one-one mapping from llvm-IR to check the correctness of SWSB. In the case when I said "In SYCL case it has WAR dependency to previous store instructions", the store instructions mentioned here has nothing to do with the contents used by dpas (or at least, not directly used). The dependency is set only because the register picked for dpas and that store is the same (it does NOT imply the result for the store is used by the dpas).
I have no idea what's wrong with the TVM program yet, but from the discussion above, the absence of sync.allrd
doesn't seem to be the issue. I'm closing this for now and I'll do more investigation.
Hi, I'm continuing to investigate the strange correctness issue I described in https://github.com/intel/intel-graphics-compiler/issues/275 (one of two DPAS calls is returning a zero matrix, i.e. the values of
C
matrix inA * B + C
). I found that the issue arises when all of the followings are true:(1) DPAS is used (2) local memory is used (3) Multiple DPAS are executed by a single workgroup, either via a single subgroup doing multiple DPAS or multiple subgroups (say, in a workgroup with 2 x 8 threads) doing one DPAS each.
Any combination of the two conditions above works fine, so for example if I don't use local memory or DPAS at all, everything is fine. Or using DPAS + local memory to compute a single fp16 m8n8k16 matmul is also fine.
I wrote a SYCL program that demonstrates what I'm trying to do with a custom DSL + SPIRV code generator (TVM). https://gist.github.com/masahi/92f45c2f9ab456259fbb8903c864b2d3
Interestingly, the SYCL program above works correctly. So I tried to compare the difference in the generated SPIRV between clang and the custom code generator in TVM, and also the final ISA generated for the two cases.
Here are the two IGC dumps: dump_sycl.zip (result is correct) dump_tvm.zip (result is incorrect)
The two SPIRV look very similar, the only big difference is how the local memory is declared. While the SYCL one passes local memory as a argument to the kernel, the TVM one declares it as a global variable with a static allocation size. So for the latter case, LLVM IR begins with:
And comparing the two ISA, I found that the SYCL one has
sync.allrd
instruction before DPAS:while the TVM doesn't:
If I remove the use of local memory from the SYCL program, the
sync.allrd
is not generated. So I'm assuming that the presence ofsync.allrd
is unique to the usage of local memory, and if the absence of a necessarysync.allrd
can cause a trouble for DPAS, that would explain the correctness issue I'm observing. I spent a lot of time debugging this issue, and I'm increasingly leaning toward thinking that this is another IGC miscompilation bug, similar to https://github.com/intel/intel-graphics-compiler/issues/275.So here are my questions:
sync.allrd
instruction completely break the use of DPAS?sync.allrd
is not generated by IGC from the SPIRV produced by TVM? (see dump_tvm.zip above)cc @mbelicki @mnaczk