intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Investigate and re-enable two matmul tests #381

Closed: whitneywhtsang closed this issue 7 months ago

whitneywhtsang commented 9 months ago

Merging https://github.com/intel/intel-xpu-backend-for-triton/commit/e4c91aeb43cbc9743272c19002901c37087b7370 causes two matmul test regressions:

=========================== short test summary info ============================
FAILED operators/test_matmul.py::test_op[128-256-32-1-8-2-None-None-None-False-False-float16-float16-True-True-None-float32] - AssertionError: Tensor-likes are not close!

Mismatched elements: 32768 / 32768 (100.0%)
Greatest absolute difference: 1.3355178833007812 at index (43, 175) (up to 1e-05 allowed)
Greatest relative difference: 5.01923669153143e+37 at index (5, 175) (up to 1.3e-06 allowed)
FAILED operators/test_matmul.py::test_op[128-256-32-1-8-2-None-None-None-False-False-float16-float16-True-True-float32-float32] - AssertionError: Tensor-likes are not close!

Mismatched elements: 32640 / 32768 (99.6%)
Greatest absolute difference: 1.3355178833007812 at index (43, 175) (up to 1e-05 allowed)
Greatest relative difference: 5.01923669153143e+37 at index (5, 175) (up to 1.3e-06 allowed)
============ 2 failed, 703 passed, 164 skipped in 181.66s (0:03:01) ============

The failing tests are skipped in https://github.com/intel/intel-xpu-backend-for-triton/pull/380. This issue tracks investigating the cause of the regression, fixing it, and re-enabling the two skipped tests.

To reproduce:

  git revert 039278477049eadb4bb66bb997871b55cf1de191
  python3 -m pytest --verbose python/test/unit/operators/test_matmul.py

whitneywhtsang commented 9 months ago

Related issue: https://github.com/intel/intel-xpu-backend-for-triton/issues/257

etiotto commented 9 months ago

Tests pass when the produced LLVM IR is compiled by LLVM with no optimization (-O0), and they fail when compiling with -O3. I narrowed down the optimization in LLVM that, if removed from the LLVM optimization pipeline, causes the tests to pass: the pass that exposes the failure is InstCombinePass.
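
A minimal sketch of that kind of bisection, assuming the LLVM IR Triton emitted for the failing kernel has been saved to matmul_kernel.ll (a hypothetical file name); the exact opt pipeline syntax depends on the LLVM version bundled with the backend:

  # The fully optimized pipeline reproduces the failure; leaving the module unoptimized does not.
  opt -passes='default<O3>' matmul_kernel.ll -S -o matmul_kernel.O3.ll
  # Running InstCombine by itself on the otherwise unoptimized module is one way to
  # check whether that single pass is enough to expose the miscompile downstream.
  opt -passes=instcombine matmul_kernel.ll -S -o matmul_kernel.instcombine.ll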

etiotto commented 9 months ago

Disabling code sinking in the LLVM instcombine pass makes the tests pass.
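
For reference, a minimal sketch of that experiment as a standalone opt invocation, again assuming the kernel IR is in matmul_kernel.ll (hypothetical); -instcombine-code-sinking is the cl::opt defined in LLVM's InstructionCombining.cpp, and the flag name may differ between LLVM versions:

  # Run only InstCombine, once with code sinking (the default) and once with it
  # disabled, to compare the IR that is handed to IGC in each case.
  opt -passes=instcombine matmul_kernel.ll -S -o matmul_kernel.sink.ll
  opt -passes=instcombine -instcombine-code-sinking=false \
      matmul_kernel.ll -S -o matmul_kernel.nosink.ll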

vlad-penkin commented 9 months ago

@etiotto can we close this ticket?

etiotto commented 9 months ago

> @etiotto can we close this ticket?

No, the problem is not fixed yet.

etiotto commented 9 months ago

In LLVM's instruction combining pass, code sinking only exposes the problem. The test case passes if up to 279 SSA values are sunk inside a branch, and it fails once 280 SSA values are sunk. The 280th instruction sunk is a shl instruction (%116 = shl i32 %7, 5, !dbg !47 below).

  define spir_kernel void @_kernel_0d1d2d3de4de5de6de7c8de9c10de11c(ptr addrspace(1) nocapture readonly %0, ptr addrspace(1) nocapture readonly %1, ptr addrspace(1) nocapture writeonly %2, i32 %3, i32 %4, i32 %5, i32 %6, i32 %7, i32 %8, ptr addrspace(3) %9) local_unnamed_addr !dbg !9 !intel_reqd_sub_group_size !11 !max_work_group_size !12 {
  ...
  %113 = add i32 %5, 31, !dbg !43
  %114 = sdiv i32 %113, 32, !dbg !45
  %115 = icmp sgt i32 %113, 31, !dbg !46
  br i1 %115, label %.lr.ph, label %._crit_edge, !dbg !46

.lr.ph:                                           ; preds = %10
  %116 = shl i32 %7, 5, !dbg !47              ; the 280th sunk instruction

This should be fine, so we suspect the problem is caused by incorrect compilation of the generated SPIR-V file by the Intel Graphics Compiler (IGC).
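
One way to gather evidence for that suspicion is to capture what IGC actually generates for this kernel; a hedged sketch, assuming IGC's standard shader-dump environment flag is available on the system (flag names and dump locations can vary between IGC releases):

  # Ask IGC to dump its intermediate and final compilation artifacts
  # (typically under /tmp/IntelIGC), then rerun the failing tests.
  export IGC_ShaderDumpEnable=1
  python3 -m pytest --verbose python/test/unit/operators/test_matmul.py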

whitneywhtsang commented 7 months ago

The two skipped matmul tests are now both enabled and passing. Can we close this issue?

etiotto commented 7 months ago

This is now working.