intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[IGC][operators/test_matmul.py::test_op] one test variant fails after the upgrade to 1.0.16510.18 #1182

Closed vlad-penkin closed 2 days ago

vlad-penkin commented 1 month ago

IGC version: 1.0.16510.18
Test variant: [128-256-64-1-8-3-256-512-160-True-True-float32-float32-None-True-None-None]

vlad-penkin commented 1 month ago

The issue is still reproducible with the new Agama Rolling 881.19.

whitneywhtsang commented 4 weeks ago

The issue is still reproducible with open-linux-driver-ci-dev_igc-17139.

whitneywhtsang commented 4 weeks ago

After https://github.com/intel/intel-xpu-backend-for-triton/pull/853, there are two more failures, likely due to the same problem:

FAILED operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-False-float32-float32-None-True-None-None] - AssertionError: Tensor-likes are not close!
FAILED operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-False-float32-float32-None-True-None-None] - AssertionError: Tensor-likes are not close!
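
To see how the three failing variants relate, it can help to compare their parametrized IDs field by field. The sketch below is only an illustration: the helper name `varying_positions` is made up here, and it assumes that no individual field contains a `-`, which holds for the IDs quoted in this thread but not for pytest IDs in general.

```python
# Parametrized test IDs copied from the failure reports in this thread.
FAILING_IDS = [
    "128-256-64-1-8-3-256-512-160-True-True-float32-float32-None-True-None-None",
    "128-256-64-1-8-3-256-512-160-True-False-float32-float32-None-True-None-None",
    "128-256-64-1-8-3-256-512-160-False-False-float32-float32-None-True-None-None",
]


def varying_positions(test_ids):
    """Return {zero-based field index: sorted distinct values} for fields that differ.

    Hypothetical helper: assumes fields are separated by '-' and that no
    field itself contains '-'.
    """
    rows = [tid.split("-") for tid in test_ids]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("test IDs have different numbers of fields")
    return {
        index: sorted(set(column))
        for index, column in enumerate(zip(*rows))
        if len(set(column)) > 1
    }


diff = varying_positions(FAILING_IDS)
print(diff)
```

For these three IDs only the two boolean fields after the 256-512-160 sizes differ; every block-size, shape, and dtype field is identical, which is consistent with the failures sharing one cause.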

AshburnLee commented 2 weeks ago

  • May I know what the corresponding Agama version of 1.0.16510.18 is?
  • May I know the target branch or commit? @whitneywhtsang @vlad-penkin

whitneywhtsang commented 2 weeks ago

You can check the IGC version with dpkg -l | grep libigc1. For this particular issue, it started failing in Agama 881.12. Please check whether it passes on the pre-release driver 914.16 at Triton commit 61042a1031e97d2f0b39139ba324f8dc5e8294b3 with https://github.com/intel/intel-xpu-backend-for-triton/pull/1443.
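
If the version check needs to be scripted, the dpkg listing can be parsed directly. This is a minimal sketch, not part of the repository: `parse_igc_version` and `installed_igc_version` are hypothetical helpers, and the sample line is an assumed shape of a `dpkg -l` row (the real version string may carry a distribution suffix).

```python
import subprocess


def parse_igc_version(dpkg_output):
    """Return the version column of the libigc1 row in `dpkg -l` output, or None."""
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Row layout: status, name, version, architecture, description...
        # The name may carry an architecture qualifier such as "libigc1:amd64".
        if len(fields) >= 3 and fields[1].split(":")[0] == "libigc1":
            return fields[2]
    return None


def installed_igc_version():
    """Query dpkg on the local machine (Debian/Ubuntu only)."""
    result = subprocess.run(
        ["dpkg", "-l"], capture_output=True, text=True, check=True
    )
    return parse_igc_version(result.stdout)


# Assumed sample row for illustration; not captured from a real machine.
SAMPLE = "ii  libigc1  1.0.16510.18  amd64  Intel graphics compiler for OpenCL"
print(parse_igc_version(SAMPLE))
```

Unlike `grep libigc1`, this matches the package name exactly, so related packages whose names merely start with `libigc1` are ignored.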

AshburnLee commented 2 weeks ago

self = <triton.compiler.compiler.CompiledKernel object at 0x7f3a94586a10>

def _init_handles(self):
    if self.module is not None:
        return
    device = driver.active.get_current_device()
    # create launcher
    self.run = driver.active.launcher_cls(self.src, self.metadata)
    # not enough shared memory to run the kernel
    max_shared = driver.active.utils.get_device_properties(device)["max_shared_mem"]
    if self.metadata.shared > max_shared:
        raise OutOfResources(self.metadata.shared, max_shared, "shared memory")
    # TODO: n_regs, n_spills should be metadata generated when calling `ptxas`
    self.module, self.function, self.n_regs, self.n_spills = driver.active.utils.load_binary(
        self.name, self.kernel, self.metadata.shared, device)
E RuntimeError: Triton Error [ZE]: 0x78000018

python/triton/compiler/compiler.py:376: RuntimeError
============================== warnings summary ==============================
../../mambaforge/envs/junhui-py310/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/tpp/__init__.py:1
  /home/lijunhui/mambaforge/envs/junhui-py310/lib/python3.10/site-packages/intel_extension_for_pytorch/cpu/tpp/__init__.py:1: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================== short test summary info ===========================
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-True-float32-float32-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000018
======================= 1 failed, 1 warning in 12.77s ========================

whitneywhtsang commented 2 weeks ago
  • May I know what the corresponding Agama version of 1.0.16510.18 is?
  • May I know the target branch or commit? @whitneywhtsang @vlad-penkin

You can check the IGC version with dpkg -l | grep libigc1. For this particular issue, it started failing in Agama 881.12. Please check whether it passes on the pre-release driver 914.16 at Triton commit 61042a1 with #1443.

I followed the setup described here, and the 3 matmul tests pass on the 914.16 driver + 0.5.2 PTDB.

tdeng5 commented 1 week ago

Could we automate this kind of testing, e.g. automatically detect and switch drivers, run the CI, and then email the results?
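
As a rough starting point for the reporting half of that idea, the per-driver results could be turned into a mail message with the Python standard library. This is only a sketch under assumptions: `build_report`, the subject format, and the shape of `results` (test ID mapped to pass/fail) are hypothetical, and detecting or switching drivers is not covered.

```python
from email.message import EmailMessage


def build_report(driver, results, sender, recipient):
    """Build (but do not send) a summary mail for one driver version.

    Hypothetical helper: `results` maps a test ID to True (passed) or
    False (failed).
    """
    failed = sorted(test for test, passed in results.items() if not passed)

    msg = EmailMessage()
    msg["Subject"] = f"Driver {driver}: {len(failed)} failed / {len(results)} run"
    msg["From"] = sender
    msg["To"] = recipient

    lines = [f"Driver: {driver}", ""]
    if failed:
        lines.extend(f"FAILED {test}" for test in failed)
    else:
        lines.append("All tests passed.")
    msg.set_content("\n".join(lines))
    return msg


# Sending is left to the caller, for example:
#   import smtplib
#   with smtplib.SMTP("localhost") as server:  # assumed local relay
#       server.send_message(msg)
```

Keeping message construction separate from sending means the report format can be checked in CI without needing a mail server.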