intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[DPAS] The DPAS operation results are not correct when the D type is fp16 on ATS. #400

Closed chengjunlu closed 5 months ago

chengjunlu commented 9 months ago

When lowering tt.dot to DPAS with an fp16 D type, the DPAS results are not correct.

The DPAS op in the GenX-dialect MLIR:

%23884 = genx.matrix.dpas %23613, %23165, %23365 {pa = #genx.precision_type<FP16>, pb = #genx.precision_type<FP16>, rc = 8 : i32} : (vector<8xf16>, vector<8xi32>, vector<8xi32>) -> vector<8xf16> loc(#loc43)

The DPAS op in the LLVM IR:

%2494 = call <8 x half> @llvm.genx.GenISA.sub.group.dpas.v8f16.v8f16.v8i32.v8i32(<8 x half> %2373, <8 x i32> %2246, <8 x i32> %2314, i32 10, i32 10, i32 8, i32 8, i8 0) #1, !dbg !2212

As a workaround, fall back the fp16-fp16-fp16-fp16 tt.dot to FMA. The issue needs further debugging.
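
For reference, a minimal Triton kernel along the lines below exercises this path. It is an illustrative sketch, not the actual test_matmul.py kernel, and assumes a Triton build where tl.dot accepts out_dtype and a PyTorch build with the xpu device:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fp16_dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    # Requesting an fp16 result makes tt.dot carry an fp16 D operand,
    # which is the configuration that lowers to the failing DPAS.
    c = tl.dot(a, b, out_dtype=tl.float16)
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)

a = torch.randn((16, 16), device="xpu", dtype=torch.float16)
b = torch.randn((16, 16), device="xpu", dtype=torch.float16)
c = torch.empty_like(a)
fp16_dot_kernel[(1,)](a, b, c, BLOCK=16)
```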

chengjunlu commented 8 months ago

The GEMM test case with fp16-fp16-fp16-fp16 types passes with DPAS on PVC, but it fails on the ATSM platform.

Set it to P2 priority.

vlad-penkin commented 6 months ago

@alexbaden can we close this ticket?

alexbaden commented 6 months ago

I am looking at A770, not ATS. Without more details from @chengjunlu about which specific test failed I cannot determine if this ticket is still a problem.

chengjunlu commented 6 months ago

I am using an ATSM device (name='Intel(R) Arc(TM) A770 Graphics'). We hit an accuracy issue in the fp16 test cases, for example test_op[16-64-64-1-1-2-None-None-None-False-False-float16-float16-None-True-None-None] in test_matmul.py:

E           Mismatched elements: 110 / 1024 (10.7%)
E           Greatest absolute difference: 0.126953125 at index (5, 18) (up to 1e-05 allowed)
E           Greatest relative difference: 0.49755859375 at index (5, 18) (up to 0.001 allowed)
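
The tolerances in that report match the torch.testing.assert_close defaults for float16 (atol=1e-5, rtol=1e-3). A comparison along these lines reproduces the style of check, though whether test_matmul.py uses exactly this call is an assumption:

```python
import torch

def check_fp16_matmul(triton_out: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> None:
    # Reference: accumulate in fp32, then cast down to fp16 for comparison.
    ref = torch.matmul(a.float(), b.float()).to(torch.float16)
    # float16 tolerances matching the failure report above.
    torch.testing.assert_close(triton_out, ref, atol=1e-5, rtol=1e-3)
```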

But supporting the ATSM DPAS instruction has been blocked by other issues.

Here are the blocking issues as I see them:

  1. Triton XPU has already switched to the OCL interface for DPAS, and that interface only supports sub-group-size=8 for the ATSM DPAS.
  2. We have never supported sub-group-size=8 with the packed i16 dtype for the A operand. It requires different layout support in the Triton code: for the 8x16xf16 A matrix, each SIMT value covers a distinct part of the tile (in the attached image, each color represents one SIMT value; see the packing sketch after this list).
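
As a rough illustration of the packing in item 2, the sketch below distributes an 8x16 fp16 A tile across a sub-group of 8 lanes, with each lane holding one row as eight packed 32-bit values; the exact lane-to-row mapping used by the hardware is an assumption made for illustration:

```python
import numpy as np

SUB_GROUP_SIZE = 8
M, K = 8, 16  # fp16 DPAS A tile: repeat count 8, systolic depth 8 x 2 elements

A = np.arange(M * K, dtype=np.float16).reshape(M, K)

# Pack every pair of adjacent fp16 values into one 32-bit word,
# giving 8 packed words per 16-element row.
packed = A.view(np.uint32)  # shape (8, 8)

# Assumed distribution: lane i carries row i as a vector of 8 x i32.
per_lane = [packed[lane] for lane in range(SUB_GROUP_SIZE)]
print(per_lane[0])  # the 8 packed words held by lane 0
```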

We used the GenISA intrinsic and the different layout to test this issue, which requires extra changes in Triton XPU.

I think we need to support the ATSM DPAS through the public OCL interface first.

alexbaden commented 6 months ago

My mistake, I thought ATSM was the codename for the Data Center Flex GPU series (Arctic Sound). I can confirm these matmul tests are still failing on A770. Is there a ticket for supporting DPAS via the public OCL interface? And can you link to the issue or pull request where we switched PVC to use the OCL interface for DPAS?

alexbaden commented 6 months ago

All float16/bfloat16 tests from test_matmul are failing:

python/triton/compiler/compiler.py:374: RuntimeError
===================================================================================================================================== short test summary info ======================================================================================================================================
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-False-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-True-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-False-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-True-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-False-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-True-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-False-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-True-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
====================================================================================================================== 8 failed, 806 passed, 4 skipped in 5005.86s (1:23:25) =======================================================================================================================

The error from the L0 driver is ZE_RESULT_ERROR_INVALID_KERNEL_NAME = 0x78000011 ("[Validation] kernel name is not found in the module"), which is the same error as in some of the tests in #903.

chengjunlu commented 6 months ago

> My mistake, I thought ATSM was the codename for the Data Center Flex GPU series (Arctic Sound). I can confirm these matmul tests are still failing on A770. Is there a ticket for supporting DPAS via the public OCL interface? And can you link to the issue or pull request where we switched PVC to use the OCL interface for DPAS?

I think it is a Triton XPU issue, as we agreed to use sub-group-size=8 for DPAS on A770. A new issue, https://github.com/intel/intel-xpu-backend-for-triton/issues/991, has been opened to track this.

etiotto commented 5 months ago

On ATSM we should use 8 threads per warp (rather than 16) when using DPAS instructions, because the OpenCL functions we eventually need to use only support 8 threads per warp on ATSM (16 is supported on PVC). Triton codegen needs to be adapted.
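
A hypothetical sketch of that device-dependent choice, assuming the XPU backend exposes a threads_per_warp kernel option (the knob name and the device-name check below are assumptions, not confirmed API):

```python
def pick_threads_per_warp(device_name: str) -> int:
    # PVC (Data Center GPU Max) supports the 16-lane DPAS built-ins;
    # ATSM parts such as the A770 only support the 8-lane variants.
    return 16 if "Data Center GPU Max" in device_name else 8

# e.g. kernel[grid](..., threads_per_warp=pick_threads_per_warp(name))
```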