The GEMM test case with the fp16-fp16-fp16-fp16 type on the DPAS path passes on PVC, but it fails on the ATSM platform.
Set it to P2 priority.
@alexbaden can we close this ticket?
I am looking at an A770, not ATS. Without more details from @chengjunlu about which specific test failed, I cannot determine if this ticket is still a problem.
I am using the ATSM device: name='Intel(R) Arc(TM) A770 Graphics'.
We hit an accuracy issue in the fp16 test cases, for example test_op[16-64-64-1-1-2-None-None-None-False-False-float16-float16-None-True-None-None] in test_matmul.py:
E Mismatched elements: 110 / 1024 (10.7%)
E Greatest absolute difference: 0.126953125 at index (5, 18) (up to 1e-05 allowed)
E Greatest relative difference: 0.49755859375 at index (5, 18) (up to 0.001 allowed)
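For reference, a minimal sketch of the kind of accuracy check that produces a report like the one above, assuming PyTorch with fp16 matmul support; here a plain torch fp16 matmul stands in for the Triton kernel under test, and the tolerances mirror the quoted ones (atol=1e-5, rtol=1e-3):

```python
import torch

M, N, K = 16, 64, 64  # shape taken from the failing test id
a = torch.randn((M, K), dtype=torch.float16)
b = torch.randn((K, N), dtype=torch.float16)

out = torch.matmul(a, b)                          # fp16 path, standing in for the kernel under test
ref = torch.matmul(a.float(), b.float()).half()   # fp32 reference, cast back to fp16

# On failure this reports "Mismatched elements: ..." with the greatest
# absolute/relative difference, matching the output above.
torch.testing.assert_close(out, ref, atol=1e-5, rtol=1e-3)
```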
However, supporting the ATSM DPAS instruction is blocked by other issues.
Here is the blocking issue as I see it:
We use GenISA and a different layout for testing this issue, which requires extra changes in Triton XPU.
I think we first need to support ATSM DPAS through the public OCL interface.
My mistake, I thought ATSM was the codename for the Data Center Flex GPU series (Arctic Sound). I can confirm these matmul tests are still failing on A770. Is there a ticket for supporting DPAS via the public OCL interface? And can you link to the issue or pull request where we switched PVC to use the OCL interface for DPAS?
All float16/bfloat16 tests from test_matmul are failing:
python/triton/compiler/compiler.py:374: RuntimeError
=========================== short test summary info ===========================
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-False-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-True-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-False-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-True-float16-float16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-False-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-False-True-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-False-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
FAILED python/test/unit/operators/test_matmul.py::test_op[128-256-64-1-8-3-256-512-160-True-True-bfloat16-bfloat16-None-True-None-None] - RuntimeError: Triton Error [ZE]: 0x78000011
=============== 8 failed, 806 passed, 4 skipped in 5005.86s (1:23:25) ===============
The error from the L0 driver is ZE_RESULT_ERROR_INVALID_KERNEL_NAME = 0x78000011 ("[Validation] kernel name is not found in the module"), which is the same error as in some of the tests in #903.
I think it is a Triton XPU issue, as we agreed to use sub-group-size=8 for DPAS on A770. A new issue, https://github.com/intel/intel-xpu-backend-for-triton/issues/991, tracks this.
On ATSM we should use 8 threads per warp for DPAS instructions rather than 16, because the OpenCL functions we eventually need to use support 8 threads per warp on ATSM (16 is supported on PVC). Triton codegen needs to be adapted.
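A hypothetical illustration of what this looks like at the launch level, assuming the backend exposes a `threads_per_warp` kernel option (the option name and the exact DPAS mapping here are assumptions, not the actual codegen fix):

```python
import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # One program computes a single BLOCK x BLOCK tile: C = A @ B.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    c = tl.dot(a, b)  # the op that should lower to DPAS on supported hardware
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)

# Hypothetical launch: 8 lanes per warp for ATSM DPAS, 16 for PVC.
# dot_kernel[(1,)](a, b, c, BLOCK=16, threads_per_warp=8)
```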
When lowering tt.dot to DPAS with the fp16 D type, the DPAS results are incorrect. The DPAS op in the MLIR with the GenX dialect:
%23884 = genx.matrix.dpas %23613, %23165, %23365 {pa = #genx.precision_type<FP16>, pb = #genx.precision_type<FP16>, rc = 8 : i32} : (vector<8xf16>, vector<8xi32>, vector<8xi32>) -> vector<8xf16> loc(#loc43)
The DPAS op in the LLVM IR:
%2494 = call <8 x half> @llvm.genx.GenISA.sub.group.dpas.v8f16.v8f16.v8i32.v8i32(<8 x half> %2373, <8 x i32> %2246, <8 x i32> %2314, i32 10, i32 10, i32 8, i32 8, i8 0) #1, !dbg !2212
Fall back the fp16-fp16-fp16-fp16 tt.dot to FMA as a workaround. The issue needs further debugging.
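For clarity, a minimal NumPy sketch of what the FMA fallback computes: the tt.dot tile is expanded into per-k multiply-add steps with an fp16 accumulator (the D type) instead of a single DPAS instruction. This illustrates the semantics only, not the backend's actual lowering:

```python
import numpy as np

def dot_fma_fp16(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """fp16-fp16-fp16-fp16 dot expanded into multiply-add (FMA) steps."""
    M, K = a.shape
    _, N = b.shape
    c = np.zeros((M, N), dtype=np.float16)  # fp16 accumulator (the D type)
    for k in range(K):
        # One rank-1 multiply-add per k step instead of a DPAS instruction.
        c += a[:, k:k + 1] * b[k:k + 1, :]
    return c

a = np.random.randn(16, 64).astype(np.float16)
b = np.random.randn(64, 64).astype(np.float16)
ref = (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)
print(np.abs(dot_fma_fp16(a, b) - ref).max())  # fp16 accumulation error vs fp32 reference
```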