Open Egor-Krivov opened 2 months ago
A770 does not support DPAS 16, so the kernel is likely a fully unrolled loop.
What's our timeline for supporting fast GEMM on A770?
What does a fast GEMM on the A770 mean? It doesn't have DPAS, so it will lag behind. Do you mean efficiency?
I think the current performance is lower than could be expected. Here is another GEMM benchmark (in milliseconds) comparing our Triton matmul implementation against IPEX torch (oneDNN). We get about a 100x slowdown when using Triton vs. IPEX torch.
Torch does not use Triton for GEMM, neither for XPU nor for CUDA. There is an existing, performant solution for GEMM in PyTorch on A770. Why do we need Triton to be competitive? We have a line of sight to very good GEMM performance on hardware with DPAS instructions. On A770, we would effectively be starting over. What consumer demand justifies such resource-intensive work?
@alexbaden as per @whitneywhtsang's comments in the issue
DPAS8 is supported via a different OpenCL built-in.
When I run the GEMM benchmark on A770 I get about 0.3 TFLOPs, while 1550 can get about 250 TFLOPs.
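For anyone converting between the millisecond timings and the TFLOPs figures quoted here, a minimal sketch of the arithmetic follows. It assumes the usual convention of counting 2*M*N*K floating-point operations per matmul; the function names and the 4096 shape are hypothetical and are not taken from the benchmark script.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul in TFLOPs.

    Assumes 2*m*n*k floating-point operations (one multiply and one add
    per inner-product term), which is the common GEMM convention.
    """
    flops = 2.0 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12


def throughput_ratio(fast_tflops: float, slow_tflops: float) -> float:
    """How many times faster the first figure is than the second."""
    return fast_tflops / slow_tflops


if __name__ == "__main__":
    # Hypothetical shape and timing, for illustration only.
    print(f"{gemm_tflops(4096, 4096, 4096, 1000.0):.3f} TFLOPs")
    # Ratio between the two figures quoted in this comment (250 vs 0.3 TFLOPs).
    print(f"{throughput_ratio(250.0, 0.3):.0f}x")
```

With the two numbers above, the gap between the 1550 and the A770 comes out to roughly 800x, which is the gap this issue is asking about.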
Performance table:
File with the Triton cache from the run (the cache is in the cache folder): benchmark-reports (6).zip
My run, just in case: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10215632110/job/28265440574