intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[GEMM perf] Poor GEMM performance on A770 #1765

Open Egor-Krivov opened 2 months ago

Egor-Krivov commented 2 months ago

When I run the GEMM benchmark on the A770 I get about 0.3 TFLOPS, while the Max 1550 gets about 250 TFLOPS.

Performance table: (image attachment)
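For scale, TFLOPS numbers like these follow directly from kernel time: a single M×K by K×N matmul performs 2·M·N·K floating-point operations. A minimal sketch of the conversion (the 4096³ shape and the ~460 ms timing are illustrative assumptions, not numbers from the table):

```python
# Hedged sketch: convert a GEMM timing into TFLOPS.
# The shapes and the example timing below are assumptions for illustration.
def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    # One M x K @ K x N matmul performs 2*M*N*K floating-point ops.
    return (2 * m * n * k) / (ms * 1e-3) * 1e-12

# e.g. a 4096^3 GEMM taking ~460 ms comes out to ~0.3 TFLOPS.
print(gemm_tflops(4096, 4096, 4096, 460.0))
```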

Triton cache from the run (inside the cache folder): benchmark-reports (6).zip

My run, just in case: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10215632110/job/28265440574

alexbaden commented 2 months ago

The A770 does not support DPAS 16, so the kernel is likely lowered to a fully unrolled loop.
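For reference, the hot loop of a typical Triton GEMM is a tl.dot inside a K-loop; on DPAS-capable hardware that dot lowers to matrix instructions, and without them the backend has to emit plain FMAs. A minimal sketch of such a kernel (the block sizes, the no-masking/divisible-shapes assumption, and the fp32 accumulator are illustrative choices, not the benchmark's actual kernel):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # Assumes M, N, K are divisible by the block sizes, so no masks needed.
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        # This dot is the operation that maps to DPAS where available.
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```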

Egor-Krivov commented 2 months ago

What's our timeline for supporting fast GEMM on A770?

aregm commented 2 months ago

> What's our timeline for supporting fast GEMM on A770?

What does a "fast GEMM" on the 770 mean? It doesn't have DPAS, so it will lag behind. Do you mean efficiency?

Egor-Krivov commented 2 months ago

I think the current performance is lower than could reasonably be expected. Here is another GEMM benchmark (in milliseconds) comparing our Triton matmul implementation against IPEX torch (oneDNN). We see roughly a 100x slowdown when using Triton vs. IPEX torch. (image attachment)
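A minimal sketch of how such a comparison can be timed with triton.testing.do_bench (the shapes, dtype, and the triton_matmul launcher are assumptions; on XPU, torch.matmul dispatches to oneDNN through IPEX):

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the xpu device)
import triton

# Illustrative single shape; the benchmark in the table sweeps many sizes.
M = N = K = 4096
a = torch.randn(M, K, device="xpu", dtype=torch.float16)
b = torch.randn(K, N, device="xpu", dtype=torch.float16)

# oneDNN-backed matmul via IPEX; do_bench returns a time in milliseconds.
torch_ms = triton.testing.do_bench(lambda: torch.matmul(a, b))
print(f"torch/oneDNN: {torch_ms:.2f} ms")

# A Triton implementation under test would be timed the same way, e.g.:
#   triton_ms = triton.testing.do_bench(lambda: triton_matmul(a, b))
# where triton_matmul is a hypothetical launcher around a kernel like
# the sketch above.
```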

alexbaden commented 2 months ago

Torch does not use Triton for GEMM, neither for XPU nor for CUDA, so there is already an existing, performant GEMM solution in PyTorch on the A770. Why does Triton need to be competitive there? We have line of sight to very good GEMM performance on hardware with DPAS instructions; on the A770 we would effectively be starting over. What is the consumer demand that justifies such resource-intensive work?

vlad-penkin commented 2 months ago

@alexbaden, as per @whitneywhtsang's comments in the issue:

> DPAS8 is supported via a different OpenCL built-in.