intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

Performance difference for GEMM & flash-attention between reported and reproduced in CI #1470

Open Egor-Krivov opened 2 months ago

Egor-Krivov commented 2 months ago

The current CI runs the POC branches perf_attn (flash attention) and triton_perf_poc (GEMM) via the provided run_all.sh scripts. A custom libigc is installed: libigc1_1.0.24994.16243-igc+releaseinternal1_amd64.deb. However, there is a significant performance difference between the reproduced and the reported results.

For example, for flash attention the CI reproduction gets about 2.5 TFLOPS for Z:1, H:32, N_CTX:16384, D_HEAD:64, while the reported value is close to 81 TFLOPS. A similar difference exists for the GEMM results.

Results can be accessed as CI artefacts here (including raw summary files from run_all.sh): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9637082068

Egor-Krivov commented 2 months ago

Results from artefacts (raw summary):

GEMM:

B, M, K, N, avg_tflops, avg_gbs, max_tflops, max_gbs, min_tflops, min_gbs
1, 4096, 4096, 4096, 261.394, 255.267, 286.905, 280.181, 188.005, 183.598
1, 8192, 8192, 8192, 31.9581, 31.2091, 37.9213, 37.0325, 26.6058, 25.9822
1, 1, 5120, 13824, 434.701, 424.512, 451.389, 440.81, 424.404, 414.457
1, 1024, 28672, 8192, 69.5951, 67.964, 80.0478, 78.1717, 58.9684, 57.5864
1, 3072, 4096, 3072, 381.471, 372.53, 413.773, 404.076, 305.692, 298.527
1, 4, 4096, 12288, 601.675, 587.573, 623.815, 609.194, 587.547, 573.776
1, 512, 8192, 8192, 445.235, 434.8, 469.909, 458.895, 338.586, 330.651
1, 512, 8192, 32768, 121.568, 118.718, 128.496, 125.484, 107.995, 105.464
1, 512, 32768, 8192, 121.461, 118.614, 126.639, 123.671, 98.328, 96.0234
1, 16384, 8192, 1024, 130.514, 127.455, 144.344, 140.961, 120.713, 117.884
1, 16384, 1024, 8192, 112.273, 109.642, 124.204, 121.293, 97.8352, 95.5422
1, 16384, 8192, 4096, 31.5405, 30.8013, 36.2766, 35.4264, 21.5012, 20.9972
1, 16384, 4096, 8192, 31.7082, 30.9651, 37.5205, 36.6411, 27.774, 27.123
1, 4096, 16384, 8192, 30.2546, 29.5455, 34.0357, 33.238, 27.2688, 26.6297
1, 8192, 16384, 4096, 32.4939, 31.7323, 34.2256, 33.4234, 30.157, 29.4502
1, 1024, 16384, 8192, 130.821, 127.755, 142.999, 139.647, 114.839, 112.147
1, 8192, 16384, 1024, 130.609, 127.548, 143.957, 140.583, 111.834, 109.212
4096, 8, 128, 16384, 6.06581, 5.92364, 6.07427, 5.93191, 6.05292, 5.91105
4096, 8, 16384, 128, 6.16977, 6.02517, 6.24209, 6.0958, 6.07978, 5.93728
4, 32768, 128, 4096, 64.0826, 62.5806, 64.2622, 62.7561, 63.9322, 62.4338
4, 32768, 4096, 128, 81.9609, 80.04, 82.6432, 80.7063, 78.0406, 76.2116
32, 4096, 4096, 128, 73.6363, 71.9105, 75.1854, 73.4233, 72.7529, 71.0478
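
For orientation, here is a minimal sketch of how TFLOPS figures like these are typically derived from a measured runtime, assuming the conventional 2·B·M·N·K FLOP count for a batched GEMM; the exact formulas used for these columns are defined in run_all.sh itself, and the fp16 operand size and traffic model below are assumptions.

```python
# Minimal sketch: derive TFLOPS from a measured time for one row of the table
# above, assuming the conventional 2*B*M*N*K FLOP count for a batched GEMM.
# The approximate GB/s below assumes fp16 operands and simple A/B/C traffic;
# run_all.sh defines its own avg_gbs, which this naive estimate need not match.
def gemm_metrics(B, M, K, N, time_ms, bytes_per_elem=2):
    flops = 2.0 * B * M * N * K
    traffic = bytes_per_elem * B * (M * K + K * N + M * N)
    seconds = time_ms * 1e-3
    return flops / seconds / 1e12, traffic / seconds / 1e9  # TFLOPS, GB/s

# Example: the 1, 4096, 4096, 4096 row at 261.394 avg TFLOPS corresponds to
# a kernel time of roughly 0.526 ms.
tflops, gbs = gemm_metrics(1, 4096, 4096, 4096, time_ms=0.526)
print(f"{tflops:.1f} TFLOPS, ~{gbs:.1f} GB/s (naive traffic estimate)")
```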

Attention:

Z, H, N_CTX, D_HEAD, avg_tflops, max_tflops, min_tflops
4, 48, 1024, 64, 103.074, 109.864, 93.3959
32, 32, 512, 64, 54.7735, 55.158, 53.3934
16, 32, 1024, 64, 36.2332, 39.1543, 32.6167
8, 32, 2048, 64, 18.5404, 20.9838, 16.9378
4, 32, 4096, 64, 9.80752, 10.3433, 9.44502
2, 32, 8192, 64, 4.92127, 5.05735, 4.71036
1, 32, 16384, 64, 2.28198, 2.46346, 2.16441
32, 16, 512, 128
16, 16, 1024, 128
8, 16, 2048, 128
4, 16, 4096, 128
2, 16, 8192, 128
1, 16, 16384, 128
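
For context on these TFLOPS figures (and the 2.5 vs. 81 TFLOPS gap cited above), here is a minimal sketch of the conventional forward attention FLOP count, i.e. two N_CTX x N_CTX matmuls per batch and head, as in Triton's fused-attention tutorial; whether run_all.sh applies a causal 0.5 factor is not shown in the summary, so it is left as a parameter here.

```python
# Minimal sketch: conventional forward FLOP count for attention, counting the
# QK^T and PV matmuls (2*Z*H*N_CTX*N_CTX*D_HEAD flops each). Whether the
# benchmark applies a causal 0.5 factor is an assumption, so it is a parameter.
def attention_tflops(Z, H, N_CTX, D_HEAD, time_ms, causal=False):
    flops_per_matmul = 2.0 * Z * H * N_CTX * N_CTX * D_HEAD
    total_flops = 2 * flops_per_matmul * (0.5 if causal else 1.0)
    return total_flops / (time_ms * 1e-3) / 1e12

# The Z=1, H=32, N_CTX=16384, D_HEAD=64 case is ~2.2 TFLOP of work, so the
# CI's 2.28 avg TFLOPS implies ~964 ms per run, vs ~27 ms at the reported
# ~81 TFLOPS: roughly a 35x difference in runtime.
print(attention_tflops(1, 32, 16384, 64, time_ms=964))  # ~2.28
```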
ESI-SYD commented 2 months ago

1. I can reproduce the reported numbers on a local IDC (MAX 1550) server for both cases with:
   - pytorch-gpu-dev-0.5.1
   - gfx-driver-ci-comp_igc-24994
2. I also get similar data when switching to the custom libigc1_1.0.24994.16243-igc+releaseinternal1_amd64.deb.
3. The CI workflow looks good; I found no big difference in the dumped IR and ASM files between the CI data and my local ones.

Is this machine dedicated? I see there are 2 GPU cards with a max frequency of 1600. Can you check their health using something like: xpu-smi health -l or xpu-smi dump -d 0 -m 0,1,2 -i 1 -n 5

Egor-Krivov commented 2 months ago

> 1. I can reproduce the reported numbers on a local IDC (MAX 1550) server for both cases with:
>    - pytorch-gpu-dev-0.5.1
>    - gfx-driver-ci-comp_igc-24994
> 2. I also get similar data when switching to the custom libigc1_1.0.24994.16243-igc+releaseinternal1_amd64.deb.
> 3. The CI workflow looks good; I found no big difference in the dumped IR and ASM files between the CI data and my local ones.
>
> Is this machine dedicated? I see there are 2 GPU cards with a max frequency of 1600. Can you check their health using something like: xpu-smi health -l or xpu-smi dump -d 0 -m 0,1,2 -i 1 -n 5

The GPU is dedicated to a single job, but the machine can be shared between jobs, so the CPU could hypothetically become the bottleneck. Could the CPU be the bottleneck here?
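
One way to probe the CPU-bottleneck hypothesis is to compare device-side kernel time against end-to-end wall time; a large gap points at host-side overhead. A minimal sketch, assuming an XPU-enabled PyTorch build (torch.xpu.synchronize and the "xpu" device string); the matmul workload is illustrative, not the actual benchmark:

```python
import time
import torch
import triton.testing

# Illustrative stand-in workload (not the actual benchmark kernels).
a = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)
b = torch.randn(4096, 4096, device="xpu", dtype=torch.float16)

# Device-side time per launch, as measured by Triton's benchmarking helper (ms).
dev_ms = triton.testing.do_bench(lambda: torch.matmul(a, b))

# End-to-end wall time per launch, including Python/driver/scheduling overhead.
n = 100
torch.xpu.synchronize()
t0 = time.perf_counter()
for _ in range(n):
    torch.matmul(a, b)
torch.xpu.synchronize()
wall_ms = (time.perf_counter() - t0) * 1e3 / n

# If wall_ms is much larger than dev_ms, host (CPU) overhead dominates.
print(f"device: {dev_ms:.3f} ms/iter, wall: {wall_ms:.3f} ms/iter")
```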

ESI-SYD commented 1 month ago

I think the CPU should have limited impact.

The latest build failed to benchmark GEMM and attention (llvm-target): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9830987255

But it cannot be reproduced on another branch (a mirror branch): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9833038783

We need to double-confirm this.

Also, can you prepare a QuickBuild version on the runner to see if it has an impact?

ESI-SYD commented 1 month ago

These benchmarks on the POC branches will not be merged into the main branch directly; we are integrating them into the Triton benchmark suite (like softmax) now.

Related:
https://github.com/intel/intel-xpu-backend-for-triton/pull/1539 https://github.com/intel/intel-xpu-backend-for-triton/pull/1597

pbchekin commented 1 month ago

Revisit after integrating the POC branches into the default one.

ESI-SYD commented 1 month ago

Depends on

Dewei-Wang-sh commented 1 month ago

Since #1450 is closed, please check that there is no longer a perf diff. @ESI-SYD @LiyangLingIntel

Egor-Krivov commented 2 weeks ago

I've checked the latest GEMM results in CI after all the merges, and the GEMM results are now similar to what we had in the perf report for 03.06.2024.

So we no longer have a gap for the GEMM results.

Only FA remains, blocked by https://github.com/intel/intel-xpu-backend-for-triton/issues/1451

ESI-SYD commented 1 week ago

Update: GEMM is done. FlashAttention is blocked by its unavailability on the llvm-target branch:

Assertion `detail::isPresent(Val) && "dyn_cast on a non-existent value"' failed.

Tracked in: https://github.com/intel/intel-xpu-backend-for-triton/issues/1758

ESI-SYD commented 3 days ago

Update: Triton's FlashAttention is WIP (#1758); XeTLA's FlashAttention is done (#1877).