Closed whitneywhtsang closed 4 months ago
Looks like it is a measurement issue, not a regression. The following tables show the standard deviation over 50 runs with a 10 s cool-down time, on the same machine with the same GPU and CPU.
N      Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max
256      33.113265   33.672442        67.900646       60.869305        50.471763       39.436989
1024     13.138909  128.768036        30.836839      106.255800        16.055949       17.980038
2048     44.865724   43.278793        48.657056       34.661737        11.104452       49.487229
4096     12.344805   27.195861        15.832075       15.758128         3.033624       12.550898
8192     10.715364   15.961425         6.229135       15.582343        28.955786       33.145384
16384    12.236177   10.630957         4.358528        5.668696        17.457486        7.491495
32768     7.133568   11.919674         5.517027        6.276609         3.139018        5.038861

N      Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max
256         0.033113      0.033673           0.067901          0.060869           0.050472          0.039437
1024        0.013139      0.128768           0.030837          0.106256           0.016056          0.017980
2048        0.044866      0.043279           0.048657          0.034662           0.011105          0.049487
4096        0.012345      0.027196           0.015832          0.015758           0.003034          0.012551
8192        0.010715      0.015961           0.006229          0.015582           0.028956          0.033145
16384       0.012236      0.010631           0.004359          0.005669           0.017457          0.007492
32768       0.007134      0.011920           0.005517          0.006277           0.003139          0.005039
Reassigning to @chengjunlu to investigate why results for the same test can be so different, specifically for N = 1024 and 2048.
The standard deviation alone doesn't tell us whether the distributions are tightly packed or not. But the standard deviation of the bandwidth (Triton-GB/s) is more stable for the cases N=4096, 8192, 16384. I will add the coefficient of variation to check whether there is too much variance in the micro-benchmark. (Approximating the mean with the middle value, the CV is about 0.054 for N=2048, which seems OK.)
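The coefficient of variation mentioned above is simply the standard deviation divided by the mean, which makes the spread comparable across the very different bandwidth scales at each N. A minimal sketch (the sample values below are made up for illustration; the real benchmark takes 50 runs):

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean; unitless, comparable across N."""
    mean = statistics.fmean(samples)
    return statistics.stdev(samples) / mean

# Hypothetical bandwidth samples (GB/s) for one (kernel, N) point.
samples = [820.0, 835.5, 841.2, 812.7, 850.1]
print(f"CV = {coefficient_of_variation(samples):.4f}")
```

A CV well below ~0.1 (as with the estimated 0.054 for N=2048) usually indicates the benchmark noise is tolerable.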
For the original performance regression reported in this issue, the results went from

softmax-performance:
       N  Triton-GB/s  XeTLA-GB/s
0  256.0   873.813292  794.375734

to

softmax-performance:
       N  Triton-GB/s  XeTLA-GB/s
0  256.0   689.852662  771.011768
I met the same issue in my testing. It is caused by the Triton benchmark using the SYCL barrier during auto-tuning, which chooses a sub-optimal configuration for the case N=256.
In conclusion, my next steps:

Double-check the performance regression for the N=256 case. The regression is reproduced on the PVC 1550 platform:
Triton autotuning for function softmax_kernel finished after 2.17s; best config selected: num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None;
softmax-performance:
       N      Triton  Triton-min   Triton-max
0  256.0  873.813292  819.200021  1092.266694

Triton autotuning for function softmax_kernel finished after 2.07s; best config selected: num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None;
softmax-performance:
       N      Triton  Triton-min  Triton-max
0  256.0  639.375598  609.637189  672.164151
The configuration with num_warps=8 is sub-optimal, and its performance is similar to the one reported in this issue. So there is a chance the test picked a sub-optimal configuration, rather than a real performance regression from the code change.
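This failure mode can be sketched with a toy simulation (illustrative numbers only, not Triton's actual autotuner): when the measurement noise is comparable to the true gap between two configurations, a single noisy timing per configuration will sometimes rank them the wrong way.

```python
import random

random.seed(0)

# Hypothetical per-config true mean kernel times (ms); num_warps=4 is truly
# faster, but the gap (0.03 ms) is smaller than the measurement noise.
configs = {"num_warps=4": 0.30, "num_warps=8": 0.33}
noise = 0.05  # stddev of one timing sample, in ms (made-up value)

def measure(mean):
    # One noisy timing sample, as a single autotuning run would observe.
    return max(0.0, random.gauss(mean, noise))

trials = 1000
wrong_picks = 0
for _ in range(trials):
    samples = {name: measure(mean) for name, mean in configs.items()}
    best = min(samples, key=samples.get)  # pick the config that timed fastest
    if best != "num_warps=4":
        wrong_picks += 1

print(f"sub-optimal config chosen in {wrong_picks}/{trials} trials")
```

With noise at this level, the sub-optimal configuration wins a substantial fraction of the trials, which is consistent with the same test run sometimes landing on num_warps=8.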
Create a new issue to track the variance issue: https://github.com/intel/intel-xpu-backend-for-triton/issues/1566
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9456830167:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9473973906: