Closed whitneywhtsang closed 4 months ago
Looks like it is a measurement issue, not a regression. The following tables show the standard deviation over 50 runs with a 10 s cool-down time, on the same machine with the same GPU and CPU.
N      Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max
256      33.113265   33.672442        67.900646       60.869305        50.471763       39.436989
1024     13.138909  128.768036        30.836839      106.255800        16.055949       17.980038
2048     44.865724   43.278793        48.657056       34.661737        11.104452       49.487229
4096     12.344805   27.195861        15.832075       15.758128         3.033624       12.550898
8192     10.715364   15.961425         6.229135       15.582343        28.955786       33.145384
16384    12.236177   10.630957         4.358528        5.668696        17.457486        7.491495
32768     7.133568   11.919674         5.517027        6.276609         3.139018        5.038861

N      Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max
256         0.033113      0.033673           0.067901          0.060869           0.050472          0.039437
1024        0.013139      0.128768           0.030837          0.106256           0.016056          0.017980
2048        0.044866      0.043279           0.048657          0.034662           0.011105          0.049487
4096        0.012345      0.027196           0.015832          0.015758           0.003034          0.012551
8192        0.010715      0.015961           0.006229          0.015582           0.028956          0.033145
16384       0.012236      0.010631           0.004359          0.005669           0.017457          0.007492
32768       0.007134      0.011920           0.005517          0.006277           0.003139          0.005039
Reassigning to @chengjunlu to investigate why results for the same test can be so different, specifically for N = 1024 and 2048.
The standard deviation alone doesn't tell us whether the distributions are tightly packed or not. But the standard deviation of the bandwidth (Triton-GB/s) is more stable for the cases N=4096, 8192, 16384. I will add the coefficient of variation to check whether there is too much variance in the micro-benchmark. (Approximating the mean with the middle value, the CV is about 0.054 for N=2048, which seems OK.)
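The coefficient of variation mentioned above is simply the standard deviation divided by the mean, which makes the spread comparable across the very different bandwidth scales at each N. A minimal sketch (the sample values below are made up for illustration; the real benchmark takes 50 runs):

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean; unitless, comparable across N."""
    mean = statistics.fmean(samples)
    return statistics.stdev(samples) / mean

# Hypothetical bandwidth samples (GB/s) for one (kernel, N) point.
samples = [820.0, 835.5, 841.2, 812.7, 850.1]
print(f"CV = {coefficient_of_variation(samples):.4f}")
```

A CV well below ~0.1 (as with the estimated 0.054 for N=2048) usually indicates the benchmark noise is tolerable.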
For the original performance regression reported in this issue, the results went from

softmax-performance:
       N  Triton-GB/s  XeTLA-GB/s
0  256.0   873.813292  794.375734

to

softmax-performance:
       N  Triton-GB/s  XeTLA-GB/s
0  256.0   689.852662  771.011768
I met the same issue in my testing. It is caused by the Triton benchmark using the SYCL barrier during auto-tuning, which chooses a sub-optimal configuration for the case N=256.
In conclusion, my next steps:

Double-check the performance regression for the N=256 case. The regression is reproduced on the PVC 1550 platform:
Triton autotuning for function softmax_kernel finished after 2.17s; best config selected: num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None;
softmax-performance:
       N      Triton  Triton-min   Triton-max
0  256.0  873.813292  819.200021  1092.266694

Triton autotuning for function softmax_kernel finished after 2.07s; best config selected: num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None;
softmax-performance:
       N      Triton  Triton-min  Triton-max
0  256.0  639.375598  609.637189  672.164151
The configuration with num_warps=8 is sub-optimal, and its performance is similar to the one reported in this issue. So there is a chance the test picked a sub-optimal configuration, rather than a real performance regression from the code change.
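This failure mode can be sketched with a toy simulation (illustrative numbers only, not Triton's actual autotuner): when the measurement noise is comparable to the true gap between two configurations, a single noisy timing per configuration will sometimes rank them the wrong way.

```python
import random

random.seed(0)

# Hypothetical per-config true mean kernel times (ms); num_warps=4 is truly
# faster, but the gap (0.03 ms) is smaller than the measurement noise.
configs = {"num_warps=4": 0.30, "num_warps=8": 0.33}
noise = 0.05  # stddev of one timing sample, in ms (made-up value)

def measure(mean):
    # One noisy timing sample, as a single autotuning run would observe.
    return max(0.0, random.gauss(mean, noise))

trials = 1000
wrong_picks = 0
for _ in range(trials):
    samples = {name: measure(mean) for name, mean in configs.items()}
    best = min(samples, key=samples.get)  # pick the config that timed fastest
    if best != "num_warps=4":
        wrong_picks += 1

print(f"sub-optimal config chosen in {wrong_picks}/{trials} trials")
```

With noise at this level, the sub-optimal configuration wins a substantial fraction of the trials, which is consistent with the same test run sometimes landing on num_warps=8.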
Create a new issue to track the variance issue: https://github.com/intel/intel-xpu-backend-for-triton/issues/1566
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9456830167:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9473973906: