intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

Fine tune sub-group transpose bank conflict prevention for PVC #2797

Open victor-eds opened 11 hours ago

victor-eds commented 11 hours ago

As of now, sub-group transpose bank conflict prevention inserts one padding element (marked X below) after every 16 elements, giving a row stride of 17 elements (sub-group size (16) + 1), to avoid bank conflicts:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
...

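This is too conservative and can be greatly improved by taking PVC's SLM configuration for parallel accesses into account: 64 banks, each providing access to 8 B, i.e. 64 × 8 B = 512 B per pass over all banks. So the ideal mechanism would pad only once per 512 B of data.

Assuming fp32 elements (128 elements per 512 B row, followed by 2 padding elements = 8 B):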

0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
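To sanity-check the two layouts, here is a minimal sketch in Python. It rests on a simplified model that is an assumption, not something confirmed for PVC hardware: the bank of an access is `(byte_address // 8) % 64`, and in a transposed read each work-item reads one fp32 element from its own row at the same column. The function name `max_bank_conflict` is hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 64          # from the issue text: 64 banks
BANK_WIDTH_BYTES = 8    # from the issue text: 8 B per bank


def max_bank_conflict(row_stride_elems, elem_bytes=4, sub_group_size=16, column=0):
    """Worst-case number of distinct 8 B words that land in the same bank.

    Assumed model: work-item `lane` reads element `column` of row `lane`,
    and rows are `row_stride_elems` elements apart. A result of 1 means
    the access is conflict-free under this model.
    """
    words_per_bank = defaultdict(set)
    for lane in range(sub_group_size):
        byte_addr = (lane * row_stride_elems + column) * elem_bytes
        word = byte_addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def padding_overhead(data_elems, pad_elems):
    """Fraction of SLM spent on padding for one padded row."""
    return pad_elems / (data_elems + pad_elems)
```

Under this model, an unpadded stride of 128 fp32 elements sends all 16 work-items to the same bank, a stride of 129 still leaves pairs of work-items colliding, and both the current stride of 17 and the proposed stride of 130 are conflict-free. The proposed layout spends 2/130 of the buffer on padding instead of 1/17.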

Again, for fp32, in terms of code:

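; Store untransposed
call spir_func void @intel_sub_group_block_write8(ptr addrspace(3) %ptr0, <8 x float> %data)
; Next row starts 128 data elements + 2 padding elements later
%ptr1 = getelementptr inbounds float, ptr addrspace(3) %ptr0, i32 130
; ...
; Load transposed
%vec0 = load <8 x float>, ptr addrspace(3) %ptrwi0
%ptrwi1 = getelementptr inbounds <8 x float>, ptr addrspace(3) %ptrwi0, i32 1
; ...
; Take into account empty elements: skip the last vector (8) plus the padding (2)
%ptrwi16 = getelementptr inbounds float, ptr addrspace(3) %ptrwi15, i32 10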
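The pointer arithmetic above can be checked with a short Python sketch of the padded index mapping. The names `padded_offset` and `vector_load_offsets` are hypothetical helpers for illustration only; the constants (128 data elements, 2 padding elements, vectors of 8) are taken from the fp32 example in this issue.

```python
ROW_ELEMS = 128                      # fp32 data elements per row (512 B)
PAD_ELEMS = 2                        # fp32 padding elements per row (8 B)
ROW_STRIDE = ROW_ELEMS + PAD_ELEMS   # 130, the offset used for %ptr1
VEC_ELEMS = 8                        # elements per <8 x float> load


def padded_offset(logical_index):
    """Map an index in the unpadded buffer to its offset in the padded one."""
    row, col = divmod(logical_index, ROW_ELEMS)
    return row * ROW_STRIDE + col


def vector_load_offsets(num_vectors):
    """Element offsets of consecutive <8 x float> loads by one work-item."""
    return [padded_offset(v * VEC_ELEMS) for v in range(num_vectors)]
```

With these definitions, consecutive vector loads within a row are 8 elements apart, and the step from the 16th vector (offset 120) to the 17th (offset 130) is 10 elements, which matches the `%ptrwi16` computation.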