Fine tune sub-group transpose bank conflict prevention for PVC

Currently, sub-group transpose bank conflict prevention leaves one empty slot (marked X below) after every 16 elements (the sub-group size), i.e. one slot in every 17 (16 + 1), to avoid bank conflicts:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
...
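
The current layout can be sketched as an index mapping. This is only an illustration of the diagram above, not the backend's actual code; the function name is hypothetical.

```python
# Sketch of the current padding scheme: one empty slot after every
# 16 elements (the sub-group size quoted in this issue).
SUB_GROUP_SIZE = 16


def padded_index_current(i: int) -> int:
    """Map logical element index i to its slot in the padded buffer.

    Hypothetical helper: every completed group of 16 elements shifts
    the following elements by one slot.
    """
    return i + i // SUB_GROUP_SIZE


# Fraction of extra slots relative to stored elements (1 per 16).
overhead_current = 1 / SUB_GROUP_SIZE
```

Under this mapping, element 16 lands in slot 17, and the padding costs one extra slot per 16 stored elements.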

This is more conservative than necessary and can be improved by taking PVC's SLM configuration for parallel accesses into account (64 banks, each providing access to 8 B). With X being the element size in bytes, the ideal mechanism would be:

Store (64 banks * 8 B/bank / X B/element) elements
Leave (1 bank * 8 B/bank / X B/element) empty spots
Store (64 banks * 8 B/bank / X B/element) elements
...
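
As a rough sketch, the row length and padding for a given element size follow directly from the two expressions above. The bank count and bank width are the values quoted in this issue; the function names are hypothetical, and the sketch assumes the element size divides the bank width.

```python
# Values quoted in this issue for PVC's SLM.
NUM_BANKS = 64
BANK_WIDTH_BYTES = 8


def row_layout(elem_bytes: int) -> tuple[int, int]:
    """Return (elements stored per row, empty spots after each row).

    Assumption: elem_bytes evenly divides the 8 B bank width.
    """
    if elem_bytes <= 0 or BANK_WIDTH_BYTES % elem_bytes != 0:
        raise ValueError("element size must divide the bank width")
    elems = NUM_BANKS * BANK_WIDTH_BYTES // elem_bytes
    pad = BANK_WIDTH_BYTES // elem_bytes
    return elems, pad


def padded_index(i: int, elem_bytes: int) -> int:
    """Map logical element index i to its slot in the padded buffer."""
    elems, pad = row_layout(elem_bytes)
    return i + (i // elems) * pad
```

For example, `row_layout(4)` gives 128 elements followed by 2 empty spots, and `row_layout(2)` gives 256 elements followed by 4.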

Assuming fp32 elements:

0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
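
A quick way to sanity-check the fp32 layout is to compute which bank each row starts in. The sketch below assumes that consecutive 8 B words map to consecutive banks, which this issue does not state explicitly; `bank_of` and `row_start_banks` are hypothetical names.

```python
NUM_BANKS = 64
BANK_WIDTH_BYTES = 8
ELEM_BYTES = 4   # fp32
ROW_ELEMS = 128  # 64 banks * 8 B / 4 B
ROW_PAD = 2      # 1 bank * 8 B / 4 B


def bank_of(slot: int) -> int:
    """Bank holding the given fp32 slot.

    Assumption: consecutive 8 B words are interleaved across banks.
    """
    return (slot * ELEM_BYTES // BANK_WIDTH_BYTES) % NUM_BANKS


def row_start_banks(row_stride: int, rows: int = NUM_BANKS) -> list[int]:
    """Banks hit by the first element of each row for a given stride."""
    return [bank_of(r * row_stride) for r in range(rows)]


padded = row_start_banks(ROW_ELEMS + ROW_PAD)  # stride of 130 elements
unpadded = row_start_banks(ROW_ELEMS)          # stride of 128 elements
```

Under that assumption, the padded stride shifts each row by exactly one bank, so 64 consecutive rows start in 64 different banks, whereas without padding every row starts in bank 0.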

Again, for fp32, in terms of code:

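; Store untransposed (one block write of 8 values per work-item = 128 elements)
call spir_func void @intel_sub_group_block_write8(ptr addrspace(3) %ptr0, <8 x float> %data)
; Next row starts after 128 stored elements + 2 empty spots
%ptr1 = getelementptr inbounds float, ptr addrspace(3) %ptr0, i32 130
; ...
; Load transposed
%vec0 = load <8 x float>, ptr addrspace(3) %ptrwi0
%ptrwi1 = getelementptr inbounds <8 x float>, ptr addrspace(3) %ptrwi0, i32 1
; ...
; Take into account empty elements: skip the 8 elements just loaded + 2 empty spots
%ptrwi16 = getelementptr inbounds float, ptr addrspace(3) %ptrwi15, i32 10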
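
The pointer arithmetic in the snippet above can be cross-checked with a small offset calculation. This is only a sketch of the fp32 case described in this issue; the helper names are hypothetical, and offsets are in float elements relative to `%ptr0` and `%ptrwi0` respectively.

```python
VEC_ELEMS = 8    # <8 x float> per load
ROW_ELEMS = 128  # elements stored before the empty spots
ROW_PAD = 2      # empty spots after each row (fp32)


def store_offset(k: int) -> int:
    """Offset of the k-th block write, i.e. %ptr0, %ptr1, ..."""
    return k * (ROW_ELEMS + ROW_PAD)


def load_offsets(num_loads: int) -> list[int]:
    """Offsets of successive <8 x float> loads: %ptrwi0, %ptrwi1, ..."""
    loads_per_row = ROW_ELEMS // VEC_ELEMS  # 16 vector loads per row
    offsets = []
    offset = 0
    for n in range(num_loads):
        offsets.append(offset)
        offset += VEC_ELEMS
        # After the last vector of a row, also skip the empty spots.
        if (n + 1) % loads_per_row == 0:
            offset += ROW_PAD
    return offsets
```

With these definitions, the step from the 16th to the 17th load is 10 elements (8 loaded + 2 empty), matching the final `getelementptr`, and the second store lands at offset 130.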
