The current padding scheme is too conservative and can be greatly improved by taking into account PVC's SLM configuration for parallel accesses (64 banks, each providing access to 8 B). The ideal mechanism would be:
Store (64 banks * 8 B/bank / X B/element) elements
Leave (1 bank * 8 B/bank / X B/element) empty spots
Store (64 banks * 8 B/bank / X B/element) elements
...
Assuming fp32 elements:
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
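The layout above can be sketched for any element size as follows. This is a minimal illustration rather than the backend's implementation: the helper names `padded_layout` and `padded_index` are hypothetical, and the sketch assumes the element size evenly divides the 8 B bank width.

```python
SLM_BANKS = 64         # from the text: PVC SLM has 64 banks
BANK_WIDTH_BYTES = 8   # from the text: each bank provides access to 8 B


def padded_layout(element_bytes):
    """Return (elements_per_row, padding_elements, row_stride) in elements.

    Hypothetical helper; assumes element_bytes evenly divides BANK_WIDTH_BYTES.
    """
    elements_per_row = SLM_BANKS * BANK_WIDTH_BYTES // element_bytes
    padding_elements = BANK_WIDTH_BYTES // element_bytes
    return elements_per_row, padding_elements, elements_per_row + padding_elements


def padded_index(logical_index, element_bytes):
    """Map an unpadded element index to its position in the padded buffer."""
    elements_per_row, padding_elements, _ = padded_layout(element_bytes)
    return logical_index + (logical_index // elements_per_row) * padding_elements


# fp32: 128 elements per row, 2 empty spots, row stride of 130 elements
print(padded_layout(4))
# The last element of row 0 stays at 127; the first element of row 1 moves to 130
print(padded_index(127, 4), padded_index(128, 4))
```

For fp32 this gives the 128 + 2 pattern shown in the diagram, and the row stride of 130 elements matches the pointer increment used when storing.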
Again, for fp32, in terms of code:
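; Store untransposed
call spir_func void @intel_sub_group_block_write8(ptr addrspace(3) %ptr0, <8 x float> %data)
; Advance by one padded row: 128 elements + 2 empty spots
%ptr1 = getelementptr inbounds float, ptr addrspace(3) %ptr0, i64 130
; ...
; Load transposed
%vec0 = load <8 x float>, ptr addrspace(3) %ptrwi0
%ptrwi1 = getelementptr inbounds <8 x float>, ptr addrspace(3) %ptrwi0, i64 1
; ...
; Take into account empty elements: 8 (vector just loaded) + 2 (empty spots)
%ptrwi16 = getelementptr inbounds float, ptr addrspace(3) %ptrwi15, i64 10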
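The pointer arithmetic of the transposed load can be checked with a short sketch. It assumes, as the snippet above suggests, that a work-item's first `<8 x float>` vector starts at the beginning of a padded row; the helper name `vector_offsets` is hypothetical.

```python
ROW_ELEMENTS = 128   # fp32 elements per row (64 banks * 8 B / 4 B)
PADDING = 2          # empty fp32 spots after each row (1 bank * 8 B / 4 B)
VECTOR_WIDTH = 8     # elements per <8 x float> load


def vector_offsets(num_vectors):
    """Element offset of each consecutive <8 x float> load, skipping the padding.

    Assumes the first vector starts at the beginning of a padded row.
    """
    vectors_per_row = ROW_ELEMENTS // VECTOR_WIDTH
    row_stride = ROW_ELEMENTS + PADDING
    return [
        (k // vectors_per_row) * row_stride + (k % vectors_per_row) * VECTOR_WIDTH
        for k in range(num_vectors)
    ]


offsets = vector_offsets(18)
# Distance in elements between consecutive vector start addresses
steps = [b - a for a, b in zip(offsets, offsets[1:])]
print(offsets[15], offsets[16])
print(steps)
```

Under that assumption, every step is 8 elements except the one from vector 15 to vector 16, which is 10 elements (8 for the vector just loaded plus the 2 empty spots).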
For reference, the current sub-group transpose bank conflict prevention leaves one empty item in every 17 (sub-group size of 16, plus 1) to avoid bank conflicts.