Closed chengjunlu closed 3 months ago
There is no performance improvement from removing the bitcast in the vectorized gather/scatter load/store.
IGC will shuffle the values to pack them into i32 for the memory access, which has the same effect as packing them explicitly.
# Use the non-packed type for vectorized store.
store <8 x half> %64, <8 x half> addrspace(1)* %66, align 16, !dbg !339
# The VISA shows the shuffles packing the values into V0101 and V0102 (aliases of V0099 and V0100).
.decl V0099 v_type=G type=d num_elts=64 align=hword
.decl V0100 v_type=G type=d num_elts=64 align=hword
.decl V0101 v_type=G type=hf num_elts=128 align=hword alias=<V0099, 0>
.decl V0102 v_type=G type=hf num_elts=128 align=hword alias=<V0100, 0>
...
mov (M1, 16) V0101(0,0)<2> V0093(0,0)<1;1,0> /// $98
mov (M1, 16) V0101(0,1)<2> V0093(1,0)<1;1,0> /// $99
mov (M1, 16) V0101(2,0)<2> V0093(2,0)<1;1,0> /// $100
mov (M1, 16) V0101(2,1)<2> V0093(3,0)<1;1,0> /// $101
mov (M1, 16) V0101(4,0)<2> V0093(4,0)<1;1,0> /// $102
mov (M1, 16) V0101(4,1)<2> V0093(5,0)<1;1,0> /// $103
mov (M1, 16) V0101(6,0)<2> V0093(6,0)<1;1,0> /// $104
mov (M1, 16) V0101(6,1)<2> V0093(7,0)<1;1,0> /// $105
mov (M5, 16) V0102(0,0)<2> V0094(0,0)<1;1,0> /// $106
mov (M5, 16) V0102(0,1)<2> V0094(1,0)<1;1,0> /// $107
mov (M5, 16) V0102(2,0)<2> V0094(2,0)<1;1,0> /// $108
mov (M5, 16) V0102(2,1)<2> V0094(3,0)<1;1,0> /// $109
mov (M5, 16) V0102(4,0)<2> V0094(4,0)<1;1,0> /// $110
mov (M5, 16) V0102(4,1)<2> V0094(5,0)<1;1,0> /// $111
mov (M5, 16) V0102(6,0)<2> V0094(6,0)<1;1,0> /// $112
mov (M5, 16) V0102(6,1)<2> V0094(7,0)<1;1,0> /// $113
lsc_store.ugm.wb.wb (M1, 16) bti(0x2)[V0097]:a32 V0099:d32x4 /// $114
lsc_store.ugm.wb.wb (M5, 16) bti(0x2)[V0098]:a32 V0100:d32x4 /// $115
There is no difference in 01-vector-add with or without the explicit bitcast ops.
Closing this issue as nothing needs to be changed.
The Triton kernel mainly depends on IGC to vectorize the data and convert the SIMT kernel into a SIMD kernel that can execute on Intel GPUs. (Compare this to control-flow vectorization.)
IGC uses a naive SoA layout to vectorize scalar and composite data types. (A hybrid SoA/AoS approach could be advantageous.) The rough idea, demonstrated in C++:
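The C++ demonstration referenced here does not appear in the thread, so the following is a minimal sketch of the naive SoA idea under my own assumptions (all names are illustrative, not IGC code): each SIMT lane's scalar becomes one element of a per-variable lane array, and a scalar binary op expands into one loop/vector op across lanes.

```cpp
#include <array>
#include <cassert>

constexpr int kSimdWidth = 16; // one slot per SIMT lane (illustrative)

// SIMT view: each work-item holds scalars 'float a, b, c;'.
// After naive SoA vectorization, each scalar becomes a full
// lane-array (a SIMD register) holding that value for every lane.
struct SoAKernelState {
    std::array<float, kSimdWidth> a;
    std::array<float, kSimdWidth> b;
    std::array<float, kSimdWidth> c;
};

// The SIMT op 'c = a + b' expands naturally to one SIMD add across lanes.
void add(SoAKernelState& s) {
    for (int lane = 0; lane < kSimdWidth; ++lane)
        s.c[lane] = s.a[lane] + s.b[lane];
}
```

This is the case where SoA vectorization is free: the per-lane operation maps one-to-one onto a SIMD instruction with no data movement.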
It is natural to vectorize a general binary operation by simply expanding it into a SIMD instruction.
But this causes extra overhead for some other ops, such as bitcast and extract/insert.
Because the layouts of the SIMT types <2 x i16> and i32 differ after vectorization into SIMD, the bitcast requires an extra register value shuffle in SIMD, while in SIMT it is effectively a no-op that only reinterprets the data type.
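That layout difference can be sketched in C++ (illustrative data structures, not actual IGC internals): after SoA vectorization the two i16 elements of a SIMT <2 x i16> live in separate lane-arrays, so reinterpreting them as per-lane i32 values forces an interleave, whereas in SIMT the bitcast is free.

```cpp
#include <array>
#include <cstdint>
#include <cassert>

constexpr int kLanes = 4; // small SIMD width for illustration

// SoA layout of a SIMT '<2 x i16>': element 0 of every lane stored
// together, then element 1 of every lane (two separate registers).
struct SoAVec2i16 {
    std::array<uint16_t, kLanes> e0;
    std::array<uint16_t, kLanes> e1;
};

// SoA layout of a SIMT 'i32': one contiguous 32-bit value per lane.
using SoAi32 = std::array<uint32_t, kLanes>;

// In SIMT, bitcast <2 x i16> -> i32 is a pure reinterpretation.
// In SIMD/SoA form the two halves live in different registers, so the
// compiler must interleave them per lane -- this corresponds to the
// strided 'mov' shuffles seen in the VISA dump above.
SoAi32 bitcastSoA(const SoAVec2i16& v) {
    SoAi32 out{};
    for (int lane = 0; lane < kLanes; ++lane)
        out[lane] = uint32_t(v.e0[lane]) | (uint32_t(v.e1[lane]) << 16);
    return out;
}
```

The loop body is exactly the cost a SIMT programmer never sees: one shuffle per element pair, paid only because the SoA layout splits the vector elements across registers.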
This really impacts performance in certain cases, especially coalesced memory loads/stores and composing/decomposing DPAS operands.
E.g., the disassembly for a store:
The VISA: