Open isuruf opened 1 year ago
For context, in a sumpy P2P kernel I have a temporary of size
local_isrc[5, 45]
which results in 5 memory loads/stores, but it could be split into
local_isrc_s0[2, 45] local_isrc_s1[2, 45] local_isrc_s2[1, 45]
which results in only 3 memory loads/stores.
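The counts above follow from simple arithmetic, assuming each load or store can move up to two consecutive entries along the first axis (a vector width of 2, which is what the proposed split implies). A minimal sketch; `split_sizes` is a hypothetical helper for illustration and not part of loopy:

```python
def split_sizes(n, width):
    """Chunk an axis of length n into pieces of at most `width` entries."""
    return [min(width, n - start) for start in range(0, n, width)]


sizes = split_sizes(5, 2)
print(sizes)       # lengths of local_isrc_s0, _s1, _s2 along axis 0
print(len(sizes))  # vector loads/stores needed per column
```

With a width of 2, the five rows fall into chunks of 2, 2 and 1, which matches the three arrays listed above.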
One way that I can achieve this is to do
lp.split_array_axes(knl, "local_isrc", 0, 2)
lp.tag_array_axes(knl, "local_isrc", "C,vec,C")
However, this results in 6*45 elements being allocated in shared memory. (Sometimes the compiler optimizes this into 5, 45, sometimes not.)
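The 6*45 figure is consistent with the split axis being rounded up to a whole number of vectors. A sketch of that arithmetic, assuming the split pads axis 0 to a multiple of the vector width; `padded_shape` is a hypothetical helper, not a loopy function:

```python
def padded_shape(shape, axis, width):
    """Shape after splitting `axis` into (outer, width), rounding up."""
    outer = -(-shape[axis] // width)  # ceiling division
    return shape[:axis] + (outer, width) + shape[axis + 1:]


shape = padded_shape((5, 45), axis=0, width=2)
allocated = 1
for extent in shape:
    allocated *= extent

# Compare the padded allocation with the 5 * 45 entries actually used.
print(shape, allocated, 5 * 45)
```

Under that assumption the split array has one unused row of 45 entries, which is the overhead a separate-array layout would avoid.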
I tried
lp.split_array_axes(knl, "local_isrc", 0, 2)
lp.tag_array_axes(knl, "local_isrc", "sep,vec,C")
which does not work.
Sometimes the compiler optimizes this into 5, 45, sometimes not
It turns out the compiler does optimize this predictably; I was looking at the wrong source code.