inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License

Allow vec tagging of odd sizes for local temporaries #779

Open isuruf opened 1 year ago

isuruf commented 1 year ago

For context, in a sumpy P2P kernel I have a temporary of shape

local_isrc[5, 45]

which results in 5 memory loads/stores, but it could be split into

local_isrc_s0[2, 45]
local_isrc_s1[2, 45]
local_isrc_s2[1, 45]

which results in only 3 memory loads/stores.

One way that I can achieve this is to do

lp.split_array_axes(knl, "local_isrc", 0, 2)
lp.tag_array_axes(knl, "local_isrc", "C,vec,C")

However, this results in 6*45 elements being allocated in shared memory. (Sometimes the compiler optimizes this into 5, 45, sometimes not.)
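To make the numbers concrete, here is a small stand-alone sketch of the arithmetic. It does not call loopy. The helper name `split_axis_stats` is hypothetical, and it assumes each chunk of the split axis maps to one load/store per index of the other axis.

```python
from math import ceil


def split_axis_stats(axis_len, vec_width, other_len):
    """Hypothetical helper: sizes when an axis is split into vec_width-wide chunks."""
    # Number of chunks needed to cover the axis (the last one may be partial).
    n_chunks = ceil(axis_len / vec_width)
    # Width of each chunk; the final chunk holds whatever is left over.
    chunk_widths = [min(vec_width, axis_len - i * vec_width) for i in range(n_chunks)]
    # Elements allocated if every chunk is padded to the full vector width.
    padded_elems = n_chunks * vec_width * other_len
    # Elements actually needed without padding.
    exact_elems = axis_len * other_len
    return chunk_widths, padded_elems, exact_elems


chunk_widths, padded_elems, exact_elems = split_axis_stats(5, 2, 45)
print(chunk_widths)               # [2, 2, 1]
print(padded_elems, exact_elems)  # 270 225
```

Under that assumption, the chunk widths 2, 2, 1 correspond to the three loads/stores, and the padded layout allocates 6*45 = 270 elements where 5*45 = 225 would suffice.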

I tried

lp.split_array_axes(knl, "local_isrc", 0, 2)
lp.tag_array_axes(knl, "local_isrc", "sep,vec,C")

which does not work.

isuruf commented 1 year ago

"Sometimes the compiler optimizes this into 5, 45, sometimes not"

Turns out the compiler does optimize it predictably. I was looking at the wrong source code.