Closed dm-maxar closed 1 year ago
Hello dm-maxar, doing this automatically is out of scope for loopy, as it is a schedule rewriting system. However, applying the following transformations should meet your requirements:
import loopy as lp
import numpy as np
import pyopencl as cl

knl = lp.make_kernel(
    "{ [n,m]: 0 <= n < N and 0 <= m < 4 * N}",
    """
    <> b[n] = sin(a[n])
    c[m] = b[m//4]
    """,
    [lp.ValueArg("N", np.int32),
     lp.GlobalArg("a", np.float32, shape=("N",), is_output=False),
     lp.GlobalArg("c", np.float32, shape=lp.auto, is_output=True)],
    assumptions="1<=N")

# Turn the temporary b into a substitution rule named b_subst.
knl = lp.assignment_to_subst(knl, "b")
# Map m onto work-groups (m_outer) of 256 work-items (m_inner).
knl = lp.split_iname(knl, "m", 256, inner_tag="l.0", outer_tag="g.0")
# Precompute b_subst over the footprint swept by m_inner, once per m_outer,
# and store the result in local memory.
knl = lp.precompute(knl,
                    "b_subst",
                    ("m_inner",),
                    precompute_outer_inames=frozenset({"m_outer"}),
                    temporary_address_space=lp.AddressSpace.LOCAL)

print(lp.generate_code_v2(knl).device_code())
which leads to the following OpenCL code with local memory allocation corresponding to 64 doubles.
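The figure of 64 entries follows from the index arithmetic alone: each work-group covers 256 consecutive values of m, and m//4 takes 64 distinct values over that range. Below is a minimal plain-Python check of that footprint. It does not call loopy, and the helper name `b_footprint` is made up for illustration.

```python
# Plain-Python check of the per-work-group footprint of b (not loopy API).
WORKGROUP_SIZE = 256   # inner length passed to split_iname above
REUSE = 4              # c[m] reads b[m // 4]


def b_footprint(m_outer, workgroup_size=WORKGROUP_SIZE, reuse=REUSE):
    """Return the sorted distinct indices of b read by work-group m_outer."""
    return sorted({(m_outer * workgroup_size + m_inner) // reuse
                   for m_inner in range(workgroup_size)})


for g in range(3):
    idx = b_footprint(g)
    # group id, number of distinct entries, first index, last index
    print(g, len(idx), idx[0], idx[-1])
# prints:
# 0 64 0 63
# 1 64 64 127
# 2 64 128 191
```

Every work-group touches exactly 64 contiguous entries starting at 64 * m_outer, which is why a 64-entry local buffer is enough.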
Oh hey, that works really well for that case. I have been poking at more complicated variants of that assignment_to_subst -> split_iname -> precompute method and it gets tricky, but it worked in that example.
Apologies if this isn't a feature that needs to be added.
Suppose one has the following basic loop definition:
I tried to make this very "loopish" in the sense that it doesn't mix in a bunch of assumptions about the final deployed loop solution. Next, suppose we want to run this on OpenCL or any platform where LOCAL memory is allocated independently for each set of threads, as in CUDA or OpenCL. Let there be 256 threads in the workgroup:
If we dump this to code we get
where, if `b` were in global memory, the indexing would be fine. However, `b` is in local memory, and this storage and indexing pattern is extremely wasteful. For example, the only values of `b[p]` that get read are `[p]: {64 * gid(0) <= p <= 64 * gid(0) + 63}`. With appropriate offsetting of the reads/writes into `b`, one could have a `b.shape=(64,)` rather than `b.shape=(N,)`, which is very large for large `N`.

I think that what needs to happen is there needs to be a transformation that tells loopy "the variable b is now not a regular global scope temporary instruction, it is a workgroup scope temporary variable whose starting point for access should be 0" and then pymbolic can apply an appropriate offset that is subtracted from the writes/reads to `b` so that the shape of `b` is no larger than necessary.

Given that this probably can't be done automatically at the moment and that my issue is really a feature enhancement, is there any way for me to override the indexing of `b` (replace `b[p]` with `b[p - 64 * gid(0)]`) manually?
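To make the proposed index rewrite concrete, here is a minimal plain-Python emulation of the work-group behaviour being asked for. It is only a sketch and not loopy output: it assumes the kernel has the form `b[n] = sin(a[n])`, `c[m] = b[m//4]` with 256 threads per work-group, and the names `run_workgroup`, `run`, `TILE` and `LSIZE` are invented for illustration.

```python
import math

TILE = 64      # distinct b entries per work-group (256 threads / 4)
LSIZE = 256    # threads per work-group


def run_workgroup(a, c, gid):
    """Emulate one work-group: fill a 64-entry local tile of b, then read it."""
    b_local = [0.0] * TILE
    base = TILE * gid                        # offset subtracted from every index into b
    for p in range(base, min(base + TILE, len(a))):
        b_local[p - base] = math.sin(a[p])   # write b[p - 64 * gid(0)]
    for lid in range(LSIZE):
        m = LSIZE * gid + lid
        if m < len(c):
            c[m] = b_local[m // 4 - base]    # read b[m//4 - 64 * gid(0)]


def run(a):
    """Run all work-groups in sequence and return c."""
    n = len(a)
    c = [0.0] * (4 * n)
    num_groups = (4 * n + LSIZE - 1) // LSIZE
    for gid in range(num_groups):
        run_workgroup(a, c, gid)
    return c


a = [0.1 * i for i in range(100)]
c = run(a)
# Same result as indexing a full-size b of shape (N,)
assert c == [math.sin(a[m // 4]) for m in range(4 * len(a))]
```

The tile never needs more than 64 entries regardless of N, because every access is shifted by the same per-group base.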