LOCAL memory access starts at an inefficient point relative to access pattern

dm-maxar commented 1 year ago

Apologies if this isn't a feature that needs to be added.

Suppose one has the following basic loop definition:

knl = lp.make_kernel(
    "{ [n,m]: 0 <= n < N  and 0 <= m < 4 * N}",
    """
    b[n] = sin(a[n])
    c[m] = b[m//4]
    """,
    [lp.ValueArg('N', np.int32),
     lp.GlobalArg('a', np.float32, shape=('N',), is_output=False),
     lp.TemporaryVariable('b', np.float32, shape=lp.auto,
                          address_space=lp.AddressSpace.LOCAL),
     lp.GlobalArg('c', np.float32, shape=lp.auto, is_output=True)], assumptions='1<=N')

I tried to make this very "loopish" in the sense that it doesn't specialize mix in a bunch of assumptions about the final deployed loop solution. Next, suppose we want to run this on OpenCL or any compiler where LOCAL memory is independently allocated for set of threads as in CUDA or OpenCL. Let there be 256 threads in the Workgroup:

knl = lp.split_iname(knl, "n", 256, outer_iname='outer_a', inner_iname='inner_a')
knl = lp.split_iname(knl, "m", 256, outer_iname='outer_b', inner_iname='inner_b')
knl = lp.tag_inames(knl, dict(outer_a="g.0", inner_a="l.0"))
knl = lp.tag_inames(knl, dict(outer_b="g.0", inner_b="l.0"))

If we dump this to code we get

#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))

__kernel void __attribute__ ((reqd_work_group_size(256, 1, 1))) loopy_kernel(int const N, __global float const *__restrict__ a, __global float *__restrict__ c)
{
  __local float b[N];

  if (-1 + -256 * gid(0) + N >= 0)
  {
    if (-1 + -256 * gid(0) + -1 * lid(0) + N >= 0)
      b[256 * gid(0) + lid(0)] = sin(a[256 * gid(0) + lid(0)]);
    barrier(CLK_LOCAL_MEM_FENCE) /* for b (insn_0 depends on insn) */;
  }
  if (-1 + -256 * gid(0) + -1 * lid(0) + 4 * N >= 0)
    c[256 * gid(0) + lid(0)] = b[64 * gid(0) + lid(0) / 4];
}

where if b was in global memory, the indexing is fine. However, b is in local memory and this storage and indexing pattern is extremely wasteful. For example, the only values of b[p] that get read are [p]: {64 * gid(0)<=p < 64 * gid(0)+63}. With appropriate offseting of the reads/writes into b, one could have a b.shape=(64,) rather than b.shape=(N,) which is very large for large N.

I think that what needs to happen is there needs to be a transformation that tells loopy "the variable b is now not a regular global scope temporary instruction, it is a workgroup scope temporary variable whose starting point for access should be 0" and then pymbolic can appropriate an offset that should be subtracted from the writes/reads to b so that the shape of b can be no larger than necessary.

Given that this probably can't be done automatically at the moment and that my issue is really a feature enhancement, is there any way for me to override the indexing of b replace b[p] with b[p - 64 * gid(0)] manually?

kaushikcfd commented 1 year ago

Hello dm-maxar, Doing this automatically is out-of-scope for loopy as it is a schedule rewriting system. However, applying the following transformation can reproduce your requirements:

import loopy as lp
import numpy as np
import pyopencl as cl

knl = lp.make_kernel(
    "{ [n,m]: 0 <= n < N  and 0 <= m < 4 * N}",
    """
    <> b[n] = sin(a[n])
    c[m] = b[m//4]
    """,
    [lp.ValueArg("N", np.int32),
     lp.GlobalArg("a", np.float32, shape=("N",), is_output=False),
     lp.GlobalArg("c", np.float32, shape=lp.auto, is_output=True)],
    assumptions="1<=N")

knl = lp.assignment_to_subst(knl, "b")
knl = lp.split_iname(knl, "m", 256, inner_tag="l.0", outer_tag="g.0")
knl = lp.precompute(knl,
                    "b_subst",
                    ("m_inner",),
                    precompute_outer_inames=frozenset({"m_outer"}),
                    temporary_address_space=lp.AddressSpace.LOCAL)
print(lp.generate_code_v2(knl).device_code())

which leads to the following OpenCL code with local memory allocation corresponding to 64 doubles.

```c #define lid(N) ((int) get_local_id(N)) #define gid(N) ((int) get_group_id(N)) __kernel void __attribute__ ((reqd_work_group_size(256, 1, 1))) loopy_kernel(int const N, __global float const *__restrict__ a, __global float *__restrict__ c) { __local float b_subst_0[64]; if (63 + -1 * lid(0) >= 0 && -1 + -64 * gid(0) + -1 * lid(0) + N >= 0) b_subst_0[lid(0)] = sin(a[64 * gid(0) + lid(0)]); barrier(CLK_LOCAL_MEM_FENCE) /* for b_subst_0 (insn_0 depends on b_subst) */; if (-1 + -256 * gid(0) + -1 * lid(0) + 4 * N >= 0) c[256 * gid(0) + lid(0)] = b_subst_0[lid(0) / 4]; } ```

dm-maxar commented 1 year ago

OH hey, that works really well for that case. I have been poking at more complicated variants of that assignment_to_subst -> split_iname -> precompute method and it gets tricky but it worked in that example.

inducer / loopy

LOCAL memory access starts at an inefficient point relative to access pattern #766