inducer / loopy

A code generator for array-based code on CPUs and GPUs
http://mathema.tician.de/software/loopy
MIT License
588 stars 73 forks source link

Different domains for statements #500

Closed thisiscam closed 3 years ago

thisiscam commented 3 years ago

I'm wondering if it is possible to specify a list of statements and associate each statement with its own domain in loopy? I looked into the documentation, and wasn't sure if this is possible.

If I understand correctly, I might need to take the product of the domains of all the statements, so that loopy will do a projection for each statement? If that's the case, it seems fine but a hassle for my use case.

kaushikcfd commented 3 years ago

Hi @thisiscam, Yep, it's possible to do that. In loopy, a kernel is list of statements+list of domains. Each domain is described via an ISL basic set. For each statement it is known which "inames"[^1] the statement is a part of, and the final domain for the generated code of a statement is obtained by intersecting the projection of each domain onto the inames of a statement[^2].

Below is an example kernel that has 2 non-interacting domains and 2 statements, with one domain corresponding to each statement:

knl = lp.make_kernel(
    ["{[i0, i1]: 0<=i0, i1<10}",
     "{[i2, i3]: 0<=i2, i3<10}"],
    """
    <> tmp[i0, i1] = 2*i0 + i1
    out[i2, i3] = 2*tmp[i2, i3]
    """)

print(lp.generate_code_v2(knl).device_code())

generates the code:

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ out)
{
  int tmp[10 * 10];

  for (int i0 = 0; i0 <= 9; ++i0)
    for (int i1 = 0; i1 <= 9; ++i1)
      tmp[10 * i0 + i1] = 2 * i0 + i1;
  for (int i2 = 0; i2 <= 9; ++i2)
    for (int i3 = 0; i3 <= 9; ++i3)
      out[10 * i2 + i3] = 2 * tmp[10 * i2 + i3];
}

[^1]: loopy's term for loop induction variables [^2]: We maintain some internal variables so that the relevant domain is computed in O(N), N being the number of inames surrounding a statement.

thisiscam commented 3 years ago

Thanks for the quick response!

It wasn't clear to me that the list of domains and list of statements have an one-to-one correspondence. For example, the following code taken from the tutorial has one domain but two statements -- which I don't understand how the mapping is specified/assumed? I looked at your explanation again and I understand this now. It seems like I need to use unique inames for each statement, so that they map to different domains?

knl = lp.make_kernel(
    "{ [i,j,ii,jj]: 0<=i,j,ii,jj<n }",
    """
    out[j,i] = a[i,j] {id=transpose}
    out[ii,jj] = 2*out[ii,jj]  {dep=transpose}
    """,
    [lp.GlobalArg("out", shape=lp.auto, is_input=False), ...])
knl = lp.prioritize_loops(knl, "i,j,ii,jj")

Further, is it possible to specify dependencies via ISL Maps, instead of the {dep=...} syntax?

kaushikcfd commented 3 years ago

It seems like I need to use unique inames for each statement, so that they map to different domains?

The inames need not be unique i.e. multiple statements can be within a single iname. For example here is a kernel that has multiple statements in the same iname, and the generated code will put the statements sharing inames into the same loop nest.

Further, is it possible to specify dependencies via ISL Maps, instead of the {dep=...} syntax?

Not yet. But https://github.com/inducer/loopy/pull/169 is WIP.

thisiscam commented 3 years ago

The inames need not be unique i.e. multiple statements can be within a single iname.

Sorry I wasn't clear. I meant that if I already have a list of statement-domain pairs, in order to map it to loopy representation, I would need to have unique inames for each statement-domain pair, so that loopy will do the correct one-to-one projection?

But #169 is WIP.

Thanks! Any idea on a timeline for this?

I should mention that my current goal is to find a code generator that operates similar to ISL's scheduling and code generation, while frees me from manually translating ISL ASTs into real CPU/GPU code.

kaushikcfd commented 3 years ago

I would need to have unique inames for each statement-domain pair, so that loopy will do the correct one-to-one projection?

Yep, that sounds correct. In loopy each iname gets mapped to a C-styled for-loop or a hardware loop, but a restriction to keep in mind is that each loop can be entered exactly once, if not possible due to the user provided statement DAG then a scheduling error is raised.

Thanks! Any idea on a timeline for this?

@inducer might have a better idea.

thisiscam commented 3 years ago

but a restriction to keep in mind is that each loop can be entered exactly once, if not possible due to the user provided statement DAG then a scheduling error is raised.

This seems like a critical restriction. Can you give an example of a statement DAG? Thanks!

kaushikcfd commented 3 years ago

This kernel for example:

knl = lp.make_kernel(
    "{[i0, i1]: 0<=i0, i1<10}",
    """
    a[i0] = 1  {id=write_a}
    b[i1] = 2  {id=write_b, dep=write_a}
    c[i0] = 3  {id=write_c, dep=write_b}
    """)

print(lp.generate_code_v2(knl).device_code())

fails with the error message:

ERROR: Sorry--loopy did not find a schedule for your kernel

as unless the i0 is entered twice (once for writing "a" and once for writing "c") it won't be possible to have a schedulable loopy kernel.

A common trick in this case is to fix this by the transformation; knl = lp.duplcate_inames(knl, 'i0', within='id:write_c'), after which we generate the following OpenCL code;

__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1))) loopy_kernel(__global int *__restrict__ a, __global int *__restrict__ b, __global int *__restrict__ c)
{
  for (int i0 = 0; i0 <= 9; ++i0)
    a[i0] = 1;
  for (int i1 = 0; i1 <= 9; ++i1)
    b[i1] = 2;
  for (int i0_0 = 0; i0_0 <= 9; ++i0_0)
    c[i0_0] = 3;
}
thisiscam commented 3 years ago

I see.

Does loopy always schedule to multiple loops when the inames are different? For example, can loopy do fusion for your aforementioned example (without changing to the same inames):

knl = lp.make_kernel(
    ["{[i0, i1]: 0<=i0, i1<10}",
     "{[i2, i3]: 0<=i2, i3<10}"],
    """
    <> tmp[i0, i1] = 2*i0 + i1
    out[i2, i3] = 2*tmp[i2, i3]
    """)

print(lp.generate_code_v2(knl).device_code())

?

A higher level question, I think, is what's the diff between loopy's scheduling and ISL's scheduling? My understanding is that the two approach work differently: in ISL, I specify statements and their domain, and ISL will help me rearrange them into nested loops (with some automatic fusion etc.) In loopy, the loop nests are contolled by inames

kaushikcfd commented 3 years ago

No loopy doesn't attempt fusion automatically because profitability of loop fusion is highly domain specific and hence must be user-guided. We have plans to implement loop fusion heuristics available in the literature in https://github.com/inducer/loopy/pull/493. (should be available in main in ~1 week)

thisiscam commented 3 years ago

@kaushikcfd Many thanks. That helps a lot!