halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Merging Update Definitions in the Schedule? #3943

Closed Mx7f closed 4 years ago

Mx7f commented 5 years ago

Is there any way to merge update definitions without changing the algorithm code (so, using only scheduling constructs)? Put another way, given a 1D RDom r and two Funcs f0 and f1, can I convert this code:

Halide::Func cost("cost");
cost() = 0.f;
cost() += f0(r.x);
cost() += f1(r.x);

To use a single loop for the updates, like the default schedule for:

Halide::Func cost("cost");
cost() = 0.f;
cost() += f0(r.x) + f1(r.x);
abadams commented 5 years ago

Interesting question. This transformation requires commutativity/associativity of the reduction, and the only transformation we have that exploits those is rfactor, so if such a schedule existed it would have to exploit rfactor somehow. This would be a degenerate rfactor, because you don't actually want to factor the reduction...

Ok, so you can use an empty rfactor to lift each reduction out into its own Func. Then you can merge the now independent Funcs' loops with compute_with:

#include "Halide.h"

using namespace Halide;

int main(int argc, char **argv) {
    Func f0, f1, cost;
    Var x;
    f0(x) = x;
    f1(x) = x;

    RDom r(0, 100);
    cost() = 0.f;
    cost() += f0(r.x);
    cost() += f1(r.x);

    f0.compute_root();
    f1.compute_root();

    // Move the reductions into their own Funcs
    Func tmp1 = cost.update(0).rfactor({});
    Func tmp2 = cost.update(1).rfactor({});

    tmp1.compute_root();
    tmp2.compute_root();

    // Now that they're independent funcs, we can fuse the loops using compute_with

    tmp1.update().compute_with(tmp2.update(), r.x);

    cost.realize();

    return 0;
}

This produces the IR:

  allocate cost_intm$1[float32 * 1]
  allocate cost_intm[float32 * 1]
  produce cost_intm {
    produce cost_intm$1 {
      cost_intm$1[0] = 0.000000f
      consume f1$0 {
        consume f0$0 {
          for (cost_intm$1.s1.r$x, 0, 100) {
            cost_intm$1[0] = (cost_intm$1[0] + float32(f1$0[cost_intm$1.s1.r$x]))
            cost_intm[0] = (cost_intm[0] + float32(f0$0[cost_intm$1.s1.r$x]))
          }
        }
      }
      free f0$0
      free f1$0
      cost_intm[0] = 0.000000f
    }
  }
  produce cost {
    cost[0] = 0.000000f
    consume cost_intm {
      cost[0] = (cost[0] + cost_intm[0])
    }
    free cost_intm
    consume cost_intm$1 {
      cost[0] = (cost[0] + cost_intm$1[0])
    }
    free cost_intm$1
  }
abadams commented 5 years ago

waiiit... that IR is buggy! cost_intm is zeroed out after its update stage. Adding this works around the bug:

    tmp1.compute_with(tmp2, Var::outermost());

@alexreinking any ideas why the pure stage for cost_intm got stamped down after its update stage? I think you were the last to look at this code.

Mx7f commented 5 years ago

Tried it out on a slightly bigger version of the problem (2D, three Funcs, computing the Funcs inline); it works as expected. Thanks!

Is the auxiliary storage (the two cost_intm buffers) and the final summation unavoidable? I'm thinking of the storage overhead if, instead of reducing into a scalar, we reduced into a vector; and, for a GPU-parallelized version, of only needing one kernel instead of two.

alexreinking commented 5 years ago

@abadams - not sure! Let me take a look

abadams commented 5 years ago

The auxiliary storage and final summation would be avoidable with store_with, but it's not quite mergeable yet. You could try the branch if you like.

alexreinking commented 5 years ago

Turns out this is a pretty old bug; it never actually worked. Going to see what's needed to fix it.

steven-johnson commented 5 years ago

Did #3947 fix this issue?

alexreinking commented 5 years ago

@steven-johnson -- no, not yet. I'm working on a rewrite of the compute_with feature. It's completely broken.