Closed Mx7f closed 4 years ago
Interesting question. This transformation requires commutativity/associativity of the reduction, and the only transformation we have that exploits those is rfactor, so if such a schedule existed it would have to exploit rfactor somehow. This would be a degenerate rfactor because you don't actually want to factor the reduction....
Ok, so you can use an empty rfactor to lift each reduction out into its own Func. Then you can merge the now independent Funcs' loops with compute_with:
#include "Halide.h"
using namespace Halide;
int main(int argc, char **argv) {
Func f0, f1, cost;
Var x;
f0(x) = x;
f1(x) = x;
RDom r(0, 100);
cost() = 0.f;
cost() += f0(r.x);
cost() += f1(r.x);
f0.compute_root();
f1.compute_root();
// Move the reductions into their own Funcs
Func tmp1 = cost.update(0).rfactor({});
Func tmp2 = cost.update(1).rfactor({});
tmp1.compute_root();
tmp2.compute_root();
// Now that they're independent funcs, we can fuse the loops using compute_with
tmp1.update().compute_with(tmp2.update(), r.x);
cost.realize();
return 0;
}
This produces the IR:
allocate cost_intm$1[float32 * 1]
allocate cost_intm[float32 * 1]
produce cost_intm {
produce cost_intm$1 {
cost_intm$1[0] = 0.000000f
consume f1$0 {
consume f0$0 {
for (cost_intm$1.s1.r$x, 0, 100) {
cost_intm$1[0] = (cost_intm$1[0] + float32(f1$0[cost_intm$1.s1.r$x]))
cost_intm[0] = (cost_intm[0] + float32(f0$0[cost_intm$1.s1.r$x]))
}
}
}
free f0$0
free f1$0
cost_intm[0] = 0.000000f
}
}
produce cost {
cost[0] = 0.000000f
consume cost_intm {
cost[0] = (cost[0] + cost_intm[0])
}
free cost_intm
consume cost_intm$1 {
cost[0] = (cost[0] + cost_intm$1[0])
}
free cost_intm$1
}
waiiit... that IR is buggy! cost_intm is zerod out after its update stage. Adding this works around the bug:
tmp1.compute_with(tmp2, Var::outermost());
@alexreinking any ideas why the pure stage for cost_intm got stamped down after its update stage? I think you were the last to look at this code.
Tried it out on a bit bigger version of the problem (2D, 3 Func
s, computing the Func
s inline); works as expected. Thanks!
Is the auxillary storage (the two cost_intm) and final summation unavoidable? I'm thinking of the storage overhead if instead of reducing into a scalar, we reduced into a vector; and then if we had a gpu-parallelized version, only needing one kernel instead of two.
@abadams - not sure! Let me take a look
The auxiliary storage and final summation would be avoidable with store_with, but it's not quite mergeable yet. You could try the branch if you like.
Turns out this is a pretty old bug, it never actually worked. Going to see what's needed to fix it.
Did #3947 fix this issue?
@steven-johnson -- no, not yet. I'm working on a rewrite of the compute_with
feature. It's completely broken.
Is there any way to merge update definitions without changing the algorithm code (so just using scheduling constructs)? Put another way, given a 1D
RDom
r, and 2Func
s f0,f1, can I convert this code:To use a single loop for the updates, like the default schedule for: