Open abadams opened 1 year ago
ShiftInwards can also be used to clobber storage with garbage values from before the start of the computed block, by having it load from a producer with smaller compute bounds than allocation bounds, so this fails too:
#include "Halide.h"
using namespace Halide;
int main(int argc, char **argv) {
Func input("input"), f("f"), g("g");
Var x("x");
input(x) = x;
f(x) = input(x);
g(x) = f(x - 1) + f(x) + f(x + 1);
input.compute_root();
f.in().store_root().compute_at(g, x);
f.in().vectorize(x, 8, TailStrategy::ShiftInwards);
f.compute_at(g, x);
Buffer<int> buf = g.realize({1024});
for (int x = 0; x < 1024; x++) {
int correct = x * 3;
if (buf(x) != correct) {
printf("buf(%d) = %d instead of %d\n", x, buf(x), correct);
return 1;
}
}
return 0;
}
It generates the following steady-state IR. Note that a single new value of f is needed per iteration of g.x, so only one is computed. However the wrapper Func is vectorized with ShiftInwards, so it computes 8 values, 7 of them before the start. Because the allocation bounds of f are larger than its compute bounds, these 7 values are garbage. Normally this would be fine and WAI, but g wants to use those values due to sliding.
for (g.s0.x.$n.rebased, 0, g.extent.0) {
allocate f[int32 * 8]
produce f {
f[7] = input[g.s0.x.$n.rebased + 2]
}
produce f_global_wrapper$0 {
consume f {
f_global_wrapper$0[ramp(g.s0.x.$n.rebased + 2, 1, 8)] = f[ramp(0, 1, 8)]
}
}
free f
consume f_global_wrapper$0 {
g[g.s0.x.$n.rebased] = f_global_wrapper$0[g.s0.x.$n.rebased + 8] + (f_global_wrapper$0[g.s0.x.$n.rebased + 9] + f_global_wrapper$0[g.s0.x.$n.rebased + 7])
}
}
We need to either forbid schedules that may store garbage values along dimensions that are slid, or not slide along dimensions that may store garbage values.
and RoundUp could break reverse sliding in the same way that ShiftInwards broke forwards sliding -- by loading garbage values from some earlier Func.
When sliding a producer backwards, values beyond the end of the currently computed block are used. They are valid because they were computed on a previous loop iteration of the consumer.
However, if you split the producer with a PredicateLoad tail strategy, these needed beyond-the-end values can be clobbered with garbage. Here's a failure case: