halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.88k stars 1.07k forks source link

PredicateLoads and ShiftInwards can break sliding window #7819

Open abadams opened 1 year ago

abadams commented 1 year ago

When sliding a producer backwards, values beyond the end of the currently computed block are used. They are valid because they were computed on a previous loop iteration of the consumer.

However, if you split the producer with a PredicateLoad tail strategy, these needed beyond-the-end values can be clobbered with garbage. Here's a failure case:

#include "Halide.h"

using namespace Halide;

int main(int argc, char **argv) {
    Func input, f, g;
    Var x;

    input(x) = x;

    f(x) = input(x);

    const int w = 1024;

    g(x) = f(w - x - 1) + f(w - x) + f(w - x + 1);

    input.compute_root();
    f.store_root().compute_at(g, x);
    f.vectorize(x, 8, TailStrategy::PredicateLoads);

    Buffer<int> buf = g.realize({1024});

    for (int x = 0; x < 1024; x++) {
        int correct = (1024 - x) * 3;
        if (buf(x) != correct) {
            printf("buf(%d) = %d instead of %d\n", x, buf(x), correct);
            return 1;
        }
    }

    return 0;
}
abadams commented 1 year ago

ShiftInwards can also be used to clobber storage with garbage values from before the start of the computed block, by having it load from a producer with smaller compute bounds than allocation bounds, so this fails too:

#include "Halide.h"

using namespace Halide;

int main(int argc, char **argv) {
    Func input("input"), f("f"), g("g");
    Var x("x");

    input(x) = x;

    f(x) = input(x);

    g(x) = f(x - 1) + f(x) + f(x + 1);

    input.compute_root();
    f.in().store_root().compute_at(g, x);
    f.in().vectorize(x, 8, TailStrategy::ShiftInwards);
    f.compute_at(g, x);

    Buffer<int> buf = g.realize({1024});

    for (int x = 0; x < 1024; x++) {
        int correct = x * 3;
        if (buf(x) != correct) {
            printf("buf(%d) = %d instead of %d\n", x, buf(x), correct);
            return 1;
        }
    }

    return 0;
}

It generates the following steady-state IR. Note that a single new value of f is needed per iteration of g.x, so only one is computed. However the wrapper Func is vectorized with ShiftInwards, so it computes 8 values, 7 of them before the start. Because the allocation bounds of f are larger than its compute bounds, these 7 values are garbage. Normally this would be fine and WAI, but g wants to use those values due to sliding.

   for (g.s0.x.$n.rebased, 0, g.extent.0) {
    allocate f[int32 * 8]
    produce f {
     f[7] = input[g.s0.x.$n.rebased + 2]
    }
    produce f_global_wrapper$0 {
     consume f {
      f_global_wrapper$0[ramp(g.s0.x.$n.rebased + 2, 1, 8)] = f[ramp(0, 1, 8)]
     }
    }
    free f
    consume f_global_wrapper$0 {
     g[g.s0.x.$n.rebased] = f_global_wrapper$0[g.s0.x.$n.rebased + 8] + (f_global_wrapper$0[g.s0.x.$n.rebased + 9] + f_global_wrapper$0[g.s0.x.$n.rebased + 7])
    }
   }
abadams commented 1 year ago

We need to either forbid schedules that may store garbage values along dimensions that are slid, or not slide along dimensions that may store garbage values.

abadams commented 1 year ago

and RoundUp could break reverse sliding in the same way that ShiftInwards broke forwards sliding -- by loading garbage values from some earlier Func.