ongjunjie opened 5 years ago
There are a few options for supporting this:
Allow it, but only in cases where any race conditions are benign (e.g. storing the same value to the same site). This would probably be implemented by disallowing update definitions whenever we can't prove there's no cross-talk.
Allow it, but interpret it as a separate memory region per parallel task, so that there's no possibility of cross-talk. This is a bit weird, because really you have a large array of allocations instead of a single allocation. The number of allocations would be the trip count of the outer parallel loop, which is inefficient; really you'd want it to be the maximum number of concurrently running tasks, but then you'd have to manage it with an allocator...
Solve the use case more specifically by supporting compute_at(g, gpu_blocks).store_in(MemoryType::Global) or similar, then implement it however we like. The first thing to try would be just calling malloc inside the generated kernel.
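As a rough illustration of "just calling malloc inside the generated kernel", here is a hedged CUDA sketch (not runnable without a GPU). Device-side malloc/free and cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) are real CUDA features; the kernel name, buffer names, and launch shape are invented, and this is only a sketch of what generated code might look like, not Halide's actual codegen.

```cuda
#include <cstdio>

__global__ void fused_stage(float *out, int elems_per_block) {
    // One allocation per block, made by thread 0 and published to the
    // rest of the block, standing in for the intermediate stage's buffer.
    __shared__ float *scratch;
    if (threadIdx.x == 0)
        scratch = (float *)malloc(elems_per_block * sizeof(float));
    __syncthreads();
    if (scratch == NULL) return;  // device heap exhausted

    int i = threadIdx.x;
    scratch[i] = blockIdx.x + i;                           // producer stage
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = scratch[i] * 2.0f;  // consumer stage

    __syncthreads();
    if (threadIdx.x == 0)
        free(scratch);
}

int main() {
    // In-kernel malloc draws from a device heap whose size is set with
    // cudaDeviceSetLimit; the default (8 MB) may be too small.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);
    float *out;
    cudaMalloc(&out, 128 * 256 * sizeof(float));
    fused_stage<<<128, 256>>>(out, 256);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```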
Detect cases where we're going to allocate too much shared memory automatically and lower those to calling malloc inside the kernel. We could also make a pool of global memory and manage it with our own allocator, if malloc proves to be weirdly slow.
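The "pool of global memory with our own allocator" alternative could be as simple as a bump allocator over one preallocated block. A minimal sketch in C, with all names invented for illustration; a real version would need atomic allocation for concurrent callers and a story for freeing individual allocations rather than resetting the whole pool.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;   /* one up-front allocation of global memory */
    size_t   size;
    size_t   used;
} Pool;

static Pool pool_create(size_t size) {
    Pool p = { malloc(size), size, 0 };
    return p;
}

/* Bump-allocate n bytes, 16-byte aligned; NULL when the pool is full. */
static void *pool_alloc(Pool *p, size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15;
    if (p->base == NULL || p->used + aligned > p->size)
        return NULL;
    void *out = p->base + p->used;
    p->used += aligned;
    return out;
}

/* Release everything at once, e.g. at the end of a kernel launch. */
static void pool_reset(Pool *p) { p->used = 0; }
```

This avoids a malloc call per allocation, which is the point if device-side malloc proves to be weirdly slow.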
Thoughts?
The use case for this is running on a GPU where some intermediate stages need to be stored in global memory, because the available shared memory is too small. Today, this behaviour can only be achieved by scheduling these stages to run in their own kernels rather than fusing them with their consumers.
Toy example:
This gives the error