If, for some reason, an allocation group holding fused storage for multiple Funcs that were originally intended to go in GPUShared gets lifted out of the GPU-block loops and ends up in Heap memory instead, the profiling injection logic would assume that this buffer came from a function with the same name. The buffer was then incorrectly determined to be on the stack, because the logic ignored the custom_new and custom_free attributes of the Allocate node.
Consider this example (also included as a new test):
#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main(int argc, char *argv[]) {
    Target t = get_jit_target_from_environment();
    if (!t.has_gpu_feature()) {
        printf("[SKIP] GPU not enabled\n");
        return 0;
    }

    Var x{"x"}, y{"y"};

    // Two intermediate stages feeding the output.
    Func f1{"f1"}, f2{"f2"};
    f1(x, y) = cast<float>(x + y);
    f2(x, y) = f1(x, y) * 2;

    Func result{"result"};
    result(x, y) = f2(x, y);

    Var xo{"xo"}, yo{"yo"}, xi{"xi"}, yi{"yi"};
    result
        .compute_root()
        .gpu_tile(x, y, xo, yo, xi, yi, 16, 16)
        .reorder(xi, yi, xo, yo);

    // Both intermediates are computed per GPU block but stored in Heap memory.
    f2.compute_at(result, xo)
        .gpu_threads(x, y)
        .store_in(MemoryType::Heap);

    f1.compute_at(result, xo)
        .gpu_threads(x, y)
        .store_in(MemoryType::Heap);

    result.print_loop_nest();

    t.set_feature(Target::Profile);  // Make sure profiling is enabled!
    result.compile_jit(t);

    printf("Success!\n");
    return 0;
}
Produces the following Stmt right before the Profiling pass:
Notice that allocgroup__f1$0.0__f2$0.1.buffer sits outside the outermost GPU-block loop. When this buffer was not lifted out of the kernel, profiling was not an issue, because the profiler does not traverse the IR inside GPU loops.
The offending line was:
https://github.com/halide/Halide/blob/461c12871f336fe6f57b55d6a297f13ef209161b/src/Profiling.cpp#L274
When instrumenting the Allocate node, the node is incorrectly determined to be on_stack=true. This PR checks whether there is a custom_new and, if so, overrides on_stack to false.
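The override described above can be sketched in isolation. This is a simplified model, not the code in Profiling.cpp: AllocateInfo, on_stack_before_fix, and on_stack_after_fix are hypothetical names, and the size-based criterion for the initial on-stack guess is an assumption made for illustration.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the Allocate-node properties that
// matter here. This is not Halide's actual IR class.
struct AllocateInfo {
    std::string name;
    bool has_constant_size;  // assumption: the initial guess is based on size
    bool has_custom_new;     // true when a custom allocation is attached
};

// Initial guess only: roughly the behaviour the description attributes to the
// profiler before the fix, which ignored custom_new / custom_free.
inline bool on_stack_before_fix(const AllocateInfo &a) {
    return a.has_constant_size;
}

// With the override: a buffer that has a custom allocator is never treated
// as a stack allocation, regardless of the initial guess.
inline bool on_stack_after_fix(const AllocateInfo &a) {
    if (a.has_custom_new) {
        return false;
    }
    return a.has_constant_size;
}
```

In this model, an allocation group with a constant size and a custom allocator is reported as on-stack by the first function and as not on-stack by the second, while allocations without a custom allocator are unaffected.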
@abadams I wonder if we can't simply rely on Allocate::MemoryType to determine on_stack, or is that still Auto at that moment?