halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.92k stars 1.07k forks source link

Cannot clone_in() two different functions. #6476

Open mcourteaux opened 2 years ago

mcourteaux commented 2 years ago
class GeneratorBug : public Generator<GeneratorBug> {
   public:
    Input<Buffer<float>> input1{"input1", 2};
    Output<Buffer<float>> output1{"output1", 1};
    Output<Buffer<float>> output2{"output2", 1};

    Var k{"kernel"};

    Func intermediate{"intermediate"};

    void generate() {
        intermediate(k) = input1(k, 0) * input1(k, 1);
        output1(k) = intermediate(k) * intermediate(k);
        output2(k) = intermediate(k) + intermediate(k);
    }

    void schedule() {
        // clang-format off
        Func intm_clone_1 = intermediate.clone_in(output1);
        Func intm_clone_2 = intermediate.clone_in(output2); // Error (see below for stack trace)

        intm_clone_1.compute_root();
        intm_clone_2.compute_root();
        // clang-format on
    }
};

HALIDE_REGISTER_GENERATOR(GeneratorBug, generator_bug)

Gives:

Unhandled exception: Internal Error at /home/martijn/w/3rd/Halide/src/Schedule.cpp:348 triggered by user code at : Condition failed: copied_func.defined(): intermediate_clone_in_output1$0

Stack trace, right before assertion:

Thread 1 "generator_em" hit Breakpoint 1, Halide::Internal::FuncSchedule::deep_copy (this=0x5555562f4a50, copied_map=std::map with 2 elements = {...}) at /home/martijn/w/3rd/Halide/src/Schedule.cpp:348
348        internal_assert(copied_func.defined()) << Function(iter.second).name() << "\n";
(gdb) bt
#0  Halide::Internal::FuncSchedule::deep_copy (this=0x5555562f4a50, copied_map=std::map with 2 elements = {...}) at /home/martijn/w/3rd/Halide/src/Schedule.cpp:348
#1  0x00007ffff23bbd34 in Halide::Internal::Function::deep_copy (this=0x7fffffffc0c0, copy=..., copied_map=std::map with 2 elements = {...}) at /home/martijn/w/3rd/Halide/src/Function.cpp:358
#2  0x00007ffff23bc324 in Halide::Internal::Function::deep_copy (this=0x7fffffffc0c0, name="intermediate_clone_in_output2$1", copy=..., copied_map=std::map with 2 elements = {...}) at /home/martijn/w/3rd/Halide/src/Function.cpp:382
#3  0x00007ffff2389451 in Halide::(anonymous namespace)::create_clone_wrapper (wrapped_fn=..., wrapper_name="intermediate_clone_in_output2$1") at /home/martijn/w/3rd/Halide/src/Func.cpp:1954
#4  0x00007ffff23899eb in Halide::(anonymous namespace)::get_wrapper (wrapped_fn=..., wrapper_name="intermediate_clone_in_output2$1", fs=std::vector of length 1, capacity 1 = {...}, clone=true) at /home/martijn/w/3rd/Halide/src/Func.cpp:1977
#5  0x00007ffff238a6e0 in Halide::Func::clone_in (this=0x555555974780, f=...) at /home/martijn/w/3rd/Halide/src/Func.cpp:2023
#6  0x00005555555718f8 in GeneratorBug::schedule (this=0x555555974440) at /usr/include/c++/9/bits/atomic_base.h:318

Analyzing which one of the two clone_in()s it is:

(gdb) f 5
#5  0x00007ffff238a6e0 in Halide::Func::clone_in (this=0x555555974780, f=...) at /home/martijn/w/3rd/Halide/src/Func.cpp:2023
2023       return get_wrapper(func, name() + "_clone_in_" + f.name(), fs, true);
(gdb) p *this
$1 = {func = {contents = {strong = {ptr = 0x555555e38740}, weak = 0x0, idx = 0}}, pipeline_ = {contents = {ptr = 0x0}}}
(gdb) p this->name()
$2 = "intermediate"
(gdb) p f
$3 = (const Halide::Func &) @0x7fffffffc420: {func = {contents = {strong = {ptr = 0x5555560d7600}, weak = 0x0, idx = 0}}, pipeline_ = {contents = {ptr = 0x0}}}
(gdb) p f.name()
$4 = "output2"
(gdb) 

Reveals that it's the second clone_in() call, i.e.: the one on output2, yet the error message is talking about intermediate_clone_in_output1 (notice output1!). So it seems that the first call to clone_in(output1) seems to change intermediate globally, instead of only in the context of output1.

mcourteaux commented 2 years ago

I kinda think that the second clone_in() deep copies the schedule associated with the existing first wrapper along, which is unintended here I believe.

https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Func.cpp#L1954

which goes to this line in Function::deep_copy():

https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Function.cpp#L358

Copying over the wrapper schedule inserted by: https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Function.cpp#L980-L991

Called from: https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Func.cpp#L1977-L1988

mcourteaux commented 2 years ago

I don't understand why a clone_in() should register a wrapper with the original function. I feel like clone_in() should be equivalent to copy-pasting the code for that Func in the generator, and have it be a completely separate Func. No link to the original one. Am I missing something, or is this the actual bug: Func get_wrapper(Function wrapped_fn, string wrapper_name, const vector<Func> &fs, bool clone) always registering a wrapper (both for .clone_in(Func) and for .in(Func))?

(Although, me not understand why, might be due to the fact that I don't know what the purpose of the registering is whatsoever. It just seems odd that a deep copy of a function still has a reference/connection/link to the original function.)

abadams commented 2 years ago

I think it's more that the list of wrappers is also used to store the list of clones. When lowering, consumers should call the appropriate clone instead of the original Func.

If your diagnosis is correct, then using the schedule of the first wrapper definitely seems like the wrong thing.

mcourteaux commented 2 years ago

As extra info: the use case is to preload two buffers (length n and k) into shared memory of GPU that will do computations on the outer product of those two buffers (yielding a function with n*k elements). Computing these n*k elements everywhere they are used in the pipeline is much faster than compute_root() them and have n*k memory accesses for loading them. Using the preload I have n+k global memory reads, and just redo the computation. This outer product is used multiple times, so I want to schedule the preload appropriately for each call site. My use case here is to do the outer_product.clone_in(call_site_1); and the same for call_site_2.


Currently I just duplicated the "outer product" function just 3 times. One instance per call site.