mcourteaux opened 2 years ago
I kinda think that the second clone_in() deep copies the schedule associated with the existing first wrapper as well, which is unintended here, I believe.
https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Func.cpp#L1954
which goes to this line in Function::deep_copy():
https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Function.cpp#L358
Copying over the wrapper schedule inserted by: https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Function.cpp#L980-L991
Called from: https://github.com/halide/Halide/blob/fb305fd73a2727fdf3682bade6a0c75ed1785524/src/Func.cpp#L1977-L1988
I don't understand why a clone_in() should register a wrapper with the original function. I feel like clone_in() should be equivalent to copy-pasting the code for that Func in the generator, producing a completely separate Func with no link to the original one. Am I missing something, or is this the actual bug: Func get_wrapper(Function wrapped_fn, string wrapper_name, const vector<Func> &fs, bool clone) always registering a wrapper (both for .clone_in(Func) and for .in(Func))?
(Although my not understanding why might just be because I don't know what the purpose of the registering is in the first place. It still seems odd that a deep copy of a function keeps a reference/connection/link to the original function.)
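To make that expectation concrete, here is a minimal hypothetical sketch (f, g, and h are made-up names, not from my pipeline) of how I picture clone_in() behaving: the clone is a completely independent copy that only the named consumer calls, with a fresh schedule of its own.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func f("f"), g("g"), h("h");
    Var x("x");

    f(x) = x * x;
    g(x) = f(x) + 1;   // g should end up calling the clone
    h(x) = f(x) + 2;   // h keeps calling the original f

    // clone_in() returns the clone; I'd expect it to be a completely
    // independent Func, schedulable without affecting f or f's other consumers.
    Func f_clone = f.clone_in(g);

    f.compute_root();
    f_clone.compute_at(g, x);

    g.realize({64});
    h.realize({64});
}
```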
I think it's more that the list of wrappers is also used to store the list of clones. When lowering, consumers should call the appropriate clone instead of the original Func.
If your diagnosis is correct, then using the schedule of the first wrapper definitely seems like the wrong thing.
As extra info: the use case is to preload two buffers (length n and k) into GPU shared memory and do computations on the outer product of those two buffers (yielding a function with n*k elements). Computing these n*k elements everywhere they are used in the pipeline is much faster than compute_root()-ing them and paying n*k memory accesses to load them: with the preload I do only n+k global memory reads and just redo the computation. This outer product is used multiple times, so I want to schedule the preload appropriately for each call site. My use case here is to do outer_product.clone_in(call_site_1), and the same for call_site_2.
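Roughly what I'm trying to do, as a simplified, hypothetical sketch (the names vec_a, vec_b, call_site_1/2 and the 16x16 tiling are made up, and the real pipeline is more involved):

```cpp
#include "Halide.h"
using namespace Halide;

void build_pipeline() {
    Func vec_a("vec_a"), vec_b("vec_b"), outer_product("outer_product");
    Func call_site_1("call_site_1"), call_site_2("call_site_2");
    Var i("i"), j("j"), x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    vec_a(i) = cast<float>(i);      // stand-in for the length-n buffer
    vec_b(j) = cast<float>(j);      // stand-in for the length-k buffer

    // n + k global loads feed the n*k outer-product values.
    outer_product(i, j) = vec_a(i) * vec_b(j);

    // Two stand-in consumers of the same outer product.
    call_site_1(x, y) = outer_product(x % 16, y % 16) + 1.0f;
    call_site_2(x, y) = outer_product(y % 16, x % 16) * 2.0f;

    // One clone per call site; each clone stays inlined (recomputed where
    // used), but gets its own independently scheduled preload of the vectors.
    Func op1 = outer_product.clone_in(call_site_1);
    Func op2 = outer_product.clone_in(call_site_2);   // the second clone_in() is where it goes wrong for me

    call_site_1.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    call_site_2.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);

    // Preload the two vectors into shared memory once per GPU block of each call site.
    vec_a.in(op1).compute_at(call_site_1, xo).store_in(MemoryType::GPUShared).gpu_threads(i);
    vec_b.in(op1).compute_at(call_site_1, xo).store_in(MemoryType::GPUShared).gpu_threads(j);
    vec_a.in(op2).compute_at(call_site_2, xo).store_in(MemoryType::GPUShared).gpu_threads(i);
    vec_b.in(op2).compute_at(call_site_2, xo).store_in(MemoryType::GPUShared).gpu_threads(j);
}
```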
Currently I just duplicated the "outer product" function 3 times, one instance per call site.
Gives:
Stack trace, right before assertion:
Analyzing which one of the two clone_in()s it is reveals that it's the second clone_in() call, i.e. the one on output2, yet the error message is talking about intermediate_clone_in_output1 (notice output1!). So it seems that the first call to clone_in(output1) changes intermediate globally, instead of only in the context of output1.
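For reference, the failing pattern boils down to something like the following. This is a hypothetical reconstruction using the Func names from the error message, not my actual pipeline, so take it as a sketch of the shape of the problem rather than a verified repro:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    Func intermediate("intermediate"), output1("output1"), output2("output2");
    Var x("x");

    intermediate(x) = x * x;
    output1(x) = intermediate(x) + 1;
    output2(x) = intermediate(x) + 2;

    // First clone: registers a wrapper/clone (named something like
    // "intermediate_clone_in_output1") with the original intermediate.
    Func clone1 = intermediate.clone_in(output1);

    // Second clone: this is the call where the assertion fires for me,
    // even though the message complains about intermediate_clone_in_output1.
    Func clone2 = intermediate.clone_in(output2);
}
```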