mcourteaux opened 3 months ago
I think I narrowed it down to the scenario where the buffer does not have a device allocation, but you realize into a crop of it. The cropped buffer sees that there is no device allocation, and thus allocates, but it allocates only the crop instead of the full buffer.

Also, dirty bits are not updated on the underlying buffer when the cropped/sliced buffer is made dirty. @abadams Can we discuss this at the dev meeting? It's failing in many subtle ways, so I think some input will be valuable.
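To make this concrete, here is a minimal sketch of the scenario (the `denoise` pipeline is a hypothetical stand-in for my AOT-compiled CUDA pipeline; any pipeline that produces its output on the device would do):

```cpp
#include "HalideBuffer.h"

using Halide::Runtime::Buffer;

// Hypothetical AOT-compiled pipeline that writes its output on the GPU.
extern "C" int denoise(halide_buffer_t *in, halide_buffer_t *out);

void run(Buffer<uint16_t> &input) {
    // Parent buffer: host allocation only, no device allocation yet.
    Buffer<uint16_t> full(input.width(), input.height(), 3);

    // Slice out one channel. The slice shares host memory with `full`,
    // but neither buffer owns any device memory at this point.
    Buffer<uint16_t> channel = full.sliced(2, 0);

    // Realizing into the slice makes the runtime device_malloc() just the
    // sliced region: `full` never gets (or aliases) a device allocation,
    // and it is not marked device-dirty either.
    denoise(input.raw_buffer(), channel.raw_buffer());
}
```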
So, just to be clear: to fix my particular issue I did:
```cpp
// (2) Make buffer for the result
U16_Image denoised_YCoCg(noisy.width(), noisy.height(), noisy.channels());
denoised_YCoCg.device_malloc(halide_cuda_device_interface());
// ...
// (5) Work with the denoised_YCoCg buffer...
denoised_YCoCg.set_device_dirty(true); // Tell the parent buffer that it actually changed!
```
Summary of how it currently works:
The issue arises because the `crop_from()` function is not called, which would have kept track of the original buffer. As a result, the new cropped `Buffer` object doesn't know which other `Buffer` it is a crop of. So, at the very least, we should always keep track of which other `Buffer` a `Buffer` is a crop of, including when there is no device-side memory yet.
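To illustrate the asymmetry this causes (a sketch assuming the CUDA backend, describing the behavior as I currently understand it):

```cpp
#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

using Halide::Runtime::Buffer;

void illustrate() {
    // Crop taken AFTER the parent already has device memory:
    // the crop aliases into the parent's device allocation.
    Buffer<uint16_t> a(256, 256, 3);
    a.device_malloc(halide_cuda_device_interface());
    Buffer<uint16_t> a_crop = a.cropped(0, 0, 128);

    // Crop taken BEFORE any device allocation exists:
    // the crop has no record of `b`, so this device_malloc() covers only the
    // cropped region, and `b` itself still ends up without device memory.
    Buffer<uint16_t> b(256, 256, 3);
    Buffer<uint16_t> b_crop = b.cropped(0, 0, 128);
    b_crop.device_malloc(halide_cuda_device_interface());
}
```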
Additional issue: the `device_malloc()` function doesn't take into account that a `halide_buffer_t` might be a crop/slice. `halide_buffer_t` currently has no fields pointing to any crop-related information. Yet pipelines calling out to `device_malloc()` act on a `halide_buffer_t`, not on a `Halide::Runtime::Buffer<>`, so there is currently no way to even know that it is a crop from within `device_malloc()`.
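For reference, this is roughly what the runtime gets to work with (abridged from HalideRuntime.h); nothing here identifies a buffer as a crop of another buffer:

```cpp
// Abridged from HalideRuntime.h.
typedef struct halide_buffer_t {
    uint64_t device;                                          // opaque device allocation handle
    const struct halide_device_interface_t *device_interface;
    uint8_t *host;
    uint64_t flags;                                           // host/device dirty bits
    struct halide_type_t type;
    int32_t dimensions;
    struct halide_dimension_t *dim;
    void *padding;
} halide_buffer_t;

// The allocation entry point only sees the buffer above; it has no way to
// find a parent buffer, because no such link exists in halide_buffer_t.
extern int halide_device_malloc(void *user_context, struct halide_buffer_t *buf,
                                const struct halide_device_interface_t *device_interface);
```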
Conclusion from the dev-meeting:
Either we do:

- Introduce a `halide_device_interface_t` that will serve as a proxy interface that takes care of actually allocating the parent buffer (see the sketch below the footnote). Such a `halide_device_interface_t` can be provided by the C++ wrapper `Halide::Runtime::Buffer` on demand. Afterwards, cropping and slicing `Halide::Runtime::Buffer`s should have the ability to let us explicitly specify that the resulting cropped buffer either gets this custom virtual device interface (and thus correctly behaves as a device-side crop), or doesn't (and therefore behaves as just a temporary device-side allocation of a subregion, with the intent of copying it back to the host once it's done). I think a good term for this feature would be Device-side Aliasing (footnote 1). This will somehow also require reworking where the dirty bits live.

Either way, we need to figure out why we don't see the error that says the device is still dirty when the crop goes out of scope.
(1) This raises the question: how would you make clear that, if you specify that you do NOT want device-side aliasing but a device allocation already exists, the resulting crop is still going to alias the parent device buffer?
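To make the proxy idea a bit more concrete, a very rough sketch (entirely hypothetical: the struct and its fields are invented, the real `halide_device_interface_t` has many more entry points, and ownership/lifetime questions are ignored here):

```cpp
#include "HalideRuntime.h"

// Hypothetical proxy that a cropped Halide::Runtime::Buffer could hand out.
// On allocation it forwards to the *parent* buffer and then aliases into it.
struct crop_proxy {
    halide_buffer_t *parent;                 // the buffer this crop was taken from
    const halide_device_interface_t *real;   // e.g. halide_cuda_device_interface()

    int device_malloc(void *user_context, halide_buffer_t *crop) {
        // Ensure the parent has a device allocation...
        int err = real->device_malloc(user_context, parent, real);
        if (err != halide_error_code_success) {
            return err;
        }
        // ...then make the crop's device handle alias into the parent's
        // allocation, like device_crop() does today when the parent is
        // already allocated.
        return real->device_crop(user_context, parent, crop);
    }
};
```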
@zvookin Managing the dirty bits is something we haven't discussed yet. I think the starting point would be to modify `set_device_dirty()`: it would somehow need to propagate the dirty bit to the parent buffer. However, the link to the parent buffer, we established, was going to be through a virtual device interface, and that interface has no dirty-bit related functions, so I think we might be stuck again with this approach...
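As a strawman, the propagation we'd want looks something like this (purely hypothetical: `parent` stands for a back-reference that `Buffer` does not have today, and there is no way to express this step through the device interface):

```cpp
#include "HalideBuffer.h"

// Strawman: what set_device_dirty() would have to do if a crop kept a
// back-reference to the Buffer it was cropped from.
void set_device_dirty_and_propagate(Halide::Runtime::Buffer<uint16_t> &crop,
                                    Halide::Runtime::Buffer<uint16_t> *parent,
                                    bool v = true) {
    crop.set_device_dirty(v);
    if (parent != nullptr) {
        // This is the step that currently has no home: the virtual
        // device-interface link has no dirty-bit entry points.
        parent->set_device_dirty(v);
    }
}
```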
Here is a repro: comment out either of the two lines below the `// Problem` comments to see it failing.

Original context of my use-case:
I'm working on a denoiser and am currently experimenting with denoising in YCoCg colorspace, with different filter banks for Y and for Co/Cg. So naturally, I...

1. convert the input to YCoCg,
2. make an output buffer that will contain YCoCg (planar layout),
3. make two crops of the output (channel Y and channels Co/Cg),
4. run the denoising pipeline separately to produce both crops,
5. continue working with the original (non-cropped) denoised buffer.

```cpp
using U16_Image = Halide::Runtime::Buffer<uint16_t>;  // 16-bit pixels, as the U16 name suggests
```
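Roughly, the workflow looks like this (a sketch: `rgb_to_ycocg`, `denoise_y`, and `denoise_cocg` are hypothetical stand-ins for my own helper and AOT-compiled pipelines):

```cpp
#include "HalideBuffer.h"

using U16_Image = Halide::Runtime::Buffer<uint16_t>;

// Hypothetical helper and pipelines (stand-ins for my own code):
U16_Image rgb_to_ycocg(const U16_Image &rgb);
extern "C" int denoise_y(halide_buffer_t *in, halide_buffer_t *out);
extern "C" int denoise_cocg(halide_buffer_t *in, halide_buffer_t *out);

void denoise_ycocg(const U16_Image &noisy) {
    U16_Image noisy_YCoCg = rgb_to_ycocg(noisy);                 // (1) convert to YCoCg

    U16_Image denoised_YCoCg(noisy.width(), noisy.height(), 3);  // (2) planar YCoCg output

    U16_Image denoised_Y    = denoised_YCoCg.sliced(2, 0);       // (3) channel Y
    U16_Image denoised_CoCg = denoised_YCoCg.cropped(2, 1, 2);   // (3) channels Co/Cg

    denoise_y(noisy_YCoCg, denoised_Y);                          // (4) run both pipelines
    denoise_cocg(noisy_YCoCg, denoised_CoCg);

    denoised_YCoCg.copy_to_host();                               // (5) keep using the parent buffer
}
```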