halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Realizing to a crop of a Buffer doesn't work on GPU. #8395

Open mcourteaux opened 3 weeks ago

mcourteaux commented 3 weeks ago

Here is a repro:

#include "Halide.h"
#include "HalideBuffer.h"

using namespace Halide;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();

    bool hexagon_rpc = (target.arch != Target::Hexagon) &&
                       target.has_feature(Target::HVX);

    if (!hexagon_rpc && !target.has_gpu_feature()) {
        printf("[SKIP] No GPU target enabled.\n");
        return 0;
    }

    Param<> param(Int(32));
    Func f;
    Var x, y;
    f(x, y) = x * 100 + y + param * 10000;
    f.gpu_single_thread();

    const halide_device_interface_t *itf = get_device_interface_for_device_api(
        get_default_device_api_for_target(target), target);
    Runtime::Buffer<int, 2> buf(20, 20);

    // Problem 1: manually triggering the device allocation is required;
    // otherwise the crops we will make will not alias into the same buffer.
    buf.device_malloc(itf);

    param.set(1);
    f.realize(Pipeline::RealizationArg(buf.cropped({{2, 8}, {4, 4}})), target);
    param.set(2);
    f.realize(Pipeline::RealizationArg(buf.cropped({{0, 1}, {2, 8}})), target);

    // Problem 2: the dirty bits of the crop don't propagate back to the parent,
    // so for copy_to_host() to even consider copying, we need to explicitly tell
    // it that the device is dirty, otherwise it no-ops.
    buf.set_device_dirty(true);

    buf.copy_to_host();
    for (int y = 0; y < buf.dim(1).extent(); ++y) {
        for (int x = 0; x < buf.dim(0).extent(); ++x) {
            printf("% 6d", buf(x, y));
        }
        printf("\n");
    }
    printf("\n");

    printf("Success!\n");
    return 0;
}

Comment out either of the two lines below the // Problem comments to see it fail.


Original context for my use-case:

I'm working on a denoiser and was currently experimenting with denoising in YCoCg colorspace with different filter banks for Y and for Co/Cg. So naturally, I...

1. convert the input to YCoCg
2. make an output buffer that will contain YCoCg (planar layout)
3. make two crops of the output (channel Y and channels Co/Cg)
4. run the denoising pipeline separately to produce both crops
5. continue working with the original (non-cropped) denoised buffer

```cpp
using U16_Image = Halide::Runtime::Buffer<uint16_t>;

// (1) Convert to YCoCg
U16_Image noisy_YCoCg(noisy.width(), noisy.height(), noisy.channels());
neonraw_convert_RGB16_to_YCoCg16R(noisy, noisy_YCoCg);

// (2) Make buffer for the result
U16_Image denoised_YCoCg(noisy.width(), noisy.height(), noisy.channels());

if constexpr (true) {  // Doesn't work on CUDA, but works on CPU.
    // (3/4a) Realize the pipeline to the Y-crop of the output buffer
    U16_Image denoised_Y = denoised_YCoCg.cropped(2, 0, 1);
    neonraw::run_denoiser(fba.filter_banks[mm], noisy_YCoCg, fib, denoised_Y);

    // (3/4b) Realize the pipeline to the CoCg-crop of the output buffer
    U16_Image denoised_CoCg = denoised_YCoCg.cropped(2, 1, 2);
    neonraw::run_denoiser(fba.filter_banks[mm], noisy_YCoCg, fib, denoised_CoCg);
} else {
    // Does work on both CPU and CUDA.
    neonraw::run_denoiser(
        fba.filter_banks[mm], noisy_YCoCg, guide_YCoCg, fib, denoised_YCoCg
    );
}

// (5) Work with the denoised_YCoCg buffer...
```

This strategy, as far as I understand, should work; and it does work on CPU-compiled pipelines (everything computed on host, and no device copies). I tried this on CUDA, OpenCL, and Vulkan; it doesn't work on any of those.
mcourteaux commented 2 weeks ago

I think I narrowed it down to the scenario where the buffer does not have a device allocation yet, but you realize to a crop of it. The cropped buffer sees that there is no device allocation and so allocates one, but only for the crop instead of for the full buffer.
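
To make the aliasing difference concrete, here is a minimal sketch (assuming a CUDA-enabled build linked against the Halide runtime; the comments restate the behavior reported in this issue rather than verified output):

```cpp
#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

using Halide::Runtime::Buffer;

int main() {
    const halide_device_interface_t *itf = halide_cuda_device_interface();

    // Case A: crop taken while the parent has no device allocation yet.
    // When the crop later gets device memory (triggered by hand here, by
    // realize() in the repro), it gets its own allocation covering only the
    // cropped region and does not alias the parent.
    Buffer<int, 2> a(20, 20);
    Buffer<int, 2> a_crop = a.cropped({{2, 8}, {4, 4}});
    a_crop.device_malloc(itf);
    // `a` still has no device allocation here; writes to a_crop's device
    // memory never become visible to `a` without a round-trip through host.

    // Case B: the parent is device-malloc'd first (the workaround from the
    // repro), so the crop aliases into the parent's existing allocation.
    Buffer<int, 2> b(20, 20);
    b.device_malloc(itf);
    Buffer<int, 2> b_crop = b.cropped({{2, 8}, {4, 4}});
    // b_crop's device handle now points into b's device allocation, so a
    // pipeline writing to b_crop writes into b's device memory.

    return 0;
}
```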

Also, dirty bits are not updated on the underlying buffer when the cropped/sliced buffer is made dirty. @abadams Can we discuss this at the dev meeting? It's failing in many subtle ways, so I think some input will be valuable.
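
And a matching sketch for the dirty-bit half of the problem (same assumptions; again, the comments restate what this issue reports):

```cpp
#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

using Halide::Runtime::Buffer;

int main() {
    Buffer<int, 2> parent(20, 20);
    parent.device_malloc(halide_cuda_device_interface());

    // The crop aliases the parent's device allocation...
    Buffer<int, 2> crop = parent.cropped({{2, 8}, {4, 4}});

    // ...but marking the crop dirty (as realize() does after writing to it
    // on the GPU) does not mark the parent dirty:
    crop.set_device_dirty(true);

    // Per this issue: crop.device_dirty() is true, parent.device_dirty() is
    // still false, so parent.copy_to_host() no-ops unless the caller sets
    // the flag by hand, as in the repro's Problem 2 workaround.
    return 0;
}
```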

mcourteaux commented 2 weeks ago

So, just to be clear: to fix my particular issue I did:

// (2) Make buffer for the result
U16_Image denoised_YCoCg(noisy.width(), noisy.height(), noisy.channels());
denoised_YCoCg.device_malloc(halide_cuda_device_interface());

// ...

// (5) Work with the denoised_YCoCg buffer...
denoised_YCoCg.set_device_dirty(true); // Tell the parent buffer that it actually changed!
mcourteaux commented 2 weeks ago

Summary of how it currently works:

The issue arises because:

So at the very least, we should always keep track of which other Buffer a Buffer is a crop of, even when there is no device-side memory yet.
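
As a purely hypothetical illustration of that bookkeeping (none of the names below exist in Halide), a crop could carry a link to the Buffer it was cropped from, so the device allocation can be routed to the parent even when the parent had no device memory at crop time:

```cpp
#include <utility>
#include <vector>

#include "HalideBuffer.h"

// Hypothetical sketch only: a crop that remembers which buffer it is a crop of.
template<typename T, int Dims>
struct CropOf {
    Halide::Runtime::Buffer<T, Dims> *parent;  // buffer this is a crop of
    std::vector<std::pair<int, int>> rect;     // (min, extent) per dimension

    // Materialize the crop for a device pipeline: make sure the *parent* has
    // the device allocation first (mirroring the manual workaround above),
    // then crop, so the crop aliases into the parent's device memory.
    Halide::Runtime::Buffer<T, Dims> for_device(const halide_device_interface_t *itf) {
        if (parent->raw_buffer()->device == 0) {
            (void)parent->device_malloc(itf);
        }
        return parent->cropped(rect);
    }
};
```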

Additional issue:

mcourteaux commented 2 weeks ago

Conclusion from the dev-meeting:

Either we do:

Either way, we need to figure out why we don't see the error that says the device is still dirty when the crop goes out of scope.

(1) This raises the question: how would you make clear that, if you specify that you do NOT want device-side aliasing but a device allocation already exists, the resulting crop is still going to alias the parent device buffer?

mcourteaux commented 2 weeks ago

@zvookin Managing the dirty-bits is something we haven't discussed yet. I think the starting point would be to modify set_device_dirty():

https://github.com/halide/Halide/blob/b87f2b1a77776367b2f2080e4331aaaba25f0fd7/src/runtime/HalideRuntime.h#L1597-L1603

These setters would somehow need to propagate the dirty bit to the parent buffer. However, the link to the parent buffer (we established) was going to be through a virtual device interface, and this interface has no dirty-bit related functions, so I think we might be stuck again with this approach...
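
For reference, the helpers that the permalink points at are (approximately, reproduced from memory; see the link for the exact code) plain bit operations on the buffer's own flags word, with nothing that knows about a parent buffer:

```cpp
#include <cstdint>

// Approximate, trimmed sketch of the flag helpers on halide_buffer_t in
// HalideRuntime.h; the flag names follow the real header, but this is not a
// verbatim copy of it.
enum sketch_buffer_flags {
    halide_buffer_flag_host_dirty = 1,
    halide_buffer_flag_device_dirty = 2,
};

struct sketch_buffer {
    uint64_t flags = 0;

    void set_flag(int flag, bool value) {
        if (value) {
            flags |= flag;
        } else {
            flags &= ~static_cast<uint64_t>(flag);
        }
    }
    // Only this buffer's own `flags` word is touched; there is no link to a
    // parent buffer that this one might be a crop of.
    void set_host_dirty(bool v = true) { set_flag(halide_buffer_flag_host_dirty, v); }
    void set_device_dirty(bool v = true) { set_flag(halide_buffer_flag_device_dirty, v); }
};
```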

mcourteaux commented 2 weeks ago

(image attachment: zalman_martijn)