CUDA runtime performance: parallelism

macpete commented 6 years ago

Hi,

I'm using multiple parallel JIT-compiled Halide pipelines for real-time video processing, using the CUDA back-end. Raw input frames are delivered by frame grabber hardware via DMA into locked PC main memory, which is mapped into the application's address space.

Everything is running fine, functionally. With minor tweaks to the CUDA runtime I could enable the use of individual streams for the video channels (by enabling per-thread default streams and switching to CUDA run-time functions with suffixes _ptds & _ptsz, even the Async versions, where applicable).

However there is virtually no parallelism exploited via these changes, as it is hindered by numerous device allocation and free calls, which cause implicit device-wide synchronization.

This is made worse with each compute_root() I have to use in my schedules. (With the recent merge of the cuda_register_shuffle branch I had to add more compute_root stages to keep JIT compile times acceptable.)

I have already tried to use cudaHostRegister() on my raw input buffers (there is only a fixed number of them, and they are re-used in a round-robin fashion) in the hope of speeding up device uploads. But this hasn't had any effect, as nvvp still shows all host source buffers as type "paged". I even see device_free directly followed by device_alloc of my input buffers (clock: lines were added by me to show that they are indeed back-to-back):

halide_device_free validating input buffer: buffer(1108332249088, 0x7eff95c06108, 0x7eff80c28040, 0, uint8, {-12, 2904, 1}, {-18, 1111, 2904})
CUDA: halide_cuda_device_free (user_context: 0x0, buf: 0x7eff518370d8)
    clock: 4.932616e+03
    cuMemFree 0x1020dc00000
    Time: 1.567780e-01 ms
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x7eff80c28040, 1, uint8, {-12, 2904, 1}, {-18, 1111, 2904})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x7eff80c28040, 1, uint8, {-12, 2904, 1}, {-18, 1111, 2904})
halide_device_malloc: target device interface 0x7eff95c06108
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x7eff518370d8)
    allocating buffer(0, 0x0, 0x7eff80c28040, 1, uint8, {-12, 2904, 1}, {-18, 1111, 2904})
    clock: 4.932799e+03
    cuMemAlloc 3226344 -> 0x1020dc00000
    Time: 3.170760e-01 ms
halide_copy_to_device 0x7eff518370d8 host is dirty
c.extent[0] = 2904
c.extent[1] = 1111
CUDA: halide_cuda_buffer_copy (user_context: 0x0, src: 0x7eff518370d8, dst: 0x7eff518370d8)
    from host to device, 0x7eff80c28040 -> 0x1020dc00000, 3226344 bytes
    Time: 5.373290e-01 ms

I keep all my input and output buffers active and re-use them after processing each frame. But this doesn't help with automatic/intermediate buffers used by compute_root.

Is there any way I can make Halide re-use device buffers from previous realizations to get rid of all device allocations during realization? Maybe cache them in the runtime? This would help tremendously to improve parallelism with CUDA.

Thanks, Marc

abadams commented 6 years ago

Yes, by defining your own halide_cuda_malloc and halide_cuda_free that cache allocations (or supply halide with items from your own pool of allocations). Your definitions should clobber Halide's due to weak linkage. Here's a branch where I added a cache in the runtime: https://github.com/halide/Halide/commit/6dd975dd43e4f1fe4961d9c87c3ea1873acb1d0e

I don't recommend using that branch directly - it's a long way behind master - but it illustrates the sort of thing you can do.

macpete commented 6 years ago

Thanks a lot, that was exactly what I was looking for.

Since your commit still applies to current master, I used that as my starting point to build my own Halide distribution. I only had to fix some minor issues (add multi-thread synchronization, fix some use-after-free, alter the free list to represent a queue).

Now, instead of using a constant, I'd like to set the maximum amount of cached device buffers from my application (which uses the JIT compiler), like you did in your branch for the bilateral_grid test (which uses a generator and AOT compilation). But I'm getting an 'undefined reference' linker error for my call to halide_allocation_cache_set_size().

Is this due to the runtime being loaded dynamically (depending on the selected Target) and not really being available until it has been loaded?

I've tried to provide my own versions of some weak runtime symbols to override those from from stock libHalide, but failed to have them used by Halide. Are generators doing extra work to add runtime symbols to the libraries they create?

macpete commented 6 years ago

By copying the static function lookup_runtime_routine() to my JIT-using application I'm now able to find a pointer to halide_allocation_cache_set_size() and call it, if it's available. This way I'm still compatible with stock Halide.

But now I'm dependent on my own builds of Halide. Is there a chance that a device allocation cache API could be added to Halide to aid streaming applications to work better on GPUs? I'd be glad to share my updates to your prototype, if that helps.

MDBrothers commented 6 years ago

I'd love this feature to implemented into the main branch. I never have just a single image to process, nor do I ever have more workers than images to process. The form factor we're forced to use by what the customer finds acceptable is up to a single dedicated (and secured) workstation at a time, with a workload of around 100k 4096 3112 3 * 32 images to process.

halide / Halide

CUDA runtime performance: parallelism #2743