Open mcourteaux opened 2 years ago
A bump allocator is difficult because there might be a parallel CPU loop outside the GPU kernel launch. Something we've talked about for a while is the ability to preallocate all resources at various scopes with an "allocation plan". It's a tricky thing to get right though due to overlapping lifetimes and such. In the short term, do you know why the built-in caching allocator is still allocating? For most of my pipelines it removes allocations entirely in the steady state. Maybe there's a threshold that needs tuning or some knobs to expose.
I see, regarding the CPU-level parallel-for. Kind of a bummer, as that has no real performance benefit, I think: AFAIK the GPU doesn't do multiple things in parallel anyway, running one kernel at a time even if you have multiple CUDA contexts. What about a fallback to the regular allocator when the allocation sits inside a parallel-for, or something? Because I think 99% of actual use cases won't do this, right?
Regarding what triggers the reallocations, I'm not sure yet. I'll investigate more. :)
A bump allocator can have `push` and `pop` functionality for innermost allocations that would happen in a loop.
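A minimal sketch of what I mean (my own illustration, not an existing Halide API): the allocator hands out offsets into one big preallocated region, and `push`/`pop` just save and restore the current offset so everything allocated inside a loop iteration is released in O(1).

```cpp
// Illustration only -- not existing Halide code. A bump allocator over one
// preallocated region; push()/pop() save and restore the current offset so
// allocations made inside a loop body are all "freed" at once.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BumpAllocator {
    uint8_t *base;              // start of the preallocated region (e.g. one cuMemAlloc)
    size_t capacity;            // total bytes available
    size_t offset = 0;          // next free byte
    std::vector<size_t> marks;  // saved offsets for push/pop

    void *allocate(size_t bytes, size_t align = 256) {
        size_t aligned = (offset + align - 1) & ~(align - 1);  // align must be a power of two
        assert(aligned + bytes <= capacity && "bump allocator capacity too small");
        offset = aligned + bytes;
        return base + aligned;
    }
    void push() { marks.push_back(offset); }                   // enter a loop scope
    void pop() { offset = marks.back(); marks.pop_back(); }    // leave it, freeing everything since push
    void reset() { offset = 0; marks.clear(); }                // end of a pipeline run
};
```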
I think I figured out why there are so many reallocations happening for me: `k` is the dimension I align to 1024, but I have quite a few intermediate buffers, like `k*16*7*7`, `k*7*7`, `k*7`, `k`, `k*32`, `k*16`, etc. 7 has a special role in my code: I'm working with 7x7 matrices and 7-vectors. With `k` aligned to 1024, it typically takes one of 8 values: 1024, 2048, ..., 8192. So yeah, all these combinations give more than 32 distinct buffer sizes, I assume. I'll test now with `1024*12` alignment, which I expect to be my worst case...
...one test later...
Yes! There are very few allocations and frees now. Still not all of them are gone, but this is getting into an acceptable range.
Overall performance of the application is around 20% faster, I think. :smile:
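To sanity-check the arithmetic behind that improvement, here is my own back-of-the-envelope illustration (assuming `k` never exceeds 8192 and only the six buffer shapes listed above): with 1024 alignment there are roughly 8 × 6 ≈ 48 size combinations (minus a few collisions), while aligning to `1024*12` pads `k` to a single value, leaving only the six shapes.

```cpp
// Back-of-the-envelope check (my own illustration, not Halide code): count how
// many distinct allocation sizes the caching allocator would see for each
// alignment, assuming k stays within 1..8192 and the six shapes above.
#include <cstdio>
#include <set>

int main() {
    auto align_up = [](long x, long a) { return (x + a - 1) / a * a; };
    const long shapes[] = {16 * 7 * 7, 7 * 7, 7, 1, 32, 16};  // multipliers on k
    for (long alignment : {1024L, 1024L * 12}) {
        std::set<long> sizes;
        for (long k = 1; k <= 8192; ++k) {
            long ka = align_up(k, alignment);
            for (long m : shapes) sizes.insert(ka * m);
        }
        std::printf("alignment %6ld -> %zu distinct buffer sizes\n",
                    alignment, sizes.size());
    }
}
```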
Regarding my first bullet, that not ALL of the pipelines are optimized to use alignment: I'm wondering if it would make sense to enable and disable the allocation reuse system in Halide only for the pipelines where I want to exploit memory reuse.
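Something like this is what I have in mind (a sketch, assuming it is valid to toggle the reuse system between pipeline invocations; `aligned_pipeline` and `other_pipeline` are just placeholders for my AOT pipelines):

```cpp
// Sketch only: turn device-allocation reuse on just for the pipelines that are
// scheduled with the alignment trick, and off again for the rest.
halide_reuse_device_allocations(nullptr, true);   // reuse on for this one
aligned_pipeline(in, out);                        // scheduled with align_storage
halide_reuse_device_allocations(nullptr, false);  // reuse off again
other_pipeline(in2, out2);                        // plain allocate/free behavior
```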
TLDR: the CUDA memory API is slow, and Halide allocates and frees all intermediate buffers on the fly within the pipeline, so buffer reuse is not in my control. Envisioned elegant solution: add a bump-allocator argument to an AOT-compiled pipeline, to be used for all intermediate results. I think this would benefit most targets, actually, as bump allocators seem perfect for a single pipeline run. This way, I can reuse the bump allocator across sequential pipeline runs, as long as I make sure its capacity is large enough.
What are your thoughts? As this will impact my research directly, I'm considering working on this, but I would like to gather some thoughts first. Some things I am wondering: for example, should this become a target feature, something like `Feature::bump_allocator`? Currently, the idea of a target feature to generate the pipeline seems the most appealing to me overall, as it would just add one argument to the generated function signature that takes in the bump allocator.
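Purely hypothetical illustration of what I mean by that extra argument (neither `Feature::bump_allocator` nor `halide_bump_allocator` exists today; the names are invented):

```cpp
// Hypothetical: what the AOT-generated signature could look like if a
// bump-allocator target feature were enabled. The allocator type and the
// extra parameter are invented for illustration.
struct halide_bump_allocator;  // opaque handle owning one big device region

extern "C" int my_pipeline(struct halide_buffer_t *input,
                           struct halide_bump_allocator *scratch,  // new argument
                           struct halide_buffer_t *output);
```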
The story of how I got here (feel free to skip):
I'm still optimizing my PhD software in Halide. Currently, a lot of time is wasted on allocations through `cuMemAlloc` and especially `cuMemFree`. Ideally, I'd have a situation where allocations happen only once at the beginning, and frees only once at the end.
The reason is that my pipeline first selects a number of components it will consider, and then runs all the other pipeline elements using those selected elements. Thus, the number of selected elements varies from run to run. This causes a lot of buffers of different sizes to be required. So, ideally, I want no allocations at all, but just one giant memory region that can fit the worst-case set of buffers.
I investigated in Halide, and there is an option to use `halide_reuse_device_allocations(nullptr, true);`, which already improved things a little. Next, I did tricks like `.align_storage(k, 512)` to make sure Halide allocates in multiples of 512, heavily increasing the buffer-reuse possibilities and decreasing fragmentation in the Halide-internal device-memory allocator. This reduced memory allocations/frees by a lot (I guess 70%). But still, there are a whole bunch of them left, which do take up time.
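A sketch of that workaround (the `Func`/`Var` names are placeholders for my actual generator code, not the real schedule):

```cpp
// In the generator: pad the storage extent of the k dimension of each
// intermediate to a multiple of 512, so many intermediates end up with the
// same allocation size and can hit the reuse cache.
Halide::Func intermediate("intermediate");
Halide::Var k("k");
// ... definition of intermediate over k ...
intermediate.compute_root().align_storage(k, 512);

// Host side, once at startup: keep freed device allocations around for reuse
// instead of returning them to CUDA with cuMemFree immediately.
halide_reuse_device_allocations(nullptr, true);
```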
For some reason (I'm not familiar with CUDA), `cuMemFree` is synchronous at the API level, as can be seen in this screenshot from the performance profiler. In this screenshot, all of the time the GPU is not working is because it is waiting and synchronizing on memory operations.
Thinking about this, I came to the conclusion that the problem lies in the fact that Halide allocates and frees all intermediate memory buffers inside the pipeline itself. Halide makes an effort to compute the required buffer sizes in the pipeline, which is nice, as it can use those sizes to allocate exactly the required amount of memory for each buffer. However, this allocation could instead happen on a bump allocator, which already has a large on-device buffer ready. This would yield instantaneous allocations, instead of waiting on `cuMemAlloc` and `cuMemFree` and wasting compute time, like in the screenshot above.
In the end, the bump allocator just resets at the end of the pipeline (or alternatively at the beginning?). It is then the programmer's responsibility to make sure the bump-allocator capacity is big enough to provide for all the intermediate buffers the Halide pipeline will need.