As it turns out, yes, I have been thinking about this. I think there are several overlapping issues here.
Each buffer is a little different in how these should be handled. I'll start with the ptcl as it has one of the best stories. I think it makes sense for that to have a modest sized allocation (possibly using scene complexity and available GPU memory as inputs to a heuristic), so it's certainly possible the entire ptcl doesn't fit.
The way I'd handle that is to have a scoreboard indicating the progress at each tile - the representation I have in mind is 1 + the pass number at which the tile completed, or 0 if it hasn't. The ptcl writer simply bails if the atomic bump allocation fails, leaving the scoreboard untouched. The fine raster only proceeds if the pass number matches. Therefore, you get the whole surface rendered by repeating the coarse raster (ptcl generation) and fine raster steps, until all tiles complete.
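To make that concrete, here's a minimal Rust model of the scoreboard logic (the real thing would live in the coarse and fine raster kernels; the `Scoreboard` type and its method names are just illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Per-tile progress: 0 means not yet complete, otherwise 1 + the pass number
/// at which the tile's ptcl was fully written.
struct Scoreboard {
    tiles: Vec<AtomicU32>,
}

impl Scoreboard {
    fn new(n_tiles: usize) -> Self {
        Scoreboard {
            tiles: (0..n_tiles).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// The ptcl writer calls this only if it finished the tile; if the bump
    /// allocation failed partway through, it bails without touching the entry,
    /// so the tile is retried on the next pass.
    fn mark_complete(&self, tile: usize, pass: u32) {
        self.tiles[tile].store(pass + 1, Ordering::Relaxed);
    }

    /// Fine raster for `pass` only proceeds on tiles completed in that pass.
    fn ready_for_fine(&self, tile: usize, pass: u32) -> bool {
        self.tiles[tile].load(Ordering::Relaxed) == pass + 1
    }

    /// Another coarse+fine pass is needed while any entry is still 0.
    fn all_done(&self) -> bool {
        self.tiles.iter().all(|t| t.load(Ordering::Relaxed) != 0)
    }
}
```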
That in turn requires more sophistication in dispatching the compute kernels, as you have to know how many times to iterate. Basically, when dispatching the coarse+fine raster part of the pipeline, it needs to wait on the fence for the cmd buffer to complete, then do a readback of info indicating whether another pass is needed (the easiest way to do this is the value of the bump allocator after coarse raster).
How should it wait? There are two choices: one is to integrate an async runtime, the other is to dedicate a thread to the renderer and do a blocking wait on the fence. The Rust community is increasingly enamored with the former approach, but I think the latter might be better all around. Another minor point while I'm thinking about synchronization: you can improve CPU/GPU overlap here by doing two separate cmdbuf submissions, one for the coarse raster and one for the fine, then waiting on the fence for the first one before deciding whether to schedule another pass. If no further pass is needed, the CPU can submit an empty cmdlist that signals the semaphore for presentation.
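A rough sketch of that CPU-side loop, assuming a dedicated renderer thread doing blocking waits. The closures stand in for command-buffer recording/submission, the fence wait, and the readback; none of these names are actual piet-gpu APIs:

```rust
/// Sketch of the coarse+fine dispatch loop with the two-submission trick for
/// CPU/GPU overlap.
fn run_coarse_fine(
    mut submit_coarse: impl FnMut(u32),
    mut submit_fine: impl FnMut(u32),
    mut wait_coarse_and_read_bump: impl FnMut() -> u32,
    mut signal_present: impl FnMut(),
    ptcl_capacity: u32,
) {
    let mut pass = 0u32;
    loop {
        // Submit the coarse raster (ptcl generation) for this pass.
        submit_coarse(pass);
        // Submit the fine raster without waiting on the coarse fence; tiles whose
        // ptcl didn't complete this pass are gated by the scoreboard and do
        // (almost) no work.
        submit_fine(pass);
        // Wait on the coarse fence, then read back the bump allocator value to
        // decide whether another pass is needed.
        if wait_coarse_and_read_bump() <= ptcl_capacity {
            // Everything fit: e.g. submit an empty cmdlist that signals the
            // semaphore for presentation.
            signal_present();
            break;
        }
        pass += 1;
    }
}
```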
Btw, *A High-Performance Software Graphics Pipeline Architecture for the GPU* takes an interestingly different approach. The ptcl and fine raster steps are interleaved, and the scheduling is managed by implementing the render pipeline as a "megakernel." They explicitly cite memory usage for the intermediate buffers as a motivation (see Fig 2), but I'm not convinced this is the direction we want to go.
The path coarse raster is nowhere near as nice, because unlike the ptcl, a partial result is not really usable in future stages. Here I think the stage should just fail, and be retried with a bigger allocation. Also keep in mind that in the current design, the path coarse raster is not binned, so restricting rendering to a subregion doesn't help. That's a decision that could be revisited (it would mean redoing the flattening work for each bin intersecting the path's bbox). In any case, it's likely useful to gather statistics on how many paths failed and how many succeeded; you could make a pretty good estimate of the needed buffer space from that.
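For example, a retry loop along these lines (hypothetical names; `run_stage` stands in for "submit the kernel, wait on its fence, read back the allocator status and path statistics"):

```rust
/// Result of one attempt at the path coarse raster stage.
struct StageResult {
    overflowed: bool,
    /// Estimated total space needed, derived from the failed/succeeded path counts.
    estimate: u32,
}

/// Retry the stage with a bigger allocation until it fits.
fn run_with_retry(mut run_stage: impl FnMut(u32) -> StageResult, mut capacity: u32) -> u32 {
    loop {
        let result = run_stage(capacity);
        if !result.overflowed {
            return capacity; // keep this size as the starting point for the next frame
        }
        // Grow to the estimate, with some slack to avoid another round trip if the
        // estimate is a bit low.
        capacity = result.estimate + result.estimate / 4;
    }
}
```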
The binning stage has the same nature as the path coarse stage, but the memory usage is comparatively very small, so I think it's not much of a problem in practice.
Now, there's a whole other potential approach to this problem, which is to use the painter's algorithm, retaining the target surface across passes. This approach has some serious appeal, especially because it reduces the need to retain all images and image-like resources such as the glyph cache (see #38) before painting. Under certain assumptions, particularly when the scene graph is flat (wrt clipping and blending) and GPU memory is very limited, it might be the best tradeoff.
However, in the presence of blending, the painter's algorithm starts falling apart. In the general case, you have to allocate a temporary image for each level of the clip/blend stack, so the amount of memory needed increases with the depth of the stack. It also requires global memory traffic for the compositing, and precludes (or significantly complicates) many of the optimizations possible when doing blending in the fine raster (see #36). The approach is also generally not efficient when doing antialiased rendering in a mode designed to minimize conflation artifacts (I really should do a writeup of those ideas).
Running shaders on CPU and retaining unchanged scene graph fragments are both interesting issues. For the former, I think one very interesting thing to explore is SwiftShader. I have some thoughts on the latter, but I really think we should get to a complete working renderer before going too far down that path. Among other things, the right thing to do really depends on what's changing and what's not, and the complexity (particularly in the API) can go up. I think that needs to be driven by measurements of where the performance pain points are.
I have some more thoughts on this. Tl;dr is that I think I was trying to be too clever with the ptcl scoreboard, and propose a simpler, more uniform approach below.
First, some analysis of the above, as it's not quite right. In the scoreboard approach, I was trying to avoid throwing away the work that was successful. But the fundamental problem is that there's no guarantee a ptcl "tape" that failed to write on one iteration will succeed on the next. If you model it, it depends on the amount of parallelism and the distribution of tape lengths. The worst case is a lot of parallelism with all the tapes the same length. In that case the scheme can fail even when there's almost enough memory for all the tapes - you can run out of memory when every tape is 90% written, in which case no progress is made at all.
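A tiny simulation makes that worst case concrete: equal-length tapes written in lockstep against a shared bump allocator sized at 90% of what they need. Every tape stalls at 90% and none complete, so the next pass faces exactly the same situation:

```rust
// Model of the worst case: many equal-length ptcl "tapes" written in lockstep
// against a shared bump allocator that is 10% too small.
fn main() {
    let n_tapes = 64usize;
    let tape_len = 100u32;
    let capacity = n_tapes as u32 * tape_len * 9 / 10; // 90% of the total needed
    let mut used = 0u32;
    let mut written = vec![0u32; n_tapes];
    'outer: loop {
        let mut all_done = true;
        // Round-robin one element per tape per step, modeling full parallelism.
        for w in written.iter_mut() {
            if *w < tape_len {
                all_done = false;
                if used == capacity {
                    break 'outer; // the bump allocation fails here
                }
                used += 1;
                *w += 1;
            }
        }
        if all_done {
            break;
        }
    }
    let completed = written.iter().filter(|&&w| w == tape_len).count();
    println!("completed {} of {} tapes", completed, n_tapes); // prints "completed 0 of 64 tapes"
}
```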
I also got the coarse path raster part wrong. It is true that a partial result is not useful, but what I missed is that you can reduce memory requirements by restricting the viewport. The size of the output buffer has one component that's proportional to the number of paths (which can be estimated accurately before submission, now done with `path_count` in the encoder), and another that's proportional to the number of tiles in the viewport. A smaller viewport will most definitely reduce the latter. Further, it should be possible to early-out paths whose bbox does not intersect the viewport, before flattening even (an optimization not currently done).
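Just to illustrate the shape of the estimate (the constants here are made up; only the two-component structure matters):

```rust
/// Rough shape of the estimate for the path coarse raster's output buffer: one
/// component proportional to the path count (known exactly at encode time via
/// `path_count`), the other to the number of tiles in the viewport.
fn estimate_coarse_buf_size(n_paths: u64, viewport_w: u32, viewport_h: u32) -> u64 {
    const TILE_W: u32 = 16; // assumed tile size
    const TILE_H: u32 = 16;
    const BYTES_PER_PATH: u64 = 64; // assumed per-path fixed cost
    const BYTES_PER_TILE: u64 = 8; // assumed average per-tile cost
    let tiles_x = ((viewport_w + TILE_W - 1) / TILE_W) as u64;
    let tiles_y = ((viewport_h + TILE_H - 1) / TILE_H) as u64;
    n_paths * BYTES_PER_PATH + tiles_x * tiles_y * BYTES_PER_TILE
}
```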
So through that lens, all the allocations have a similar pattern, and I think a common strategy can and should be used for all of them. The pattern is that there is some component which can be estimated simply and accurately during encoding, and another that scales well with viewport size. I am proposing that we keep the principle that all resources (including encoded elements) need to be staged before submission. If the scene has a million elements (as in the paris-30k example), you will need multiple tens of megabytes of buffers; there's no way around it.
The strategy I propose is that all kernels keep using a simple atomic bump allocator, compare against capacity, and take an early-out if it's exceeded. CPU-side, you wait on a fence after each kernel and read back the status of the allocator. If the stage failed because memory was exceeded, you go back, subdivide the viewport, and resubmit from at least the stage that failed (possibly earlier, if it's a pain to manage viewports of different sizes).
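Here's a sketch of how the pieces could fit together. All names are hypothetical; the allocator logic really lives in the shaders, and `run_stages` stands in for "submit each stage, wait on its fence, read back the allocator status":

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// CPU-side model of the pattern each kernel would use: bump-allocate, compare
/// against capacity, and let the caller take an early-out on failure. The CPU
/// detects failure by reading back an offset greater than the capacity.
struct BumpAlloc {
    offset: AtomicU32,
    capacity: u32,
}

impl BumpAlloc {
    fn alloc(&self, n: u32) -> Option<u32> {
        let start = self.offset.fetch_add(n, Ordering::Relaxed);
        if start + n > self.capacity {
            None // caller stops work for this invocation
        } else {
            Some(start)
        }
    }
}

#[derive(Clone, Copy)]
struct Viewport {
    x0: u32,
    y0: u32,
    x1: u32,
    y1: u32,
}

impl Viewport {
    /// Split along the longer axis.
    fn split(self) -> (Viewport, Viewport) {
        if self.x1 - self.x0 >= self.y1 - self.y0 {
            let xm = (self.x0 + self.x1) / 2;
            (Viewport { x1: xm, ..self }, Viewport { x0: xm, ..self })
        } else {
            let ym = (self.y0 + self.y1) / 2;
            (Viewport { y1: ym, ..self }, Viewport { y0: ym, ..self })
        }
    }
}

/// Run the whole pipeline for a viewport; if any stage reports that an allocator
/// overflowed, subdivide and redo from the start (the simple option, rather than
/// resuming at the failed stage).
fn render(viewport: Viewport, mut run_stages: impl FnMut(Viewport) -> bool) {
    let mut work = vec![viewport];
    while let Some(vp) = work.pop() {
        if run_stages(vp) {
            continue;
        }
        // A real implementation would also need a fallback (e.g. grow the buffers)
        // once the viewport can't be split any further.
        if vp.x1 - vp.x0 > 1 || vp.y1 - vp.y0 > 1 {
            let (a, b) = vp.split();
            work.push(a);
            work.push(b);
        }
    }
}
```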
There are (as always) advantages and disadvantages. Overall I think it's a pretty robust approach, largely because the probability of running out of memory is pretty well correlated across stages: if you fail and subdivide at some stage, the probability that the later stages will succeed with a smaller viewport goes way up. There's also an appealing simplicity, especially in the kernels, as the only thing they need to do specially is stop work when allocation fails. This matters because there are potentially a lot of different allocations, and I expect to add more. For example, if we do any kind of stack monad for element processing (say, to accumulate the bboxes for clipping), the monad operator can allocate.
The main disadvantage is that the CPU is waiting on a fence after each kernel, which can reduce utilization. I haven't measured it, though, and I have at least two ideas that could reduce the impact. One is a spin on the ptcl/fine raster idea above. If you have kernels A and B, where B depends on the result of A, then the CPU does a command buffer submission on the dispatch of A, getting a fence, then a command buffer submission on the dispatch of B (without waiting on the fence), then waits on the A fence. The B kernel checks the allocation state of A and does an early-out on failure, so it wastes a minimal amount of work. This pattern can be pipelined as well. I think empirical testing is needed to see how much it helps.
Another approach to the utilization issue is to use multiple queues (the performance benefits of "async compute" generally come from multiple queues). Unfortunately that doesn't help with memory requirements, as you need buffers for all queues that are in flight. Also, I think they're only well supported on discrete graphics; Intel doesn't seem to do them (at least pre-Xe), and I haven't looked into mobile.
piet-gpu allocates large GPU memory for its working memory. I'd like for memory use to be proportional to scene complexity, or even a constant low amount with several passes over complex scenes. I'm particularly interested in a multi-pass scheme because I want to run piet-gpu shaders on the CPU at some point, where caching unchanged scenery is vital.
Again, if you have thoughts in this area, I'm all ears; otherwise I'll see if I can work something out.