Open raphlinus opened 1 year ago
We've done considerable thinking on this issue, and here's where we are now.
For production usage, async is just too big a burden on the client, and sync blocking on readback would be native-only and impact responsiveness (including accessibility, as the UI runloop would not be able to respond to accessibility events while blocking on the GPU). Thus, all production uses will use a "fire and forget" modality. In most cases, that's option 7 above.
It's worth expanding on that. We already have a blit in place because wgpu doesn't implement binding the render target (swapchain) as storage. We can just skip writing to the storage texture on failure, letting the blit read the previous contents. This adds little overhead or implementation work to the current state. If wgpu gains that capability at some point (as I believe is required by the WebGPU spec), we can do more juggling of textures shared between the fine shader and the swapchain, as suggested by option 7. The details will probably be tricky.
This also interacts favorably with incremental present (damage regions). If we have the blit in place, then the rendering pipeline can render just the part of the scene that has changed. Platform support is uneven, though. Neither WebGPU nor wgpu supports plumbing incremental present to the compositor, nor does Metal. It may be partially implemented on Windows: D3D supports it through DXGI's Present1, but the newer presentation manager's Present method seems to lack the capability. In my testing, the Vulkan incremental present extension was not plumbed through to the compositor.
To detect failure, we have a non-async readback, a "get last success/failure" query that is intended to be called at the beginning of the next paint cycle. If this indicates failure, then the client can reallocate buffers before issuing another render.
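A minimal sketch of that paint cycle, for illustration only. The `Renderer` and `FrameStatus` names are hypothetical, not Vello's API, and the GPU is replaced by a simple comparison between a made-up scene cost and the buffer size. It also assumes the status of the previous frame is already available when the next paint begins, which pipelining may not guarantee.

```rust
/// Outcome of the most recently submitted frame, as reported by a
/// hypothetical non-blocking "get last success/failure" query.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FrameStatus {
    Success,
    /// The bump allocator ran out of space; `needed` is the size it wanted.
    OutOfMemory { needed: u32 },
}

/// Hypothetical renderer state; names are illustrative only.
pub struct Renderer {
    pub buffer_size: u32,
    last_status: FrameStatus,
}

impl Renderer {
    pub fn new(buffer_size: u32) -> Self {
        Renderer { buffer_size, last_status: FrameStatus::Success }
    }

    /// One paint cycle. Returns true if the frame rendered, false if the
    /// previous frame's contents would be shown instead.
    pub fn paint(&mut self, scene_cost: u32) -> bool {
        // Step 1: consult the status left behind by the previous frame and
        // reallocate before issuing another render.
        if let FrameStatus::OutOfMemory { needed } = self.last_status {
            self.buffer_size = self.buffer_size.max(needed.next_power_of_two());
        }
        // Step 2: "fire and forget". Here the GPU is simulated by comparing
        // the scene cost against the buffer size.
        let fits = scene_cost <= self.buffer_size;
        self.last_status = if fits {
            FrameStatus::Success
        } else {
            FrameStatus::OutOfMemory { needed: scene_cost }
        };
        fits
    }
}
```

With a 1024-unit buffer, a scene of cost 3000 fails once, and the following paint succeeds after the buffer has grown to 4096.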
Depending on the application, option 2 may be desirable. In encoding, we can add optional resource usage tracking. For vector paths, that's basically a 3-tuple: number of path segments, perimeter, bounding box area. RAM usage scales with the transform by (1, x, x^2) respectively, where x is the linear scale factor, so "append with transform" would apply that scaling. For retained layers, the cost may not be prohibitive; the big hit is in the dynamic case.
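As a rough sketch of what such a tuple could look like, assuming a uniform scale factor and using hypothetical names (`ResourceEstimate`, `append_with_transform`) that are not part of Vello's encoding API:

```rust
/// Hypothetical resource-usage estimate for a group of vector paths.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct ResourceEstimate {
    /// Number of path segments (unaffected by scaling).
    pub segments: f64,
    /// Total perimeter (scales linearly with x).
    pub perimeter: f64,
    /// Bounding box area (scales with x squared).
    pub bbox_area: f64,
}

impl ResourceEstimate {
    /// Applies the (1, x, x^2) scaling for a linear scale factor `x`.
    pub fn scaled(self, x: f64) -> Self {
        ResourceEstimate {
            segments: self.segments,
            perimeter: self.perimeter * x,
            bbox_area: self.bbox_area * x * x,
        }
    }

    /// Accumulates a child estimate, scaled by the child's transform.
    pub fn append_with_transform(&mut self, child: &ResourceEstimate, x: f64) {
        let s = child.scaled(x);
        self.segments += s.segments;
        self.perimeter += s.perimeter;
        self.bbox_area += s.bbox_area;
    }
}
```

A retained layer could compute its estimate once and be appended many times at different scales, which is why the cost in that case may stay low.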
Having resource tracking would enable a few other possibilities. One is that in RAM-constrained environments instead of allocating buffers big enough to hold the entire scene, we could spatially subdivide, running multiple passes. The resource tracking tuples may also be useful to detect heavily zoomed out cases (#305) and select alternate strategies. Most crudely, it could render at higher resolution and then downsample later.
There's one more thing to add. While async/readback is unsuitable for production, there is considerable value in testing to run GPU stages and then read back the output. For one, we'd like to be able to run arbitrary mixes of CPU and GPU stages, as opposed to the current limitation where we start on CPU and then cut over to GPU. And for two, we'd like to be able to test individual stages by having validation of the buffer readback. In some cases, that can be done by running both CPU and GPU stage, and comparing the output, though nondeterminism may require other approaches. For these cases, we plan on having an async mode for Recording playback for testing only, not exposed to production clients.
This is a difficult tradeoff space, and one that we'll likely iterate on. I may schedule some research into parallel computers with more agile execution models than least common denominator portable GPUs, but that is a bit unsatisfying in terms of shipping for existing users. I believe what's captured here is a good set of tradeoffs for now.
By the way:
However, that requires the ability to launch work dependent on computation done in a shader. [...]
Isn't that just dispatch_indirect()?
Thanks for the interest.
Unfortunately that's not the case. We need to launch a set of tasks which use the same buffers. That is, the first step outputs part of the answer into a limited size buffer, then the next step uses that buffer, then we launch the remainder of the first step again. dispatch_indirect doesn't let us do that.
One option could be to launch the tasks say 5 times, where if there are no remaining items, the tasks are indirectly dispatched to a size of zero (or 1 if the API requires that). But that is messy, and the following dispatches have the potential to be quite slow, even for doing no work, because they make the GPU context switch.
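To make the fixed-rounds idea concrete, here is a CPU-only simulation. It is only a sketch: `run_rounds`, `MAX_ROUNDS`, and the notion of "items" are hypothetical, and on a real GPU the per-round size would be written by a shader into the indirect-arguments buffer rather than computed on the host.

```rust
/// Number of rounds encoded up front (the "say 5 times" in the text).
pub const MAX_ROUNDS: usize = 5;

/// Simulates a fixed sequence of indirect dispatches that share one
/// limited-size intermediate buffer. Returns the dispatch size chosen for
/// each round and the number of items still unprocessed at the end.
pub fn run_rounds(total_items: u32, buffer_capacity: u32) -> ([u32; MAX_ROUNDS], u32) {
    let mut remaining = total_items;
    let mut sizes = [0u32; MAX_ROUNDS];
    for size in sizes.iter_mut() {
        // Each round handles at most one buffer's worth of items. Once
        // nothing remains, later rounds are dispatched with size zero.
        let n = remaining.min(buffer_capacity);
        *size = n;
        remaining -= n;
    }
    (sizes, remaining)
}
```

The simulation shows both drawbacks: small scenes still pay for the trailing zero-size dispatches, and scenes that need more than `MAX_ROUNDS` rounds end with work left over.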
We're going to have a meeting on this issue tomorrow (UK and US time). Details (including a calendar link) can be found at https://xi.zulipchat.com/#narrow/stream/197075-gpu/topic/Meeting.20on.20Robust.20Dynamic.20Memory
We had this meeting (notes), and did some experiments, leading to a few conclusions.
First, option 2 might not be as expensive as we previously thought. The version we have committed seemed to make encoding take twice as long in most cases. That's still quite costly.
The other observation, about option 7, has more profound implications. In particular, the way that pipelining interacts with it has some poor consequences. Consider the worst-case scenario for this option, where we were rendering a simple scene, and a much more complex scene[^1], which is already slow to render, is entered:
During processing for frame 495, we receive the input which loads paris-30k. Frame 496's CPU side work therefore takes slightly longer, as it is computing based on the larger scene. This launches its work on the GPU, with buffers tuned to the small scene. This will fail, which we want it to, and we will get the data read back from this frame once it does.
However, frame 493's GPU work finished quickly (as it was rendering the simple scene), meaning that swapchain buffer 47 was available.
This means that frame 497 is free to perform its CPU side work. When that CPU side work runs, the bump buffer readback from frame 496 hasn't yet been received, but that's expected. We don't want to block on that, because that means we definitely cannot saturate the GPU in cases where we're not being throttled by vsync.
However, because of this, we have no indication that frame 496 is going to/has already failed. And so, the GPU side work of frame 497 is launched, with the same buffer sizes.
This work is however entirely redundant, as it will fail due to buffer size issues.
(The same thing would happen for frame 498, but the driver prevents this by blocking in VkQueuePresentKHR - waitForever)
However, the GPU work for frame 498, which in our model would know that frame 496 failed, cannot be scheduled onto the GPU until the GPU work for frame 497 (which doesn't result in a new image) completes.
Without blocking on the previous frame completing, there's no feasible way to prevent this case. This means that the pure form of option 7 adds much more latency than previously thought.
[^1]: Startup might be even worse, but this analysis doesn't cover this case.
I opened #541 as a parallel discussion on option 2.
Just dropping this here, as it could be a possible answer to the allocation worries: https://github.com/pcwalton/offset-allocator
It's pcwalton's Rust port of Sebastian Aaltonen's C++ offset allocator (https://github.com/sebbbi/OffsetAllocator), "It's a fast, simple, hard real time allocator..." It's being considered as one potential solution to the allocation hurdles Bevy is facing on its graphics front as well.
Thank you for the pointer. It doesn't address this issue, which is about responding to allocation failures for GPU-driven bump-allocated data structures in which the precise memory requirement is unknown ahead of GPU command submission.
That said, the offset allocator is interesting in general. FWIW, the wgpu_engine.rs code already employs a similar strategy for its resource pool (based on size classes).
One of the stickier points is how to handle robust dynamic memory. The fundamental problem is that the pipeline creates intermediate data structures (grids of tiles containing coarse winding numbers and path segments, per-tile command lists) whose size depends dynamically on the scene being rendered. For example, just changing a transform can significantly affect the number of tiles covered by a path, and thus the size of these data structures.
The standard GPU compute shader execution model cannot express what we need. At the time a command buffer is submitted, all buffers have pre-determined size. There is no way to dynamically allocate memory based on computation done inside a compute shader (note: CUDA doesn't have this limitation, shaders can simply call malloc or invoke C++ new).

Another potential way to address the fundamental problem is to divide work into chunks so that intermediate results fit in fixed size buffers. This would be especially appealing in resource constrained environments where calling malloc may not be guaranteed to succeed, or may cause undesirable resource contention. However, that requires the ability to launch work dependent on computation done in a shader. Again, CUDA can do this (for example, with device graph launch) but it is not a common capability of compute shaders, much less WebGPU.
The previous incarnation, piet-gpu, had a solution (see #175), but with drawbacks. Basically, rendering a scene required a fence back to the GPU to read a buffer with a success/failure indication, with reallocation and retry on failure. However, this requires the ability to do blocking readback (which is missing in WebGPU), and also blocks the calling thread (usually the UI runloop) until the result of the computation is available (which is the reason why it's missing in WebGPU).
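The blocking retry pattern can be sketched roughly as follows. This is an illustration rather than piet-gpu's actual code: `Readback`, `render_blocking`, and the `bytes_needed` field are hypothetical, and the `run_gpu` closure stands in for submitting the work and blocking on the fence.

```rust
/// Hypothetical success/failure indication read back after a render attempt.
pub struct Readback {
    pub failed: bool,
    /// Size the GPU reports it would have needed (an assumption of this sketch).
    pub bytes_needed: usize,
}

/// Renders with reallocation and retry on failure. Each call to `run_gpu`
/// represents a submit followed by a blocking wait for the readback.
/// Returns Ok with the buffer size that worked, or Err with the last size
/// computed once `max_attempts` is exhausted.
pub fn render_blocking<F>(
    mut buffer_size: usize,
    max_attempts: usize,
    mut run_gpu: F,
) -> Result<usize, usize>
where
    F: FnMut(usize) -> Readback,
{
    for _ in 0..max_attempts {
        let readback = run_gpu(buffer_size);
        if !readback.failed {
            return Ok(buffer_size);
        }
        // Grow to the reported requirement, or at least double.
        buffer_size = readback.bytes_needed.max(buffer_size * 2);
    }
    Err(buffer_size)
}
```

The calling thread is stalled inside `run_gpu` on every attempt, which is the cost described above.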
There's no good solution to this problem, only a set of tradeoffs. We need to decide what to implement. The best choice will depend on the details of what's being rendered and which application it's integrated with. In no particular order, here are some of the choices:
1. When the maximum complexity of the scenes being rendered is known in advance, then the buffer sizes can simply be determined in advance. On failure, the scene would fail to render. This may well be the best choice for games and applications in which the UI is not rendering user-provided content. It allows the entire rendering pipeline to be launched as "fire and forget" with no blocking.
2. We could do analysis CPU-side to determine the memory usage, before launching the render. This is simple and poses no integration challenges, but such analysis is slow. In fact, it's probably comparable to running the compute pipeline on CPU and just using the GPU for fine rasterization, a modality we're considering as a compatibility fallback. It may be a viable choice when the scene complexity is low.
3. We can implement blocking in a similar fashion as piet-gpu (this is closest to the current direction of the code). That would be native-only, so would require another approach for Web deployment. It also potentially creates integration issues, as calls to the Vello renderer would have to support blocking, and also somebody has to pump the wgpu process (https://github.com/gfx-rs/wgpu/issues/1871 is potentially relevant in that case). In addition, a downside is that returns to the UI runloop would be delayed, likely impacting other tasks including being responsive to accessibility requests.
4. We can have an async task that fully owns the GPU and associated resources. It would operate as a loop that receives a scene through an async channel, submits as many command buffers as needed with await points for the readback, then returns to the top of the loop after the last such submission. The UI runloop would create a scene, send it over a channel, and immediately return to the runloop. On native, the task would run in the threadpool of an async executor (such as tokio), and on Web it would be invoked by spawn_local. This is appealing in use cases where a Vello-driven UI would be the sole user of the GPU, but poses serious challenges when that is to be shared.
5. We can have a similar async task, but share access to the GPU by wrapping at least the wgpu Device object (and likely other types) in an Arc<Mutex<_>>. This makes it possible, at least in theory, to integrate with other GPU clients, but complicates that integration, as other such clients have to cooperate with the locking discipline. It's been suggested that wgpu implement Clone on Device and related types to directly support such sharing, and it is worth noting that this is not a problem in JavaScript, as all such references are implicitly shareable.
6. We can consider other possibilities where async is not represented by await points in an async Rust function, but rather by a state machine of some kind. The host would be responsible for advancing the state machine on completion of submitted command buffers. This is potentially the most flexible approach, but is complex, and also requires the host to support async.
7. Similar to (1) but with mechanisms in place to recover from error and allocate for the next frame. To minimize visual disturbance, there could be a texture holding the previous frame. The fine shader could then blit from that texture when it detects failure. This is potentially the least invasive solution regarding integration, but deliberately makes jank possible.
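To illustrate the state machine approach described above, here is a minimal sketch in which the host drives rendering by feeding completion events. The `RenderState` and `Event` types are hypothetical and leave out everything a real integration would need, such as multiple command buffers per frame.

```rust
/// Hypothetical render state advanced by the host.
#[derive(Debug, PartialEq)]
pub enum RenderState {
    Idle,
    /// A command buffer is in flight, using buffers of this size.
    Submitted { buffer_size: u32 },
    Done,
}

/// Events the host delivers to the state machine.
pub enum Event {
    /// The host asks for a scene to be rendered.
    Start { buffer_size: u32 },
    /// A previously submitted command buffer finished; `needed` is only
    /// meaningful when `ok` is false.
    Completed { ok: bool, needed: u32 },
}

impl RenderState {
    /// Consumes the current state and returns the next one.
    pub fn advance(self, event: Event) -> RenderState {
        match (self, event) {
            (RenderState::Idle, Event::Start { buffer_size }) => {
                RenderState::Submitted { buffer_size }
            }
            (RenderState::Submitted { .. }, Event::Completed { ok: true, .. }) => {
                RenderState::Done
            }
            // On failure, resubmit with a buffer at least as large as needed.
            (RenderState::Submitted { buffer_size }, Event::Completed { ok: false, needed }) => {
                RenderState::Submitted { buffer_size: needed.max(buffer_size) }
            }
            // Any other combination leaves the state unchanged.
            (state, _) => state,
        }
    }
}
```

No call here ever blocks or awaits; the burden moves to the host, which has to notice command buffer completion and call `advance`.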
A few other comments. There are other applications that can potentially benefit from readback (one is to do hit testing and collision detection on GPU). However, in GL it has historically been associated with horrible performance (for underlying reasons similar to what's been outlined above). In game contexts, a reasonable tradeoff may be to defer the results of readback for one frame, but that's less desirable here, as it results in a frame not being rendered.
It is worth exploring whether there may be practical extensions to the GPU execution model that eliminate the need for CPU readback. As mentioned above, ability to allocate or launch work from within a shader would help enormously.
I'm interested in hearing more about applications and use cases, to decide what we should be building. Some of it is fairly complex, other choices create integration problems. There's no one choice that seems a clear win.
A bit more discussion is in the "Runloop, async, wgpu" Zulip thread, and there are links from that to other resources.