vmilea opened 4 months ago
I've created a live sample to illustrate the issue with mapping large buffers. The code is public and works on Chrome or Firefox Nightly. You can adjust the amount of data uploaded per frame, the buffer size, and the update mode: `queue.writeBuffer()`, mapping the whole buffer, or mapping only the range that will be written. While the last approach gives the best performance, it's really just a brittle hack forced by current API limitations.
Here are frame times on Chrome when uploading 20MB of data:
Device | Default | Map (min size - 20MB) | Map (whole buffer - 128MB) | Map (whole buffer - 256MB) |
---|---|---|---|---|
Intel 12900K + NVIDIA RTX 4090 | 11.3ms | 9.1ms | 14.5ms | 21.5ms |
Apple M1 | 17.5ms | 15.5ms | 20ms | 25ms |
As you can see, mapping is faster, so long as we keep the GPU process aware of the written range. Unfortunately the range is defined by `mapAsync()`, which as already explained isn't practical. The typical app wants to map the whole range for writing, then use as much as it needs. I think a simple tweak to the `unmap()` API would completely solve this problem.
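To make the tweak concrete, here is a sketch of the range arithmetic an `unmap(dirtyOffset, dirtySize)` overload could perform. This is entirely hypothetical (no such parameters exist in WebGPU today), written as a standalone function so the defaulting and validation can be checked in isolation:

```typescript
// Hypothetical semantics for unmap(optional dirtyOffset, optional dirtySize):
// compute the range the GPU process would actually copy on unmap.
// Omitting both arguments preserves today's behavior of copying the
// whole mapped range.
function resolveCopyRange(
  mappedSize: number,
  dirtyOffset?: number,
  dirtySize?: number
): [offset: number, size: number] {
  const offset = dirtyOffset ?? 0;
  const size = dirtySize ?? mappedSize - offset;
  if (offset < 0 || size < 0 || offset + size > mappedSize) {
    throw new RangeError("dirty range exceeds the mapped range");
  }
  return [offset, size];
}
```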
Thank you for the detailed investigation, it is extremely clear!
> In theory, written ranges can be inferred from `getMappedRange(offset, size)` as mentioned in the WebGPU Explainer. However, Chrome doesn't seem to perform this optimization. More importantly, it's inconvenient for the app to precalculate the minimal range.
That's definitely the intent, and Chromium should implement that optimization at some point; we just didn't get around to it yet. Could you detail why it would be hard to use `getMappedRange` for this optimization on the application side? I created a Chromium issue for this.
> That's definitely the intent and Chromium should implement that optimization at some point, we just didn't get around to it yet.
That makes the problem more tractable. Thanks for opening the issue!
> Could you detail why it would be hard to use getMappedRange for this optimization on the application side?
Because the caller of `getMappedRange()` may not know in advance how much data will be written. In cases like streaming world assets or dynamically generated content, it can be difficult to precalculate the exact amount. Here is an example.

Let's imagine a `StagingBuffer` that calls `getMappedRange()` once at the start of the frame, and can linearly allocate spans from the mapped `ArrayBuffer`. Scene elements then use it to write varying amounts of data. But now we have unintended coupling, with an extra scene traversal needed to precalculate the total size before `getMappedRange()`!
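A minimal sketch of such a linear allocator (class and method names are hypothetical). In a real app the `ArrayBuffer` would come from a single whole-buffer `getMappedRange()` call; here it is a plain allocation so the suballocation logic stands alone:

```typescript
// Hypothetical linear allocator over one mapped staging buffer.
// In a real app `mapped` would be the result of buffer.getMappedRange().
class StagingAllocator {
  private offset = 0;
  constructor(private mapped: ArrayBuffer) {}

  // Hands out a byte view; returns null when the budget is exhausted.
  alloc(byteLength: number): Uint8Array | null {
    const aligned = (this.offset + 3) & ~3; // keep 4-byte alignment
    if (aligned + byteLength > this.mapped.byteLength) return null;
    const view = new Uint8Array(this.mapped, aligned, byteLength);
    this.offset = aligned + byteLength;
    return view;
  }

  // Total bytes written this frame -- exactly the range we would like
  // to report to unmap(), but cannot with the current API.
  bytesUsed(): number { return this.offset; }
}
```

Scene elements call `alloc()` with whatever size they need; only at the end of the frame is `bytesUsed()` known, which is why precalculating it for `getMappedRange(0, size)` requires an extra traversal.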
Alternatively, a developer may be tempted to simply call `getMappedRange()` for each scene element, hoping for the same optimization without the extra steps. Unfortunately, this will backfire due to the overhead of tracking and validating the numerous range objects.
As a tangent, why does `getMappedRange()` exist? I consider it too rigid for tracking written ranges. Does it provide any other benefits compared to `mapAsync(mode, offset, size)` returning a `Promise<ArrayBuffer>` directly?
getMappedRange() is synchronous, allowing you to determine which ranges need to actually be mapped at the last moment. If you have some kind of streaming write, and so you know the offset but not the size, you can getMappedRange() in blocks until you reach the end of the stream. (getMappedRange() can be called multiple times for ranges within a single mapAsync(), as long as they don't overlap.)
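The block-by-block pattern can be sketched with a small helper (hypothetical; the real calls would be `buffer.getMappedRange(offset, size)`, one per returned range):

```typescript
// Compute the non-overlapping [offset, size] block ranges covering `total`
// bytes. Each range would be requested lazily via getMappedRange(offset, size)
// as a streaming write advances past a block boundary -- ranges must not
// overlap within a single mapAsync().
function blockRanges(total: number, blockSize: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let offset = 0; offset < total; offset += blockSize) {
    ranges.push([offset, Math.min(blockSize, total - offset)]);
  }
  return ranges;
}
```

For example, a 2500-byte stream with 1024-byte blocks yields three ranges, the last one trimmed to the remaining bytes.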
> Could we trim this final copy to the actually written range? Since the amount of data isn't known at the time of `mapAsync()` or even `getMappedRange()`, I propose an API tweak like `unmap(optional dirtyOffset, optional dirtySize)`. Alternatively, a method like `addDirtyRange(offset, size)` could support multiple ranges.
I don't think this is possible because it would require the browser to trust the webpage about which ranges it wrote. If the webpage writes a range and doesn't dirty it, it would result in undefined behavior for whether the data got written, or not written, or partially written, etc.
Thanks for the feedback.
> If you have some kind of streaming write, and so you know the offset but not the size, you can getMappedRange() in blocks until you reach the end of the stream. (getMappedRange() can be called multiple times for ranges within a single mapAsync(), as long as they don't overlap.)
I mentioned this earlier but didn't go into detail. Having to call `getMappedRange()` repeatedly makes the API chatty and adds overhead for validation and `ArrayBuffer` management. It depends on the WebGPU implementation being efficient with GC, detecting range overlap, and merging adjacent ranges. So, performance-wise, you can shoot yourself in the foot by getting, for example, 1000 mapped ranges for uniforms each frame. TBH, I don't know how plausible that is. The obvious workaround would be to get larger ranges and then suballocate.
> I don't think this is possible because it would require the browser to trust the webpage about which ranges it wrote. If the webpage writes a range and doesn't dirty it, it would result in undefined behavior for whether the data got written, or not written, or partially written, etc.
That's a fair point. If the goal of WebGPU is to have well-defined behavior, even if it's just stale data in a buffer after misusing the API, then `getMappedRange()` starts to make sense.
Given the mentioned constraints, and the expected browser optimizations, I think it's fine to close this issue.
> written ranges can be inferred from getMappedRange(offset, size) as mentioned in the WebGPU Explainer
Could we add an implementation note in the spec proper? I think it's of interest to both API users and implementers.
Yes, let's do that. Thanks Corentin for marking this copyediting so we can do that :)
subject was: "Allow specifying the written range when unmapping a buffer"
I noticed an issue while benchmarking CPU-to-GPU data streaming that can degrade the performance of memory-mapped transfers, making them worse than `queue.writeBuffer()`. While there are partial workarounds, a minor API change seems necessary.

### Problem
WebGPU supports buffer mapping to transfer data directly to / from the GPU. Applications can use a ring of staging buffers to continuously feed the GPU with new data each frame, as described in this article:

- Create staging buffers with `usage = MAP_WRITE | COPY_SRC` and `mappedAtCreation = true`.
- After copying a staging buffer's contents to the destination, call `mapAsync()` with a callback to reclaim the buffer.

In this setup the buffer size is fixed, let's say 128MB. Assuming the app stays within that budget, the staging ring stabilizes at size 3, so the memory overhead is tolerable, and this approach usually results in faster transfers. This is indeed the case when enough data is streamed to fill the staging buffer. I've measured gains of 10-40% vs. `writeBuffer()` in heavy workloads.

The problem arises when the amount of data streamed varies per frame. A stationary scene in a game may have little data to transfer, in which case `writeBuffer()` becomes faster, while the mapping version still incurs the cost of copying the entire staging buffer. Why?

- At the time of `mapAsync()`, the app doesn't know how much data it will put in the staging buffer, so it requests the entire range.
- At `unmap()`, the GPU process has to copy the data to its destination, and it uses the range from `mapAsync()`.
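The staging ring can be sketched as follows. The GPU specifics are stubbed out: `mapAsync()` completion is modeled as a fixed number of frames of latency (the class name and latency model are illustrative assumptions), which makes the claim that the ring stabilizes at a small fixed size checkable:

```typescript
// Hypothetical staging-buffer ring. Buffers are identified by integer ids;
// mapAsync() completion is modeled as `mapLatency` frames of delay.
class StagingRing {
  private ready: number[] = [];     // ids of mapped, reusable buffers
  private count = 0;                // total buffers ever created
  private inFlight: { id: number; framesLeft: number }[] = [];

  constructor(private mapLatency: number) {}

  // Take a mapped staging buffer, creating a new one only if none is ready.
  acquire(): number {
    return this.ready.length > 0 ? this.ready.pop()! : this.count++;
  }

  // Submit the copy, then mapAsync(); completion lands on a future frame.
  submit(id: number): void {
    this.inFlight.push({ id, framesLeft: this.mapLatency });
  }

  // End of frame: age in-flight mapAsync() calls; completed ones become ready.
  tick(): void {
    for (const f of this.inFlight) f.framesLeft--;
    for (const f of this.inFlight.filter(f => f.framesLeft <= 0)) {
      this.ready.push(f.id);
    }
    this.inFlight = this.inFlight.filter(f => f.framesLeft > 0);
  }

  size(): number { return this.count; }
}
```

With a map latency of three frames, the ring creates three buffers during the first frames and then reuses them indefinitely, matching the "stabilizes at size 3" observation above.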
### Proposal

Could we trim this final copy to the actually written range? Since the amount of data isn't known at the time of `mapAsync()` or even `getMappedRange()`, I propose an API tweak like `unmap(optional dirtyOffset, optional dirtySize)`. Alternatively, a method like `addDirtyRange(offset, size)` could support multiple ranges.

### Alternatives considered
- In theory, written ranges can be inferred from `getMappedRange(offset, size)` as mentioned in the WebGPU Explainer. However, Chrome doesn't seem to perform this optimization. More importantly, it's inconvenient for the app to precalculate the minimal range.
- The app can fall back to `writeBuffer()` for small transfers or use smaller staging buffers.