vmilea opened 4 months ago
I've created a live sample to illustrate the issue with mapping large buffers. The code is public and works on Chrome or Firefox Nightly. You can adjust the amount of data uploaded per frame, the buffer size, and the update mode: `queue.writeBuffer()`, mapping the whole buffer, or mapping only the range that will be written. While the last approach gives the best performance, it's really just a brittle hack forced by current API limitations.
Here are frame times on Chrome when uploading 20MB of data:
Device | Default | Map (min size - 20MB) | Map (whole buffer - 128MB) | Map (whole buffer - 256MB) |
---|---|---|---|---|
Intel 12900K + NVIDIA RTX 4090 | 11.3ms | 9.1ms | 14.5ms | 21.5ms |
Apple M1 | 17.5ms | 15.5ms | 20ms | 25ms |
As you can see, mapping is faster, so long as we keep the GPU process aware of the written range. Unfortunately the range is defined by `mapAsync()`, which as already explained isn't practical. The typical app wants to map the whole range for writing, then use as much as it needs. I think a simple tweak to the `unmap()` API would completely solve this problem.
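To make the tweak concrete, here is a sketch of the range arithmetic an `unmap(dirtyOffset, dirtySize)` overload could perform. This is entirely hypothetical (no such parameters exist in WebGPU today), written as a standalone function so the defaulting and validation can be checked in isolation:

```typescript
// Hypothetical semantics for unmap(optional dirtyOffset, optional dirtySize):
// compute the range the GPU process would actually copy on unmap.
// Omitting both arguments preserves today's behavior of copying the
// whole mapped range.
function resolveCopyRange(
  mappedSize: number,
  dirtyOffset?: number,
  dirtySize?: number
): [offset: number, size: number] {
  const offset = dirtyOffset ?? 0;
  const size = dirtySize ?? mappedSize - offset;
  if (offset < 0 || size < 0 || offset + size > mappedSize) {
    throw new RangeError("dirty range exceeds the mapped range");
  }
  return [offset, size];
}
```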
Thank you for the detailed investigation, it is extremely clear!
> In theory, written ranges can be inferred from `getMappedRange(offset, size)` as mentioned in the WebGPU Explainer. However, Chrome doesn't seem to perform this optimization. More importantly, it's inconvenient for the app to precalculate the minimal range.
That's definitely the intent, and Chromium should implement that optimization at some point; we just didn't get around to it yet. Could you detail why it would be hard to use `getMappedRange` for this optimization on the application side? I created a Chromium issue for this.
> That's definitely the intent and Chromium should implement that optimization at some point, we just didn't get around to it yet.
That makes the problem more tractable. Thanks for opening the issue!
> Could you detail why it would be hard to use getMappedRange for this optimization on the application side?
Because the caller of `getMappedRange()` may not know in advance how much data will be written. In cases like streaming world assets or dynamically generated content, it can be difficult to precalculate the exact amount. Here is an example.

Let's imagine a `StagingBuffer` that calls `getMappedRange()` once at the start of the frame, and can linearly allocate spans from the mapped `ArrayBuffer`. Scene elements then use it to write varying amounts of data. But now we have unintended coupling, with an extra scene traversal needed to precalculate the total size before `getMappedRange()`!
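A minimal sketch of such a linear allocator (class and method names are hypothetical). In a real app the `ArrayBuffer` would come from a single whole-buffer `getMappedRange()` call; here it is a plain allocation so the suballocation logic stands alone:

```typescript
// Hypothetical linear allocator over one mapped staging buffer.
// In a real app `mapped` would be the result of buffer.getMappedRange().
class StagingAllocator {
  private offset = 0;
  constructor(private mapped: ArrayBuffer) {}

  // Hands out a byte view; returns null when the budget is exhausted.
  alloc(byteLength: number): Uint8Array | null {
    const aligned = (this.offset + 3) & ~3; // keep 4-byte alignment
    if (aligned + byteLength > this.mapped.byteLength) return null;
    const view = new Uint8Array(this.mapped, aligned, byteLength);
    this.offset = aligned + byteLength;
    return view;
  }

  // Total bytes written this frame -- exactly the range we would like
  // to report to unmap(), but cannot with the current API.
  bytesUsed(): number { return this.offset; }
}
```

Scene elements call `alloc()` with whatever size they need; only at the end of the frame is `bytesUsed()` known, which is why precalculating it for `getMappedRange(0, size)` requires an extra traversal.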
Alternatively, a developer may be tempted to simply call `getMappedRange()` for each scene element, hoping for the same optimization without the extra steps. Unfortunately, this will backfire due to the overhead of tracking and validating the numerous range objects.
As a tangent, why does `getMappedRange()` exist? I consider it too rigid for tracking written ranges. Does it provide any other benefits compared to `mapAsync(mode, offset, size)` returning a `Promise<ArrayBuffer>` directly?
getMappedRange() is synchronous, allowing you to determine which ranges need to actually be mapped at the last moment. If you have some kind of streaming write, and so you know the offset but not the size, you can getMappedRange() in blocks until you reach the end of the stream. (getMappedRange() can be called multiple times for ranges within a single mapAsync(), as long as they don't overlap.)
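The block-by-block pattern can be sketched with a small helper (hypothetical; the real calls would be `buffer.getMappedRange(offset, size)`, one per returned range):

```typescript
// Compute the non-overlapping [offset, size] block ranges covering `total`
// bytes. Each range would be requested lazily via getMappedRange(offset, size)
// as a streaming write advances past a block boundary -- ranges must not
// overlap within a single mapAsync().
function blockRanges(total: number, blockSize: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let offset = 0; offset < total; offset += blockSize) {
    ranges.push([offset, Math.min(blockSize, total - offset)]);
  }
  return ranges;
}
```

For example, a 2500-byte stream with 1024-byte blocks yields three ranges, the last one trimmed to the remaining bytes.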
> Could we trim this final copy to the actually written range? Since the amount of data isn't known at the time of `mapAsync()` or even `getMappedRange()`, I propose an API tweak like `unmap(optional dirtyOffset, optional dirtySize)`. Alternatively, a method like `addDirtyRange(offset, size)` could support multiple ranges.
I don't think this is possible because it would require the browser to trust the webpage about which ranges it wrote. If the webpage writes a range and doesn't dirty it, it would result in undefined behavior for whether the data got written, or not written, or partially written, etc.
Thanks for the feedback.
> If you have some kind of streaming write, and so you know the offset but not the size, you can getMappedRange() in blocks until you reach the end of the stream. (getMappedRange() can be called multiple times for ranges within a single mapAsync(), as long as they don't overlap.)
I mentioned this earlier but didn't go into detail. Having to call `getMappedRange()` repeatedly makes the API chatty and adds overhead for validation and `ArrayBuffer` management. It depends on the WebGPU implementation being efficient with GC, detecting range overlap, and merging adjacent ranges. So, performance-wise, you can shoot yourself in the foot by getting, for example, 1000 mapped ranges for uniforms each frame. TBH, I don't know how plausible that is. The obvious workaround would be to get larger ranges and then suballocate.
> I don't think this is possible because it would require the browser to trust the webpage about which ranges it wrote. If the webpage writes a range and doesn't dirty it, it would result in undefined behavior for whether the data got written, or not written, or partially written, etc.
That's a fair point. If the goal of WebGPU is to have well-defined behavior, even if it's just stale data in a buffer after misusing the API, then `getMappedRange()` starts to make sense.
Given the mentioned constraints, and the expected browser optimizations, I think it's fine to close this issue.
> written ranges can be inferred from getMappedRange(offset, size) as mentioned in the WebGPU Explainer
Could we add an implementation note in the spec proper? I think it's of interest to both API users and implementers.
Yes, let's do that. Thanks Corentin for marking this copyediting so we can do that :)
subject was: "Allow specifying the written range when unmapping a buffer"
I noticed an issue while benchmarking CPU-to-GPU data streaming that can degrade the performance of memory-mapped transfers, making them worse than `queue.writeBuffer()`. While there are partial workarounds, a minor API change seems necessary.

### Problem
WebGPU supports buffer mapping to transfer data directly to / from the GPU. Applications can use a ring of staging buffers to continuously feed the GPU with new data each frame, as described in this article:

- Create staging buffers with `usage = MAP_WRITE | COPY_SRC` and `mappedAtCreation = true`.
- After copying a staging buffer's contents to the destination, call `mapAsync()` with a callback to reclaim the buffer.

In this setup the buffer size is fixed, let's say 128MB. Assuming the app stays within that budget, the staging ring stabilizes at size 3, so the memory overhead is tolerable, and this approach usually results in faster transfers. This is indeed the case when enough data is streamed to fill the staging buffer. I've measured gains of 10-40% vs. `writeBuffer()` in heavy workloads.

The problem arises when the amount of data streamed varies per frame. A stationary scene in a game may have little data to transfer, in which case `writeBuffer()` becomes faster, while the mapping version still incurs the cost of copying the entire staging buffer. Why?

- At the time of `mapAsync()`, the app doesn't know how much data it will put in the staging buffer, so it requests the entire range.
- At `unmap()`, the GPU process has to copy the data to its destination, and it uses the range from `mapAsync()`.
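The staging ring can be sketched as follows. The GPU specifics are stubbed out: `mapAsync()` completion is modeled as a fixed number of frames of latency (the class name and latency model are illustrative assumptions), which makes the claim that the ring stabilizes at a small fixed size checkable:

```typescript
// Hypothetical staging-buffer ring. Buffers are identified by integer ids;
// mapAsync() completion is modeled as `mapLatency` frames of delay.
class StagingRing {
  private ready: number[] = [];     // ids of mapped, reusable buffers
  private count = 0;                // total buffers ever created
  private inFlight: { id: number; framesLeft: number }[] = [];

  constructor(private mapLatency: number) {}

  // Take a mapped staging buffer, creating a new one only if none is ready.
  acquire(): number {
    return this.ready.length > 0 ? this.ready.pop()! : this.count++;
  }

  // Submit the copy, then mapAsync(); completion lands on a future frame.
  submit(id: number): void {
    this.inFlight.push({ id, framesLeft: this.mapLatency });
  }

  // End of frame: age in-flight mapAsync() calls; completed ones become ready.
  tick(): void {
    for (const f of this.inFlight) f.framesLeft--;
    for (const f of this.inFlight.filter(f => f.framesLeft <= 0)) {
      this.ready.push(f.id);
    }
    this.inFlight = this.inFlight.filter(f => f.framesLeft > 0);
  }

  size(): number { return this.count; }
}
```

With a map latency of three frames, the ring creates three buffers during the first frames and then reuses them indefinitely, matching the "stabilizes at size 3" observation above.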
### Proposal

Could we trim this final copy to the actually written range? Since the amount of data isn't known at the time of `mapAsync()` or even `getMappedRange()`, I propose an API tweak like `unmap(optional dirtyOffset, optional dirtySize)`. Alternatively, a method like `addDirtyRange(offset, size)` could support multiple ranges.

### Alternatives considered
- In theory, written ranges can be inferred from `getMappedRange(offset, size)` as mentioned in the WebGPU Explainer. However, Chrome doesn't seem to perform this optimization. More importantly, it's inconvenient for the app to precalculate the minimal range.
- The app can fall back to `writeBuffer()` for small transfers or use smaller staging buffers.