gpuweb / gpuweb

Where the GPU for the Web work happens!
http://webgpu.io
Other
4.68k stars 308 forks source link

Out-of-bounds clamping brings big perf regression on TFJS ML demos. #1202

Open qjia7 opened 3 years ago

qjia7 commented 3 years ago

Recently, the TFJS ML demos on webgpu backend have big perf dropping with robust buffer access(RBA) enabled in chrome. Here we extracted a very simple matrix multiplication webgpu example to show the performance impact. Below is the data we collected on different GPUs.

Benchmark MatMul(1024x1024)      
    Enable RBA (ms) Disable RBA (ms) --disable-dawn-robustness  
MacOS Radeon Pro 555X 6.594 5.454 iteration=400
  Intel Iris Plus(ICL) 8.022 5.137 iteration=400
Windows UHD630(CFL) 36.98 21.807 iteration=400
  NV GTX1080 Ti 0.626 0.468 iteration=3000

From above table, we can see that there are 17%~40% perf regressions on different GPUs. We can see the similar regressions on the ML models.

Currently, the RBA is implemented by inserting clamp for all array index accessing, which include storage buffer accessing, shared memory array accessing, local variable array accessing. For example: acc[innerRow][innerCol] = 0.0f; -> acc[clamp(innerRow, 0, 3)][clamp(innerCol, 0, 3)] = 0.0f;

Obviously, out-of-bounds clamping brings big perf overhead on various GPUs, specially on integrated GPUs. (Some related topics gfx-rs/naga#35 gfx-rs/naga#33 gfx-rs/naga#311 gfx-rs/naga#955)

Here are some of our findings/issues that we want to discuss:

Kangz commented 3 years ago

Thank you for posting this, a 20-40% perf regression for sandboxing is quite big. It is expected that we'd lose some performance but this is quite a lot so we should figure out if something can be done on the WebGPU / WGSL spec side to allow for sandboxing that's lighter weight. (or investigate the set of optimizations that would reclaim a good chunk of that performance)

kvark commented 3 years ago

It's great to have these numbers such early, thank you! The suggestions about D3D12 and robustness2 are in line with what we are going to do in wgpu/Gecko.

Will it be a problem if clamping all of bounds to the last index for shared memory and buffer? It means it’s possible that multiple threads write different values to the same memory which results undefined behaviors. Maybe we should discard out of bounds writing.

That's an interesting question! I don't think such writes would bear any more risk than other writes. There is nothing in WebGPU that can guarantee that your compute threads aren't racing to write to the same location in a storage buffer. These are the races we allow, since we basically have no other choice. Note that this is not Undefined Behavior but rather Undefined Result. WebGPU developers need to make sure their storage writes are properly organized to avoid that, and of course they better not try writing out-of-bounds.

Personally, I like d3d12 semantics of discards much more than clamping, and I'd love to reach a point where our prototype can enforce this behavior consistently on all the backing APIs.

A few follow-up questions to you:

  1. Can you clarify in which cases the RBA feature was used in Vulkan? I.e. was it used with "--disable-dawn-robustness"? Was it used without this param?
  2. In acc[innerRow][innerCol] = 0.0f;, how was this array defined? What are the innerRow and innerCol values? I'm wondering if these are coming from a loop somewhere, and we could deduce that the indexing is always in-bounds.
qjia7 commented 3 years ago
  1. Can you clarify in which cases the RBA feature was used in Vulkan? I.e. was it used with "--disable-dawn-robustness"? Was it used without this param?

Currently, RBA is enabled by default in chrome canary. You can pass '--disable-dawn-robustness' flag to disable RBA. I didn't test vulkan backend. But from our test on windows, SPIRV-Tools will insert clamp in all places where array indexing is used. I don't think it does any optimizations to deduce whether the indexing is always in-bounds so that clamping is not needed.

  1. In acc[innerRow][innerCol] = 0.0f;, how was this array defined? What are the innerRow and innerCol values? I'm wondering if these are coming from a loop somewhere, and we could deduce that the indexing is always in-bounds.

acc[innerRow][innerCol] = 0.0f; comes from my matmul example shader. You can download it (see the description part) and have a try or just check the shader part. You are right. They are coming from a loop. Agree with you that it’s not necessary to do clamp for innerRow and innerCol semantically since they are not possible out of bounds. But currently, these optimizations are not supported. For this matmul example, the time mainly consumed on shared memory index clamping since it's heavily used.

    float acc[4][4] = { { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f } };
    for (int innerRow = 0; innerRow < 4; innerRow++)
    {
        for (int innerCol = 0; innerCol < 4; innerCol++)
        {
            acc[innerRow][innerCol] = 0.0f;
        }
    }

becomes

    float acc[4][4] = { { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f }, { 0.0f, 0.0f, 0.0f, 0.0f } };
    for (int innerRow = 0; innerRow < 4; innerRow++)
    {
        for (int innerCol = 0; innerCol < 4; innerCol++)
        {
            acc[clamp(innerRow, 0, 3)][clamp(innerCol, 0, 3)] = 0.0f;
        }
    }

Attach the original matmul GLSL shader, the translated HLSL shader without RBA and translated HLSL shader with RBA in case you want to know more details.

Kangz commented 3 years ago
  1. Can you clarify in which cases the RBA feature was used in Vulkan? I.e. was it used with "--disable-dawn-robustness"? Was it used without this param?

That flag toggles Vulkan robustness off as well as all the shader instrumentation that does array clamping on all backends. Note that the numbers @qjia7 gave are on D3D12 and Metal though.

The suggestions about D3D12 and robustness2 are in line with what we are going to do in wgpu/Gecko.

I thought the group had required array clamping for all accesses? Or is that only for arrays that are not in uniform or storage memory?

kvark commented 3 years ago

I thought the group had required array clamping for all accesses? Or is that only for arrays that are not in uniform or storage memory?

No, as far as I'm aware. The group agreed to shape the API in a way that allows clamping for implementations that wish to do so. We've yet to see if it's the best way to handle RBA. This very issue puts this idea under question :).

Jiawei-Shao commented 3 years ago

There is nothing in WebGPU that can guarantee that your compute threads aren't racing to write to the same location in a storage buffer. These are the races we allow, since we basically have no other choice. Note that this is not Undefined Behavior but rather Undefined Result.

Hi @kvark, I have a question here: if we decide to allow that the last element of an array in SSBO or shared memory to be written into "undefined values", I wonder if it should also be allowed that we keep that value unchanged (the default behavior in HLSL), as "keeping value unchanged" can also be treated as one possibility of the "undefind values" here. What do you think?

kvark commented 3 years ago

@Jiawei-Shao as I said in https://github.com/gpuweb/gpuweb/issues/1202#issuecomment-721791182, my preference is to enforce D3D out-of-bounds semantics across WebGPU, which would mean the write out-of-bound is ignored. There is still more research and experiments to be done before settling on this approach (at least for us in Gecko/wgpu land).

qjia7 commented 3 years ago

my preference is to enforce D3D out-of-bounds semantics across WebGPU, which would mean the write out-of-bound is ignored

+1. Hope WGSL can consider this issue seriously. At least, the clamping without any optimization is unacceptable.

grorg commented 3 years ago

Discussed at the 2020-12-01 meeting.

kainino0x commented 3 years ago

Resolution: Tint needs to add some basic range analysis so we can test how much better the performance is - something that would at least work for the case above (where clamp(innerRow, 0, 3) is guarded by innerRow < 4 so the analysis is as simple as it gets - except for the issue of signed int).

kainino0x commented 3 years ago

It's possible to do the range analysis to understand that in these loops, the value can't go below 0, but for a first pass to understand performance, we could just assume array indices become unsigned, and do the simplest analysis.

Kangz commented 3 years ago

Range analysis should help with for accesses into fixed-size arrays in the shader. But isn't the clamping of the index for the unsized array in the storage buffer a large part of what makes the performance drop?

kvark commented 3 years ago

@Kangz do we have any numbers for that overhead (of unsized array clamping)? Since we are able to rely on D3D12 RBA and Vulkan robustness2, I see the unsized arrays much less of a problem.

Kangz commented 3 years ago

Oh, reading the minutes I thought it was decided to rely on clamping and not RBA. I don't have concerns if we're allowed to use RBA. We don't have numbers at this time but I opened a Dawn bug to have a perf test representative of TF.js so we can easily perf things with different options.

qjia7 commented 3 years ago

do we have any numbers for that overhead (of unsized array clamping)? Since we are able to rely on D3D12 RBA and Vulkan robustness2, I see the unsized arrays much less of a problem.

I ever did some experiment. For the attached matmul example, the time mainly consumed on shared memory index clamping.

kainino0x commented 3 years ago

To clarify, you mean mm_Asub/mm_Bsub in the attached shader? Note acc (mentioned above) is a local array, not shared memory.

qjia7 commented 3 years ago

To clarify, you mean mm_Asub/mm_Bsub in the attached shader? Note acc (mentioned above) is a local array, not shared memory.

Yes. I mean mm_Asub/mm_Bsub. In fact, the most heavy one is below snippet:

BCached[inner] = mm_Bsub[k][tileCol + inner];
... ...
ACached = mm_Asub[tileRow + innerRow][k];

I wrote a native d3d12 example with the same shader. Without any clamping, the time is 20.388 ms. Adding clamping only for above two sentences, the time becomes 33.426 ms. The only change is shown below:


         for (int inner = 0; inner < ColPerThread; inner++) {
-          BCached[inner] = mm_Bsub[k][tileCol + inner];
+          BCached[inner] = mm_Bsub[clamp(k, 0, 63)][clamp(tileCol + inner, 0, 63)];
         }

         for (int innerRow = 0; innerRow < RowPerThread; innerRow++) {
-          ACached = mm_Asub[tileRow + innerRow][k];
+          ACached = mm_Asub[clamp(tileRow + innerRow, 0, 63)][clamp(k, 0, 63)];
           for (int innerCol = 0; innerCol < ColPerThread; innerCol++) {
             acc[innerRow][innerCol] += ACached * BCached[innerCol];
           }
kvark commented 3 years ago

It would be useful to have an actual confirmation that the slowdown is due to the increased ALU cost (for clamp or min) and not due to something else that we are missing. This can likely be concluded from running the workload under the vendor-specific tools, such as AMD Radeon Profiler and NVidia ShaderPerf.

krogovin commented 3 years ago

It is not just the ALU that is harmful: it is the dependency thing that is harmful as well. The clamp adds a min and max which need to make their way through the ALU pipeline before they can be used. If the shader had some more ALU between computing the output of the clamp and the array access the shader compiler would be able to hide the instruction latency much better as well.

Other optimization strategies are also lost under some circumstances. As a toy example:

for(int k = begin; k != end; ++k) { A[k] += something; }

can be optimized to:

for (Aptr = A + begin, Aend = A + end; Aptr != Aend; ++Aptr) { *Aptr += something; }

though in GPU "Aptr" is really going to be some offset and the write is going to be a message to a port to write the value. Adding clamp() to access on A[k] kills the above optimization.

qjia7 commented 3 years ago

We have added the matmul vec4 version to tfjs. And the robustness impact for perf is not that significant anymore. However, for the matmul float version. This issue still exists. In above comment, I ever said that the shared memory index clamping is the factor that most affects performance. To further analysis it, we added a robustness perf test to dawn. Also paste the data we collected here. (Tested on Intel CFL with A[512, 512] x B[512, 512])

shader | enable robustness | disable robustness -- | -- | -- MatMulFloatOneDimSharedArray | 5383 us | 3105 us MatMulFloatTwoDimSharedArray | 4788 us | 2608 us MatMulVec4OneDimSharedArray | 3070 us | 1743 us MatMulVec4TwoDimSharedArray | 1840 us | 1802 us

From above data, we can see that 1) robustness has big impact for the one dimensional shared array, 2) robustness has little impact for MatMulVec4TwoDimSharedArray on CFL.

krogovin commented 3 years ago

It still has big impact on MatMulFloatTwoDimSharedArray. I suspect that the reason that MatMulVec4TwoDimSharedArray is not as heavily impact is because the vec4 and clamping ops cat get sort-of-pipelined (via thread switching on the EU's) because the ratio of clamps to floating point ops and read messages is 25% compared to MatMulFloatTwoDimSharedArray.

I am not too sure what is the exact real purpose to enforce clamping really. if one is trying to protect against GPU hang, this is the tiny issue because a bad end condition on a loop will make the GPU hang. I know that on Intel GPU's starting in Gen8, the MMU is per hardware context, so there is not an issue with a read from one context hitting memory from another for a bad read (or write) under the assumption that each WebGPU thingy is given a separate context (and this really should be the case strongly). I strongly suspect that both NVIDIA and AMD have similar bits as well.

If want to go the route of inserting clamping into all array access, that is still different than out-of-bound writes are ignored and out-of-bound reads give 0. Relying on a compiler tool chain (be it SPIR-V toolchain or drivers) to optimize out redundant clamping is not a reliable strategy for performance either.

qjia7 commented 3 years ago

It seems that there is no that big gap between MatMulVec4OneDimSharedArray and MatMulVec4TwoDimSharedArray using DXC instead of FXC. Tested on Intel TGL.

shader | enable robustness(FXC) | disable robustness(FXC) | enable robustness(DXC) | disable robustness(DXC) -- | -- | -- | -- | -- MatMulFloatOneDimSharedArray | 951.702 us | 519.177 us | 1109 us | 509.18 us MatMulFloatTwoDimSharedArray | 691.992 us | 537.087 us | 733.778 us | 491.924 us MatMulVec4OneDimSharedArray | 804.257 us | 402.68 us | 450.281 us | 398.872 us MatMulVec4TwoDimSharedArray | 421.287 us | 411.32 us | 427.052 us | 395.561 us

Kangz commented 3 years ago

@krogovin unfortunately we can't rely on each WebGPU device being a different MMU context. In many cases there is one GPU MMU context for per process which would require browsers to have one GPU process per GPUDevice. Also it's GPU MMUs aren't as tested as CPU ones and it's not clear we can rely on them. WebGPU will try to use robust access behavior when possible, but we need a fallback when it isn't present, like in Metal and for data in shared memory.

@qjia7 thanks a lot for the additional data. As we discussed in the meeting, it seems that the driver compiler is able to optimize out some of the clamping when it is able to prove that data is in-range, especially using range-value analysis (RVA). FXC seems to be able to do only simpler RVA so it can only optimize when indices in the shared memory array are simple (iterating over a fixed range), while DXC is able to deduce the range for the combined index in the 1D case.

This gives me great hope that with good compiler tooling, we should be able to optimize out most of the bounds check and not lost performance with them. However we'll need to provide good tooling so that developers can check that bounds check do indeed get optimized out.

krogovin commented 3 years ago

Hi,

When the purpose of clamping is to make sure that reads from "web-process"-A do not see the data from "web-process"-B that just means that browser implementation has placed the GPU usage on a common "GPU-process"="GPU HW context". If there are some issue with having many GPU HW-contexts (such as in the past of some mobile GPU's), then so be it. However, on desktop (and many mobile too), there is no reason to have very few GPU HW-context. Saying the "GPU's MMU's are not well tested" is borderline silliness. Using the GPU just won't work unless the MMU is working since that is what maps the GPU-address space to physical memory (indeed on Intel GPU's starting at Gen8 that address space is allocated by the driver underneath). This smells like that the clamping of indices is needed not so much for actual GPU implementations anymore, but rather the design decision of some web-engines having a single process that uses a single GPU HW-context for all GPU work across all "web-processes" instead of having a GPU HW-context per "web-process" that would mirror the already present "CPU-process" per "web-process".

Also the reason why the vec4 case turns out not-so-bad is not because of compiler optimizations, but because the read is 16-bytes wide per clamping, i.e. the read gives 4 values per-clamp where as the scalar case there is a clamp for each float. That clamping per-scalar read is much harder to hide via thread switching.

Kangz commented 3 years ago

This smells like that the clamping of indices is needed not so much for actual GPU implementations anymore, but rather the design decision of some web-engines having a single process that uses a single GPU HW-context for all GPU work

Yes that's the design decision of all Web engines, and changing it would be another multi-year effort. But also being in the same process is what allows WebGPU to "synchronously" share data with other GPU data in the page like images, videos etc. Anyways this is probably not the correct issue to challenge this fundamental point of browser architecture.

Also the reason why the vec4 case turns out not-so-bad is not because of compiler optimizations, but because the read is 16-bytes wide per clamping, i.e. the read gives 4 values per-clamp where as the scalar case there is a clamp for each float. That clamping per-scalar read is much harder to hide via thread switching.

Of course vec4 help reduce the clamping needed for the storage buffer. My comment was about the clamping of non-storage buffer variable and the difference of result of MatMulVec4OneDimSharedArray between DXC and FXC.

krogovin commented 3 years ago

All I can hope at this point is that the browser implementations take advantage of d3d12's guarantee and use VK_EXT_robustness2 whenever it is available. In addition, when that is not available, a hard decision needs to be made on out-of-bounds portability:

There is some subtly to this as well with regards to an array representing a range into a GPU buffer as well that likely needs to considered as well.

Kangz commented 3 years ago

This is being discussed in gfx-rs/naga#1722 but in general we want to rely on D3D12's behavior and Vulkan's robustAccess feature when available.

kainino0x commented 3 years ago

@qjia7 we chatted about this in the WGSL office hours today, and what we need to do is make sure Dawn/Tint are not enabling redundant bounds checks for accesses that are already guaranteed by D3D12 or Vulkan robustness. From a quick code inspection it appears that we are emitting redundant bounds checks.

Unfortunately I believe this will only help for global memory, and not for workgroup memory. Fortunately for workgroup memory, range-analysis compiler optimizations are more likely to work well. I think what we would ideally like to do is compare the two following HLSL shaders so that we can find out how much the optimizations could possibly help:

I don't know precisely how to achieve this, but maybe the perf tests you worked on in Dawn can be adapted for this. Would your team be able to look into it?

kainino0x commented 3 years ago

Corentin pointed to https://crbug.com/tint/779 which tracks being able to turn BoundArrayAccessors on and off for different storage types. It would be useful to see the performance on D3D12 with and without the redundant clamps, as well as on Vulkan with robustness only, with clamps only, and with both.

As a proxy for testing the hoisting performance in HLSL directly, I think this would work, regardless of tint:779:

  1. Manually save out the result of transforming the original WGSL using the BoundArrayAccessors transform, then (if not done in tint:779) manually remove the bounds checks that are unnecessary with robustness
  2. with BoundArrayAccessors disabled, compare performance of:

    • That transformed WGSL
    • Same, but manually modify the transformed WGSL to hoist the bounds checks

    (making sure to inspect the output HLSL to make sure it has the intended effect, and other WGSL transforms aren't interfering)

qjia7 commented 2 years ago

I raised this issue on GPU Web meeting 2021-09-13. We hope wgsl compiler can provide a special optimization path for shared workgroup memory (shmem) robust access to avoid inserting any additional clamping when the shmem index is not possible to exceed the array size. For example, use the local invocation id as the index and the work group size is equal to or smaller than shmem size. Our feedbacks are 1) suggest to discuss it in wgsl meeting; 2) could be a Tint / Naga enhancement.

cc @ben-clayton

ben-clayton commented 2 years ago

I raised this issue on GPU Web meeting 2021-09-13. We hope wgsl compiler can provide a special optimization path for shared workgroup memory (shmem) robust access to avoid inserting any additional clamping when the shmem index is not possible to exceed the array size. For example, use the local invocation id as the index and the work group size is equal to or smaller than shmem size. Our feedbacks are 1) suggest to discuss it in wgsl meeting; 2) could be a Tint / Naga enhancement.

cc @ben-clayton

This is on our TODO list, but given the size of the task, complexity and risk-to-security this is something that needs to be carefully planned before implementing. There are also a number of higher priority tasks for the Tint team right now, so I'm afraid it is unlikely that this is something that will be addressed in immediate future.

The request for simple switches for controlling robustness is far easier to implement, but still needs a thorougher investigation to ensure that the hardware really does prevent OOB accesses when enabling these switches. The Tint team can prioritize this feature, but it will also require Dawn work to correctly and safely use.

kvark commented 2 years ago

Please please don't call it "shmem". This term is well established for process-shared memory on CPU. Just call it "workgroup memory". I filed https://github.com/gfx-rs/wgpu/issues/4360 in Naga to track this.

kdashg commented 2 years ago
WGSL meeting minutes 2022-02-08 * ?: May want to come back after implementation experience * ?: To start, rely on implementations to try to fix this (bounds check elision etc) * MM: What could we even do from a spec standpoint? * JB: Can add things like iterators/foreach that don’t need bounds checks (like Rust) * MM: Do we need this for v1? Probably not?