Abhishek-Varma closed this issue 1 year ago.
This used to work at some point and stopped working in the recent past. We have pinned to an earlier version as a workaround.
This is because we have a transposed linalg.generic fused together with linalg.matmul:
```mlir
%10 = linalg.matmul
        ins(%4, %5 : tensor<4096x512xf16>, tensor<512x512xf16>)
        outs(%9 : tensor<4096x512xf16>) -> tensor<4096x512xf16>
%11 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                         affine_map<(d0, d1) -> (d1)>,
                         affine_map<(d0, d1) -> (d1, d0)>],
        iterator_types = ["parallel", "parallel"]
      } ins(%10, %6 : tensor<4096x512xf16>, tensor<512xf16>) outs(%7 : tensor<512x4096xf16>) {
^bb0(%in: f16, %in_0: f16, %out: f16):
  %12 = arith.addf %in, %in_0 : f16
  linalg.yield %12 : f16
} -> tensor<512x4096xf16>
```
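In plain terms, the fused dispatch computes a matmul followed by a bias add whose result is written out transposed; because the output layout differs from the matmul's natural layout, the matmul result has to be materialized before the transposed write. A rough NumPy equivalent (just a sketch using the shapes from the IR above, not generated code) is:

```python
import numpy as np

# Shapes taken from the IR above: A is 4096x512, B is 512x512, bias is 512.
A = np.random.rand(4096, 512).astype(np.float16)
B = np.random.rand(512, 512).astype(np.float16)
bias = np.random.rand(512).astype(np.float16)

C = A @ B            # the linalg.matmul result, 4096x512 (the intermediate that needs a buffer)
out = (C + bias).T   # the linalg.generic: bias add along d1, written out transposed as 512x4096

assert out.shape == (512, 4096)
```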
So bufferization allocated a buffer from workgroup memory to hold the intermediate result for linalg.matmul, and that exceeds the total amount of allowed workgroup memory.
The issue is likely at a higher level. I'd be interested to know how we generate such dispatches in flow. Are you performing layout transformations or handling transposes in some way that could cause this?
ew, using a temporary buffer is really unfortunate - one shouldn't be needed here for a simple bias add.
So it was working until about a week ago, when we hit the issue. We have pinned torch-mlir to an older version to avoid it. Is the issue at a higher level in torch-mlir, or in the flow dialect?
So I used the linalg IR I get from torch-mlir to dump the dispatches using iree-compile and find the "culprit" dispatch. As @powderluv mentioned, we have pinned to an older version of torch-mlir to avoid this issue. There are no special layout transformations or handling of transposes on our side.
Can we confirm the IR difference between the two versions? And post both too?
The op count difference with OLD vs LATEST torch-mlir (one way to gather such counts is sketched after the lists):
At the torch level:
torch.aten.broadcast_to: 126 vs 120
torch.aten.mm: 1 vs 4
torch.aten.view: 81 vs 78
torch.prim.ListConstruct: 147 vs 146
At the linalg level:
linalg.batch_matmul: 5 vs 2
linalg.generic: 794 vs 790
linalg.matmul: 1 vs 4
linalg.yield: 844 vs 840
tensor.collapse_shape: 65 vs 66
tensor.empty: 39 vs 38
tensor.expand_shape: 126 vs 129
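Counts like the above can be gathered with a small script along these lines (a sketch: it only counts textual op mentions with an approximate regex, and the file arguments are whatever two IR dumps you want to compare):

```python
import re
import sys
from collections import Counter

def count_ops(path):
    """Count occurrences of dialect ops like 'linalg.matmul' in a textual IR dump."""
    text = open(path).read()
    # Approximate: count every 'dialect.op' mention for a few dialects of interest.
    # Good enough for a coarse OLD-vs-LATEST diff, not an exact parse.
    return Counter(re.findall(r"\b(?:torch|linalg|tensor|arith)\.[A-Za-z_.]+", text))

old_counts = count_ops(sys.argv[1])   # IR from the OLD torch-mlir
new_counts = count_ops(sys.argv[2])   # IR from the LATEST torch-mlir

for op in sorted(set(old_counts) | set(new_counts)):
    if old_counts[op] != new_counts[op]:
        print(f"{op}: {old_counts[op]} vs {new_counts[op]}")
```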
Elided Linalg IR which has the issue.
Elided Linalg IR which we have from the older version of torch-mlir.
The main difference I see is the three "extra" matmuls in the newer IR (%94, %97, and %100), each forming a matmul + expand_shape + generic op set.
I tried experimenting with different torch versions and found that the issue is specific to torch==2.0.0.dev20230228. If we keep the latest torch-mlir but pin to an older torch version, say torch==2.0.0.dev20230220 or torch==2.0.0.dev20230227, the issue does not appear.
There are just a handful of decompositions we use in our pipeline as can be seen here.
So I inspected the decompositions we mainly use in our pipeline, and the only "relevant" delta I see between torch==2.0.0.dev20230228 and torch==2.0.0.dev20230220 is "fix embedding_backward_dense". I tried reverting that one-line change to see if it has any effect, but to no avail.
Thanks for the full input IR. However, I cannot reproduce the issue. With e2151d3beb14171e05fc05e4be168668c1f150dc and tools/iree-compile --iree-hal-target-backends=vulkan --iree-vulkan-target-triple=rdna2-unknown-windows --compile-to=flow running on the new Linalg IR that is supposed to have the issue, I see that 1) the dispatch regions containing a fused matmul and elementwise op do not involve a transpose as described above, and 2) transpose linalg.generic ops are in their own dispatch regions. Please double-check that this is still a problem (maybe some transformation done on your side is causing it?).
I think this bug is magically fixed. I don't know what changed. We will revert the pin and keep an eye out and report back. Thank you for investigating.
Yeah, there is no magic :( I was testing via the WMMA pipeline, which worked. This failure only happens on the RDNA2/SIMT pipeline. Re-opening.
adding @MaheshRavishankar too for any guidance since we think it may be something at the flow level.
@yzhang93 / @qedawkins could this be because the tunings changed or are now invalid since @antiagainst tried on top of master?
Yup, confirmed that the tunings we apply cause this crash at runtime. Maybe we can have runtime checks for them? I don't know how the verifier let this go. Maybe we need to enhance the verifier to capture this failure too.
The verifier might not be considering fused ops when determining shared memory requirements. Also, I think it only verifies named matmuls (everything else just passes straight through), but I'd have to double-check the verifier to be sure (I'm away from my desk).
I'm sure previously both the tuned and untuned models had the failure. But maybe something has changed and the untuned model works fine now. I think it's the VAE model and we don't apply lowering configs to it. I'll check whether the Winograd transform caused the problem.
Not sure tuning is the problem; tuning just adjusts the tile sizes and such after seeing the dispatch region. The problem is having a fused transposed elementwise op in the dispatch region from the beginning, which causes bufferization to insert extra allocations in shared memory.
It would be beneficial to understand why we are forming such dispatch regions, e.g., by running with --mlir-print-ir-after-all to see how such a dispatch region is generated with the SHARK fork, as I cannot repro this with IREE top of tree. (I assume it's some combination of patterns causing this.) The solution would be either to avoid forming such dispatch regions (e.g., separating the transpose into its own dispatch region), or to teach bufferization to be smarter (not sure how feasible that is here).
Or actually, maybe I'm not using the proper command-line options to reproduce the issue from the full model. I was using tools/iree-compile --iree-hal-target-backends=vulkan --iree-vulkan-target-triple=rdna2-unknown-windows from the issue, but that's just for the single dispatch. I recall there are preprocessing steps. Let me know.
Here are our typical flags:
C:\Users\foo\AppData\Local\Temp\_MEI83882\iree\compiler\tools\..\_mlir_libs\iree-compile.exe - --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=C:\Users\foo\AppData\Local\Temp\_MEI83882\iree\compiler\tools\..\_mlir_libs\iree-lld.exe --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-windows
@antiagainst I just tested with the latest nightly IREE python package (https://github.com/openxla/iree/releases/tag/candidate-20230311.455) and the above dispatch still has the compilation error. And I confirmed it has nothing to do with tuning. I tested on my navi3 system with the following commands:
iree-compile module_forward_dispatch_28_vulkan_spirv_fb.mlir --iree-hal-target-backends=vulkan --iree-vulkan-target-triple=rdna2-unknown-linux -o test_28.vmfb
iree-run-module --module=test_28.vmfb --device=vulkan --function=forward_dispatch_28_matmul_4096x512x512
The above dispatch is from the VAE model, and the whole model fails with the same error. We use the preprocessing flags when compiling the whole model.
@yzhang93 can we please run with --mlir-print-ir-after-all too?
Here is the output of --mlir-print-ir-after-all
Sorry, I still cannot reproduce the issue. I used commit https://github.com/openxla/iree/commit/503d81229dcf53aa3f391866c3fa93231831b7bb (which is behind candidate-20230311.455) and the command line given in https://github.com/openxla/iree/issues/12523#issuecomment-1465059677: tools/iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-windows
on this input IR. This is what I get after compiling to Flow: https://gist.github.com/antiagainst/08758cde21ba798ac68d27b6016ddd67.
You can see that dispatch 28 is a transpose in its own dispatch region, not fused with linalg.matmul. Actually, if you search for affine_map<(d0, d1) -> (d1, d0)>, all such cases are in their own dispatch regions. OTOH, if you search for linalg.matmul, none of the linalg.generic ops fused with them have the transposed case shown in the original report. Yes, module_forward_dispatch_28_vulkan_spirv_fb.mlir alone is problematic, but I still don't know how it's produced.
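For anyone who wants to check a --compile-to=flow dump mechanically, a rough scan like the following (a sketch: the splitting and matching are purely textual and assume the flow.executable form in the gist above, and the one-file command-line argument is a placeholder) flags any executable that contains both a linalg.matmul and the transposed (d1, d0) indexing map:

```python
import re
import sys

# Crude scan of an IREE --compile-to=flow dump: flag executables that contain
# both a linalg.matmul and a transposed (d1, d0) indexing map.
text = open(sys.argv[1]).read()
# Heuristic split: start a new chunk at each `flow.executable`.
chunks = re.split(r"(?=flow\.executable\b)", text)
for chunk in chunks:
    name = re.search(r"flow\.executable\s+\S*\s*@(\S+)", chunk)
    if not name:
        continue
    has_matmul = "linalg.matmul" in chunk
    has_transposed_map = "affine_map<(d0, d1) -> (d1, d0)>" in chunk
    if has_matmul and has_transposed_map:
        print(f"{name.group(1)}: matmul fused with a transposed indexing map")
```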
The main VAE mlir is here
We could compile with iree-compile.exe -o vae.vmfb --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-windows
@Abhishek-Varma Could you please double check the new VAE mlir generated in SHARK? SHARK still fails with the same error, but @antiagainst and I cannot reproduce the error using the mlir you provided.
@antiagainst Could you please try this mlir?
Command lines:
iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-preprocessing-pad-linalg-ops{pad-size=32}))' vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o test.vmfb
iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16
The error output:
```
[ 1] native hal.executable.create:0 -
[ 0] bytecode module.__init:2106 vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:665:12
     at vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:18:3
```
OK, this seems like a codegen change that is breaking AMD but not NVIDIA. I compiled the same code for rdna2 and it runs fine on my 4090 card but fails on my AMD card.
(shark.venv) PS C:\g\SHARK\apps\stable_diffusion\web> iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-preprocessing-pad-linalg-ops{pad-size=32}))' C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o test.vmfb
(shark.venv) PS C:\g\SHARK\apps\stable_diffusion\web> iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan://1 --input=1x4x64x64xf16
D:\a\SHARK-Runtime\SHARK-Runtime\c\runtime\src\iree\hal\drivers\vulkan\native_executable.cc:157: UNAVAILABLE; VK_ERROR_INITIALIZATION_FAILED; while invoking native function hal.executable.create; while calling import;
[ 1] native hal.executable.create:0 -
[ 0] bytecode module.__init:2106 C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:665:12
at C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:18:3
(shark.venv) PS C:\g\SHARK\apps\stable_diffusion\web> iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan://0 --input=1x4x64x64xf16
2023-03-13T13:07:56-07:00
Running C:\g\shark\shark.venv\Lib\site-packages\iree\runtime\scripts\iree_benchmark_module\..\..\iree-benchmark-module
Run on (64 X 2720.07 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x32)
L1 Instruction 32 KiB (x32)
L2 Unified 512 KiB (x32)
L3 Unified 16384 KiB (x8)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 523 ms 469 ms 1 items_per_second=1.91174/s
My test system may have an older driver, so I am going to try with the latest one to make sure we are current. But this has happened on end users' systems with the latest drivers.
Both SCPC and LLPC seem to fail:
(shark.venv) PS C:\g\shark> rm *.vmfb
(shark.venv) PS C:\g\shark> iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-preprocessing-pad-linalg-ops{pad-size=32}))' C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o test.vmfb
(shark.venv) PS C:\g\shark> iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16
D:\a\SHARK-Runtime\SHARK-Runtime\c\runtime\src\iree\hal\drivers\vulkan\native_executable.cc:157: UNAVAILABLE; VK_ERROR_INITIALIZATION_FAILED; while invoking native function hal.executable.create; while calling import;
[ 1] native hal.executable.create:0 -
[ 0] bytecode module.__init:2106 C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:665:12
at C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:18:3
(shark.venv) PS C:\g\shark> rm *.vmfb
(shark.venv) PS C:\g\shark> $env:AMD_ENABLE_LLPC=0
(shark.venv) PS C:\g\shark> iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-preprocessing-pad-linalg-ops{pad-size=32}))' C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o test.vmfb
(shark.venv) PS C:\g\shark> iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16
D:\a\SHARK-Runtime\SHARK-Runtime\c\runtime\src\iree\hal\drivers\vulkan\native_executable.cc:157: UNKNOWN; VkResult=4294967283; while invoking native function hal.executable.create; while calling import;
[ 1] native hal.executable.create:0 -
[ 0] bytecode module.__init:2106 C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:665:12
at C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir:18:3
@antiagainst if this is an AMD driver issue, is there a temporary way to override this dispatch_28 creation until it is fixed? Also, should dispatch_28 not be formed for cards that expose less shared memory?
How much is the shared memory usage?
It seems like the tile sizes chosen (which determine the shared memory usage) are not accounting for the shared memory limit... So this is a backend issue.
@yzhang93 can we tune dispatch_28 for rdna2 so it doesn't crash?
Is the backend here the AMD Vulkan driver?
No, this should be the SPIR-V backend in IREE.
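To make the "tile sizes vs. shared memory" point concrete, the sketch below shows the kind of accounting a lowering-config selection or verifier would need to do for this dispatch. The tile sizes, the set of promoted buffers, and the 64 KiB limit are illustrative assumptions, not IREE's actual configuration logic:

```python
# Shared-memory budget sketch for a tiled f16 matmul whose result must also be
# staged in workgroup memory before a transposed elementwise write.
# All numbers are assumptions for illustration, not IREE's real configuration.
ELEM_BYTES = 2                 # f16
SHARED_MEM_LIMIT = 64 * 1024   # e.g. a 64 KiB Vulkan maxComputeSharedMemorySize

def shared_mem_bytes(tile_m, tile_n, tile_k):
    a_tile = tile_m * tile_k * ELEM_BYTES        # promoted LHS tile
    b_tile = tile_k * tile_n * ELEM_BYTES        # promoted RHS tile
    result_tile = tile_m * tile_n * ELEM_BYTES   # extra intermediate for the fused transpose
    return a_tile + b_tile + result_tile

for tile_m, tile_n, tile_k in [(32, 128, 32), (64, 256, 32), (128, 256, 64)]:
    used = shared_mem_bytes(tile_m, tile_n, tile_k)
    verdict = "fits" if used <= SHARED_MEM_LIMIT else "exceeds limit"
    print(f"tile {tile_m}x{tile_n}x{tile_k}: {used / 1024:.0f} KiB ({verdict})")
```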
Thanks for the repro steps in https://github.com/openxla/iree/issues/12523#issuecomment-1466602501. I've seen the issue and figured out what went wrong and put up a fix in #12627.
With the above, we won't fuse such cases. @powderluv or somebody else, if you can help verify that this works, that'd be nice.
Thank you. I can confirm it works OK:
(shark.venv) PS C:\g\shark> iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-preprocessing-pad-linalg-ops{pad-size=32}))' C:\Users\anush\Downloads\vae_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o test.vmfb
(shark.venv) PS C:\g\shark> iree-benchmark-module --module=test.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16
2023-03-14T00:10:14-07:00
Running C:\g\shark\shark.venv\Lib\site-packages\iree\runtime\scripts\iree_benchmark_module\..\..\iree-benchmark-module
Run on (32 X 4491.57 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 1024 KiB (x16)
L3 Unified 32768 KiB (x2)
--------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 792 ms 0.000 ms 1 items_per_second=1.26315/s
@antiagainst When the right fix lands, would we be back to the original memory usage? We are carrying this locally on SHARK-Runtime and some end users on 8GB cards are running out of VRAM.
Silly Q, but this issue is marked "Done" in status, but open in the IREE project (https://github.com/orgs/openxla/projects/13?pane=issue&itemId=22177652). Any idea why? (I can click "Close with Comment", but I guess I don't get the difference here.) Thanks!
> @antiagainst When the right fix lands, would we be back to the original memory usage? We are carrying this locally on SHARK-Runtime and some end users on 8GB cards are running out of VRAM.
I'm not sure this one would address memory usage issues. We were never able to handle such fusion cases before; we simply never hit them previously. So it's not that we've stopped fusing cases that used to be fused. The memory usage issue is likely a separate problem.
> Silly Q, but this issue is marked "Done" in status, but open in the IREE project (https://github.com/orgs/openxla/projects/13?pane=issue&itemId=22177652). Any idea why? (I can click "Close with Comment", but I guess I don't get the difference here.) Thanks!
Ha, interesting. This is not done, so I moved it back to "In Progress". The fix in https://github.com/openxla/iree/pull/12627 is not the long-term way to go. I'll spend some time doing it properly later.
Hi, double-checking on this P0 issue. More to say? OK to close or lower the priority? Thanks!
Moving this to a P1 since we have a workaround (for SHARK at least)
P1 is OK, but this is a release blocker, I think. I foolishly started talking about things in Discord, but cross-posting here.
Looking at https://github.com/openxla/iree/pull/12627, it seems like we have the option to drop a feature in order to fix the bug. I would advocate for that, or for hiding it behind a flag, so that we don't have to wait for a big rewrite to fix this issue.
It works on other backends, and it's just a can we've kicked down the road for a while...
If this is indeed a release blocker, then I think we need to revert the offending feature. This looks like we are miscompiling, and we have HEAD and unstable releases in a known-broken state.
I'm coming to fix this in the proper way next. The issue is triggered by some new IR patterns from torch-mlir which we didn't see before, so it's not that we have a regression: previous releases won't support it either. I don't want to block the release on my implementation, so I'm fine with rolling the release forward.
Got it, then I think this is not a release blocker. Thanks for clarifying (and for fixing :slightly_smiling_face: )
What happened?
On trying to pass the IR through iree-run-module, I get the following error. This takes place for --iree-vulkan-target-triple=rdna2-unknown-windows.
Steps to reproduce your issue
Download module_forward_dispatch_28_vulkan_spirv_fb.mlir.
What component(s) does this issue relate to?
Compiler, Runtime
Version information
No response
Additional context
No response