Open qedawkins opened 1 week ago
Here is the llvm IR for the above example: https://gist.github.com/qedawkins/acce2625d09bac6caa51f53a304df9fe
Disabling the LoadStoreVectorizerPass appears to fix the issue: https://github.com/llvm/llvm-project/blob/6fcea431eed78f75e8ddb48e074c0078b93c109f/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp#L1230
@qedawkins can you share the .rocmasm files generated with and without the pass? The instruction in the error-generating one probably has the same issue I found with the mixed_fma.
Problem Description
The following IR
With inputs generated using the following numpy commands
Produces correct results on gfx1100 and gfx942 using this compile + run command
and incorrect results when adding `--iree-codegen-llvmgpu-test-tile-and-fuse-vectorize=true` on this branch: https://github.com/iree-org/iree/pull/18474

Changing the llvm optimization level to `None` or `Less` produces correct results when using the above flag: https://github.com/iree-org/iree/blob/c6056d197230161ea1403e88b5b8784d34e071a2/compiler/plugins/target/ROCM/ROCMTarget.cpp#L466

Investigation
The IR generated immediately before lowering scf to control flow looks like the following:
(workgroup count is `[1, 1, 1]`, i.e. a single workgroup).

Where it is simply looping over the reduction dims of the `conv_2d` and accumulating. `%8` and `%9` are the loads for the image and filters respectively. In the above sample inputs, `%8` is always `1` (np.ones), while `%9` is broadcasted `[1, 2, 1]` along the innermost dim, so the only index that affects the loaded value is `%arg3`.

Note that switching the input to be `[2, 1, 1]` broadcasted from the innermost dim changes the output to `104` from `88`, and using `[1, 1, 2]` gives correct results, indicating that the load for `%arg3 = 1` somehow got replaced with a duplicate load of the first value. Additionally, this only reproduces incorrect results if the input channel dimension (`8` in this example) is >= 7. For smaller input channel dims this produces correct values.

Additionally, changing the input values for the image (`%8`) to be broadcasted `[1, 2, 1]` and making the filter (`%9`) uniform gives correct values, indicating that it is specifically the second load in this example that is getting mangled.
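As a rough sanity check on the duplicate-load theory, the `88` and `104` outputs can be run through a small fault model. This is only a sketch under assumptions that are not confirmed by the IR: the image is all ones, each `%arg3` value is visited the same number of times, and some number of the `%arg3 = 1` loads return the `%arg3 = 0` value instead. The names `faulty_sum`, `solve`, `loads_per_slot` and `bad_loads` are hypothetical and exist only for this illustration.

```python
# Hypothetical fault model, not derived from the compiler output:
# some loads that should read pattern[1] read pattern[0] instead.

def faulty_sum(pattern, loads_per_slot, bad_loads):
    """Accumulated value when the image is all ones.

    pattern        -- the three filter values selected by %arg3
    loads_per_slot -- how often each %arg3 value is visited (assumed equal)
    bad_loads      -- how many %arg3 == 1 loads return pattern[0] instead
    """
    correct = loads_per_slot * sum(pattern)
    return correct + bad_loads * (pattern[0] - pattern[1])


def solve(observed_121, observed_211):
    """Recover (correct_sum, bad_loads) from the two bad outputs.

    Under the model, [1, 2, 1] gives correct - bad_loads and
    [2, 1, 1] gives correct + bad_loads.
    """
    bad_loads = (observed_211 - observed_121) // 2
    correct = (observed_211 + observed_121) // 2
    return correct, bad_loads


correct, bad_loads = solve(88, 104)
loads_per_slot = correct // 4  # every pattern used above sums to 4

print(correct, bad_loads, loads_per_slot)
for pattern in ([1, 2, 1], [2, 1, 1], [1, 1, 2]):
    print(pattern, faulty_sum(pattern, loads_per_slot, bad_loads))
```

If the model holds, both bad outputs are consistent with the same 8 loads being redirected, and `[1, 1, 2]` is unaffected because the first two values are equal. The count of 8 happens to match the input channel dimension in this example, which may or may not be meaningful.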