Open fbarchard opened 2 years ago
@llvm/issue-subscribers-backend-arm
@lenary @davemgreen could you take a look? This is another miscompilation bug, probably related to #58327
The code doesn't seem optimal (the pattern gets split across basic blocks), but should now hopefully work correctly. Please let me know if this doesn't
vmov.16 d16[0], r3
leaves bits 16-31 unchanged (uninitialized), then vdup.32 d18, d16[0]
copies those uninitialized bits into odd-numbered words of d18
. I suspect vdup.32
should have been vdup.16
.
@EugeneZelenko could you re-open this issue? This is an active miscompilation bug
Sorry - does https://github.com/llvm/llvm-project/commit/fb76d2ce6c4ece3a3d0360b75e5b01773159c400 not fix the issue you are reporting? Changing the godbolt link above from clang 13 to trunk causes it to change vdup.32 d18, d16[0]
to vdup.16 d18, d16[0]
. Is something else broken?
Thanks, https://github.com/llvm/llvm-project/commit/fb76d2ce6c4ece3a3d0360b75e5b01773159c400 does seem to fix the miscompilation issue. The issue of efficiency of the generated code remains.
Sorry for false alarm, I missed the commit fix and thought the bug was closed without a fix.
vld1q_dup_u16 with -O2 in the reproducible above is generating 3 instructions with clang truck (15)
ldrh r3, [r3]
vmov.16 d16[0], r3
vdup.16 d18, d16[0] // inside loop
With -O0 it is
vld1.16 {d16[], d17[]}, [r0:16]
The code functions but less efficiently than it should. If the code fragment is changed to 64 bit, it works as expected, but this is part of a larger function that uses 128 bit Neon.
The code is using a temporary GPR (r3) and a temporary Neon register (d16) and adding a dup inside the inner loop.
LLVM AArch32 produces incorrect instruction for vld1q_dup_u16 intrinsic with optimization turned on Reproducible snippet: https://godbolt.org/z/rEPG1Tzof
vld1q_dup_u16(¶ms->max)
Built with -O0
Built with -O2