AArch32 vld1q_dup_u16 produces dup.32 instead of dup.16 - Githubissues

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.24k stars 12.07k forks source link

AArch32 vld1q_dup_u16 produces dup.32 instead of dup.16 #58512

Open fbarchard opened 2 years ago

fbarchard commented 2 years ago

LLVM AArch32 produces incorrect instruction for vld1q_dup_u16 intrinsic with optimization turned on Reproducible snippet: https://godbolt.org/z/rEPG1Tzof

vld1q_dup_u16(&params->max)

Built with -O0

vld1.16 {d16[], d17[]}, [r0:16]

Built with -O2

ldrh    r3, [r3]
vmov.16 d16[0], r3
vdup.32 d18, d16[0]

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-arm

Maratyszcza commented 2 years ago

@lenary @davemgreen could you take a look? This is another miscompilation bug, probably related to #58327

davemgreen commented 2 years ago

The code doesn't seem optimal (the pattern gets split across basic blocks), but should now hopefully work correctly. Please let me know if this doesn't

Maratyszcza commented 2 years ago

vmov.16 d16[0], r3 leaves bits 16-31 unchanged (uninitialized), then vdup.32 d18, d16[0] copies those uninitialized bits into odd-numbered words of d18. I suspect vdup.32 should have been vdup.16.

Maratyszcza commented 2 years ago

@EugeneZelenko could you re-open this issue? This is an active miscompilation bug

davemgreen commented 2 years ago

Sorry - does https://github.com/llvm/llvm-project/commit/fb76d2ce6c4ece3a3d0360b75e5b01773159c400 not fix the issue you are reporting? Changing the godbolt link above from clang 13 to trunk causes it to change vdup.32 d18, d16[0] to vdup.16 d18, d16[0]. Is something else broken?

Maratyszcza commented 2 years ago

Thanks, https://github.com/llvm/llvm-project/commit/fb76d2ce6c4ece3a3d0360b75e5b01773159c400 does seem to fix the miscompilation issue. The issue of efficiency of the generated code remains.

Maratyszcza commented 2 years ago

Sorry for false alarm, I missed the commit fix and thought the bug was closed without a fix.

fbarchard commented 1 year ago

vld1q_dup_u16 with -O2 in the reproducible above is generating 3 instructions with clang truck (15)

        ldrh    r3, [r3]
        vmov.16 d16[0], r3
        vdup.16 d18, d16[0]  // inside loop

With -O0 it is

        vld1.16 {d16[], d17[]}, [r0:16]

The code functions but less efficiently than it should. If the code fragment is changed to 64 bit, it works as expected, but this is part of a larger function that uses 128 bit Neon.

The code is using a temporary GPR (r3) and a temporary Neon register (d16) and adding a dup inside the inner loop.