llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Ineffectual por with constant emitted for pshufb operand #106256

Closed: cvijdea-bd closed this issue 2 months ago

cvijdea-bd commented 2 months ago

Clang example: https://godbolt.org/z/ec4P4j78b (flags: -O3 -march=x86-64-v2). This is not Clang-specific; Rust nightly shows the same behaviour.

#include <immintrin.h>

extern "C" __m128i shuffle_or(__m128i bytes, __m128i idxs) {
    return _mm_shuffle_epi8(bytes, _mm_or_si128(idxs, _mm_set1_epi8(112)));
}

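The por of xmm1 with 112 (0b0111_0000) is a no-op and should be optimized out. pshufb reads only bits 0-3 (the source index) and bit 7 (the zeroing flag) of each mask byte and ignores bits 4-6 (counting from bit 0), which are exactly the bits that 112 sets. The generated code still contains the por: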

.LCPI0_0:
        .zero   16,112
shuffle_or:
        por     xmm1, xmmword ptr [rip + .LCPI0_0]
        pshufb  xmm0, xmm1
        ret
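To make the no-op claim concrete, here is a minimal scalar model of one output byte of 128-bit pshufb, assuming the SSSE3 semantics (bit 7 of a mask byte zeroes the result, otherwise bits 0-3 select a source byte). The helper names `pshufb_byte` and `or_is_noop` are hypothetical and not part of the issue. The check runs at compile time over all 256 mask-byte values.

```cpp
#include <cstdint>

// Scalar model of one output byte of 128-bit PSHUFB (SSSE3): if bit 7 of the
// control byte m is set the result is zero; otherwise it is the source byte
// selected by bits 0-3. Bits 4-6 of m are never read.
constexpr std::uint8_t pshufb_byte(const std::uint8_t (&src)[16], std::uint8_t m) {
    return (m & 0x80) ? std::uint8_t{0} : src[m & 0x0F];
}

// Brute-force check: does OR-ing the splatted constant c into every control
// byte leave the output unchanged? Each output byte depends only on its own
// control byte, so trying all 256 control values is exhaustive. The source
// bytes are distinct and nonzero, so equal outputs imply the same
// index/zeroing decision, which makes the result hold for any source vector.
constexpr bool or_is_noop(std::uint8_t c) {
    std::uint8_t src[16] = {};
    for (int i = 0; i < 16; ++i) {
        src[i] = static_cast<std::uint8_t>(0xA0 + i);
    }
    for (int m = 0; m < 256; ++m) {
        const auto mb = static_cast<std::uint8_t>(m);
        if (pshufb_byte(src, mb) != pshufb_byte(src, static_cast<std::uint8_t>(mb | c))) {
            return false;
        }
    }
    return true;
}

static_assert(or_is_noop(112), "0b0111_0000 only touches the ignored bits 4-6");
static_assert(!or_is_noop(128), "0x80 sets the zeroing bit");
static_assert(!or_is_noop(1), "bit 0 is part of the index");
```

Because the checks are `static_assert`s, the file fails to compile if the model disagrees with the claim; it needs C++14 or later for the loops inside `constexpr` functions.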

Writing _mm_shuffle_epi8(bytes, _mm_set1_epi8(127)) in the source emits a pshufb whose mask constant is 15 (127 with bits 4-6 cleared), so LLVM already knows these bits are ignored when the mask is a constant; it just fails to use that knowledge to drop the por here.
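Both cases can be expressed as one demanded-bits style check. The sketch below is a plain C++ model, not LLVM's actual code, and the names `kPshufbDemandedBits`, `or_is_removable` and `simplify_constant_mask` are hypothetical. It assumes that only bits 0-3 and bit 7 of each pshufb mask byte are demanded (0x8F): an OR whose constant sets no demanded bit can be removed, and a constant mask can have its undemanded bits cleared.

```cpp
#include <cstdint>

// Bits of a PSHUFB control byte that can influence the result:
// bits 0-3 (source index) and bit 7 (zero the destination byte).
constexpr std::uint8_t kPshufbDemandedBits = 0x8F;

// An OR with a splatted constant c only sets bits; if none of those bits are
// demanded, the OR cannot change the shuffle and may be dropped.
constexpr bool or_is_removable(std::uint8_t c) {
    return (c & kPshufbDemandedBits) == 0;
}

// A constant control byte can have its undemanded bits cleared without
// changing the shuffle, e.g. 127 (0b0111_1111) becomes 15 (0b0000_1111).
constexpr std::uint8_t simplify_constant_mask(std::uint8_t c) {
    return static_cast<std::uint8_t>(c & kPshufbDemandedBits);
}

static_assert(or_is_removable(112), "0x70 touches only bits 4-6");
static_assert(!or_is_removable(128), "0x80 sets the zeroing bit");
static_assert(simplify_constant_mask(127) == 15, "127 simplifies to 15");
```

The constant case (127 to 15) and the por case (OR with 112) are the same fact viewed from two sides: 112 lies entirely inside the bits that `simplify_constant_mask` would clear anyway.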

llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-x86

Author: Cristian Vîjdea (cvijdea-bd)

cvijdea-bd commented 2 months ago

Thanks for looking into this so quickly!