llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[x86-64] Broadcasting an element of a vector should not use `vpermb` or `vpshufb` #113396

Open Validark opened 2 hours ago

Validark commented 2 hours ago

I have the following Zig code:

export fn foo(x: @Vector(8, u8)) @Vector(64, u8) {
    return @splat(x[1]);
}

Here is the corresponding LLVM IR:

define dso_local <64 x i8> @foo(<8 x i8> %0) local_unnamed_addr {
Entry:
  %1 = shufflevector <8 x i8> %0, <8 x i8> poison, <64 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  ret <64 x i8> %1
}

Here is how it currently lowers when targeting Zen 5:

.LCPI0_1:
        .byte   1
foo:
        vpbroadcastb    zmm1, byte ptr [rip + .LCPI0_1]
        vpermb  zmm0, zmm1, zmm0
        ret

Here is how I think it should lower (shift element 1 down to byte 0, then broadcast from the register, with no constant load):

foo:
        vpsrlq  xmm0, xmm0, 8
        vpbroadcastb    zmm0, xmm0
        ret
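
To make the claimed equivalence easy to check, here is a minimal sketch in portable C that models the two proposed instructions on scalars and compares them with the semantics of the `shufflevector` above. All function names are hypothetical and exist only for this illustration. The models cover only the bytes that matter here (the low 64-bit lane of `xmm0`), and they say nothing about performance on any particular core.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack 8 bytes into a 64-bit value in little-endian order, i.e. the way
   the <8 x i8> argument sits in the low quadword of xmm0. */
uint64_t pack_le(const uint8_t x[8]) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)x[i] << (8 * i);
    return v;
}

/* Scalar model of `vpsrlq xmm0, xmm0, 8` on the low quadword:
   a logical right shift by 8 bits moves byte 1 into byte 0. */
uint64_t model_vpsrlq8(uint64_t low_qword) {
    return low_qword >> 8;
}

/* Scalar model of `vpbroadcastb zmm0, xmm0`: copy the lowest byte of the
   source into all 64 byte lanes of the destination. */
void model_vpbroadcastb64(uint8_t out[64], uint64_t low_qword) {
    memset(out, (uint8_t)(low_qword & 0xFF), 64);
}

/* Reference semantics of the IR: a shuffle whose 64 mask entries are all 1. */
void reference_splat_elem1(uint8_t out[64], const uint8_t x[8]) {
    for (int i = 0; i < 64; i++)
        out[i] = x[1];
}

/* Returns 1 if shift-then-broadcast matches the reference for this input. */
int sequences_agree(const uint8_t x[8]) {
    uint8_t expected[64], actual[64];
    reference_splat_elem1(expected, x);
    model_vpbroadcastb64(actual, model_vpsrlq8(pack_le(x)));
    return memcmp(expected, actual, 64) == 0;
}

/* Example use: every output lane should hold element 1 of the input. */
void check_example(void) {
    const uint8_t x[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    assert(sequences_agree(x));
}
```

The shift only needs to be correct for the low byte, since `vpbroadcastb` ignores everything above it.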

The same applies when broadcasting into an xmm register:

.LCPI0_0:
        .zero   16,1
foo:
        vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

I would much rather avoid loading the shuffle mask from memory:

foo:
        vpsrlq  xmm0, xmm0, 8
        vpbroadcastb    xmm0, xmm0
        ret
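
As a sanity check for the 128-bit case, the sketch below models `vpshufb` with the all-ones mask (`.zero 16,1`) and the proposed `vpsrlq` + `vpbroadcastb` pair in portable C, then compares their results. The function names are hypothetical and were made up for this illustration. The models are scalar approximations of the instruction semantics as I understand them, not anything taken from LLVM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit `vpshufb dst, src, mask`: each output byte is
   zero if the mask byte has its top bit set, otherwise it is the source
   byte selected by the low 4 bits of the mask byte. */
void model_vpshufb128(uint8_t dst[16], const uint8_t src[16],
                      const uint8_t mask[16]) {
    uint8_t tmp[16];
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    memcpy(dst, tmp, 16);
}

/* Scalar model of `vpsrlq xmm, xmm, 8`: each 64-bit lane is shifted right
   by 8 bits, so within a lane byte k+1 moves to byte k and the top byte
   becomes zero. */
void model_vpsrlq8_128(uint8_t dst[16], const uint8_t src[16]) {
    uint8_t tmp[16];
    for (int lane = 0; lane < 2; lane++) {
        for (int k = 0; k < 7; k++)
            tmp[8 * lane + k] = src[8 * lane + k + 1];
        tmp[8 * lane + 7] = 0;
    }
    memcpy(dst, tmp, 16);
}

/* Scalar model of `vpbroadcastb xmm, xmm`: replicate byte 0 into all 16 lanes. */
void model_vpbroadcastb128(uint8_t dst[16], const uint8_t src[16]) {
    uint8_t low = src[0];
    memset(dst, low, 16);
}

/* Returns 1 if the mask-based shuffle and the shift-then-broadcast pair
   produce the same 16 bytes for this register content. */
int lowerings_agree(const uint8_t xmm0[16]) {
    uint8_t mask[16], via_shuffle[16], via_shift[16];
    memset(mask, 1, 16); /* the constant emitted as `.zero 16,1` */
    model_vpshufb128(via_shuffle, xmm0, mask);
    model_vpsrlq8_128(via_shift, xmm0);
    model_vpbroadcastb128(via_shift, via_shift);
    return memcmp(via_shuffle, via_shift, 16) == 0;
}

/* Example use: both sequences should yield 16 copies of byte 1. */
void check_example128(void) {
    uint8_t reg[16];
    for (int i = 0; i < 16; i++)
        reg[i] = (uint8_t)(0xA0 + i);
    assert(lowerings_agree(reg));
}
```

Both paths read only byte 1 of the source, so the contents of the upper bytes of `xmm0` do not affect the comparison.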

llvmbot commented 2 hours ago

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)
