llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

[AVX-512] LLVM is inconsistent on whether to move a routine to k-registers or not #103498

Open · Validark opened this issue 4 weeks ago

Validark commented 4 weeks ago

I had this function:

```zig
const std = @import("std");

// `vpcompress` is a helper defined elsewhere in this file (not shown); judging
// by the generated code below, it lowers to the AVX-512 VPCOMPRESSB instruction.
export fn run_lengths1(bitstr: u64) @Vector(32, u8) {
    const ends = bitstr & ~(bitstr >> 1);
    const starts = bitstr & ~(bitstr << 1);

    const iota = std.simd.iota(u8, 64);

    // We narrow to just the first 32 values because 32 is the maximum popCount of ends/starts
    const end_positions = std.simd.extract(vpcompress(iota, undefined, ends), 0, 32);
    const start_positions = std.simd.extract(vpcompress(iota, undefined, starts), 0, 32);

    return end_positions - start_positions + @as(@Vector(32, u8), @splat(1));
}
```
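
To make the intended semantics concrete, here is a small worked example (mine, not from the original report). For `bitstr = 0b0111_0011` the runs of set bits cover bits 0..1 and bits 4..6, so the first two lanes of the result should be the run lengths 2 and 3:

```zig
// Hypothetical usage sketch, not part of the original issue. `ends` keeps the
// top bit of each run of 1s and `starts` keeps the bottom bit, so
// end_position - start_position + 1 is the length of that run.
test "run_lengths1 on 0b0111_0011" {
    const lengths = run_lengths1(0b0111_0011);
    try std.testing.expectEqual(@as(u8, 2), lengths[0]); // run over bits 0..1
    try std.testing.expectEqual(@as(u8, 3), lengths[1]); // run over bits 4..6
}
```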

`run_lengths1` translates to this for Zen 4:

```asm
.LCPI0_0:
        .byte   0
        ; ... <iota vector of [0,63] here>
run_lengths1:
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI0_0]
        kmovq   k0, rdi
        kshiftrq        k1, k0, 1
        kaddq   k2, k0, k0
        kandnq  k1, k1, k0
        kandnq  k2, k2, k0
        vpcompressb     zmm1 {k1} {z}, zmm0
        vpcompressb     zmm0 {k2} {z}, zmm0
        vpsubb  ymm0, ymm1, ymm0
        vpcmpeqd        ymm1, ymm1, ymm1
        vpsubb  ymm0, ymm0, ymm1
        ret
```

So far so good. It looks like LLVM moves `bitstr` over to a k register before executing the first two lines of my `run_lengths1` procedure. However, when I compute `bitstr` from two other arguments:

```zig
export fn run_lengths2(a: u64, b: u64) @Vector(32, u8) {
    return run_lengths1(a | b);
}
```

It decides to do the computation in regular registers, then move the two outputs into k registers separately:

```asm
.LCPI1_0:
        .byte   0
        ; ... <iota vector of [0,63] here>
run_lengths2:
        vmovdqa64       zmm0, zmmword ptr [rip + .LCPI1_0]
        or      rdi, rsi
        mov     rax, rdi
        shr     rax
        lea     rcx, [rdi + rdi]
        andn    rax, rax, rdi
        andn    rcx, rcx, rdi
        kmovq   k1, rax
        vpcompressb     zmm1 {k1} {z}, zmm0
        kmovq   k1, rcx
        vpcompressb     zmm0 {k1} {z}, zmm0
        vpsubb  ymm0, ymm1, ymm0
        vpcmpeqd        ymm1, ymm1, ymm1
        vpsubb  ymm0, ymm0, ymm1
        ret
```

[Godbolt link](https://zig.godbolt.org/z/vaM76r3G7)

This issue is about consistency. Why is LLVM not consistent in this decision?

Edit: removed incorrect information about Zen 5 latencies

topperc commented 4 weeks ago

Looks like the X86DomainReassignment pass failed on `run_lengths2` for some reason, but the debug output doesn't say anything.

llvmbot commented 4 weeks ago

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)

Validark commented 3 weeks ago

> Looks like the X86DomainReassignment pass failed on `run_lengths2` for some reason, but the debug output doesn't say anything.

I have a modified version of the code which takes a number of bitstrs, calculates `ends` and `starts` for each of them, and tracks the bitwise OR of all the `ends` and all the `starts`. LLVM auto-vectorizes just one of those: it pays the (heavy, and most likely not worthwhile) cost of moving the bitstrs over to a vector and does `bitstr & ~(bitstr >> 1)` but not `bitstr & ~(bitstr << 1)` in the vector (or vice versa). The other it does one-by-one in general-purpose registers, even though it has already paid the hefty cost of moving the data to a vector. Is that probably the same issue as this one, or should I open a new issue for that?
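
For reference, here is a minimal sketch of the kind of loop described above (my reconstruction, not the actual code; the function and variable names are made up):

```zig
// Hypothetical reconstruction of the "modified version" described above:
// compute ends/starts for each bitstr and accumulate their bitwise OR.
fn combined_ends_and_starts(bitstrs: []const u64) struct { ends: u64, starts: u64 } {
    var all_ends: u64 = 0;
    var all_starts: u64 = 0;
    for (bitstrs) |bitstr| {
        all_ends |= bitstr & ~(bitstr >> 1); // top bit of each run of 1s
        all_starts |= bitstr & ~(bitstr << 1); // bottom bit of each run of 1s
    }
    return .{ .ends = all_ends, .starts = all_starts };
}
```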

RKSimon commented 3 weeks ago

AFAICT X86DomainReassignment doesn't account for multiple uses of the mask intermediates, so the cost calculation just determines that moving two GPRs to kmask isn't worth it, but it makes that determination for each of the two kmask transfers separately, without realizing that they share a lot of the costs.

Validark commented 2 weeks ago

> AFAICT X86DomainReassignment doesn't account for multiple uses of the mask intermediates, so the cost calculation just determines that moving two GPRs to kmask isn't worth it, but it makes that determination for each of the two kmask transfers separately, without realizing that they share a lot of the costs.

I opened an issue for this here, although this particular example might be a little tricky: https://github.com/llvm/llvm-project/issues/105763