[Aarch64] `clz` on a vector of 2 x u64 should be better optimized

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.


This code:

```zig
export fn clz(x: @Vector(2, u64)) @Vector(2, u64) {
    return @clz(x);
}
```

Gives me this emit for the Apple M3:

```asm
clz:
        ushr    v1.2d, v0.2d, #1
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #2
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #4
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #8
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #16
        orr     v0.16b, v0.16b, v1.16b
        ushr    v1.2d, v0.2d, #32
        orr     v0.16b, v0.16b, v1.16b
        mvn     v0.16b, v0.16b
        cnt     v0.16b, v0.16b
        uaddlp  v0.8h, v0.16b
        uaddlp  v0.4s, v0.8h
        uaddlp  v0.2d, v0.4s
        ret
```
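For readers skimming the assembly, here is a scalar sketch of what the sequence above appears to compute for a single 64-bit lane: smear the highest set bit downward with shift/or pairs, invert, then count the remaining set bits. The function name `clzViaPopCount` is made up for illustration and does not come from LLVM or Zig; this is a model of the emitted code, not the backend's actual lowering logic.

```zig
const std = @import("std");

// Hypothetical scalar model of one 64-bit lane of the emitted sequence.
fn clzViaPopCount(x: u64) u64 {
    var v = x;
    // ushr + orr pairs: copy the highest set bit into every lower position.
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    // mvn + cnt + uaddlp chain: the bits that are still zero are exactly
    // the leading zeros of the input, so invert and count them.
    return @popCount(~v);
}

test "smear-and-popcount agrees with @clz" {
    const samples = [_]u64{
        0,
        1,
        0xFFFF_FFFF,
        0x1_0000_0000,
        0x8000_0000_0000_0000,
        0xFFFF_FFFF_FFFF_FFFF,
    };
    for (samples) |x| {
        try std.testing.expectEqual(@as(u64, @clz(x)), clzViaPopCount(x));
    }
}
```

That is twelve shift/or instructions plus the popcount reduction for something the hardware can mostly do with one `clz` on narrower lanes.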

I think it should do something like this:

```zig
export fn clz2(x: @Vector(2, u64)) @Vector(2, u64) {
    const clz_with_u32_granularity: @Vector(4, u32) = @clz(@as(@Vector(4, u32), @bitCast(x)));
    const base = @as(@Vector(2, u64), @bitCast(clz_with_u32_granularity)) >> @splat(32);

    const mask = @select(u32, @as(@Vector(4, u32), @bitCast(base)) == @as(@Vector(4, u32), @splat(32)),
        clz_with_u32_granularity,
        @as(@Vector(4, u32), @splat(0)),
    );

    return base + @as(@Vector(2, u64), @bitCast(mask));
}
```

That gives us this assembly:

```asm
clz2:
        clz     v1.4s, v0.4s
        ushr    v0.2d, v1.2d, #32
        movi    v2.4s, #32
        cmeq    v0.4s, v0.4s, v2.4s
        and     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v1.2d, #32
        ret
```

Alternatively, the `usra` could probably have been an `add`.

Assuming I didn't mess anything up, Z3 seems to prove this is a correct transformation? https://alive2.llvm.org/ce/z/878QXU
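As a sanity check separate from the Alive2 proof, the per-lane logic of `clz2` can be modelled in scalar code: take the leading-zero count of the high 32 bits, and add the count of the low 32 bits only when the high half is entirely zero. The sketch below assumes a recent Zig (0.11 or later builtin syntax). The name `clz64FromClz32` and the xorshift sampling loop are illustrative choices, not part of the original report.

```zig
const std = @import("std");

// Hypothetical scalar model of one 64-bit lane of `clz2`.
fn clz64FromClz32(x: u64) u64 {
    const hi: u32 = @truncate(x >> 32);
    const lo: u32 = @truncate(x);
    const clz_hi: u64 = @clz(hi); // corresponds to `base`
    const clz_lo: u64 = @clz(lo);
    // Corresponds to `mask`: the low half only contributes when the
    // high half is all zeros, i.e. when its count is exactly 32.
    return if (clz_hi == 32) clz_hi + clz_lo else clz_hi;
}

test "two 32-bit clz results combine into a 64-bit clz" {
    try std.testing.expectEqual(@as(u64, 64), clz64FromClz32(0));

    // Xorshift-style sampling, shifted right by 0..63 so that every
    // possible leading-zero count shows up in the inputs.
    var x: u64 = 0x9E37_79B9_7F4A_7C15;
    var i: usize = 0;
    while (i < 1000) : (i += 1) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        const y = x >> @as(u6, @truncate(i));
        try std.testing.expectEqual(@as(u64, @clz(y)), clz64FromClz32(y));
    }
}
```

Sampling like this does not replace the formal check; it only makes the lane-level reasoning easy to re-run locally with `zig test`.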

@llvm/issue-subscribers-backend-aarch64

Author: Niles Salter (Validark)

Godbolt link for the `clz` example: https://zig.godbolt.org/z/4j538eG1P

[Aarch64] `clz` on a vector of 2 x u64 should be better optimized #109122