[AArch64] Two large shifts and a combine can be a single combine and shift

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

[AArch64] Two large shifts and a combine can be a single combine and shift #59502

Open SamTebbs33 opened 1 year ago

SamTebbs33 commented 1 year ago

https://godbolt.org/z/hfT7h7vjP

#include <arm_neon.h>

/*
        ushr    v0.4s, v0.4s, #20
        ushr    v1.4s, v1.4s, #20
        uzp1    v0.8h, v0.8h, v1.8h
*/
uint16x8_t foo(uint32x4_t a, uint32x4_t b) {
    a = vshrq_n_u32(a, 20);
    b = vshrq_n_u32(b, 20);
    return vcombine_u16(vmovn_u32(a), vmovn_u32(b));
}

/*
        uzp2    v0.8h, v0.8h, v1.8h
        ushr    v0.8h, v0.8h, #4
*/
uint16x8_t bar(uint32x4_t a, uint32x4_t b) {
    uint16x4_t a_u16 = vshrn_n_u32(a, 16);
    uint16x4_t b_u16 = vshrn_n_u32(b, 16);
    uint16x8_t r = vcombine_u16(a_u16, b_u16);
    return vshrq_n_u16(r, 4);
}

The two shifts and the combine in foo could be compiled to the same instructions as bar. GCC also lacks this optimisation.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-aarch64

Unique-Usman commented 1 year ago

Hi, the two can be combined using the function below:

#include <arm_neon.h>

uint16x8_t foo(uint32x4_t a, uint32x4_t b) {
    uint32x4_t shifted_a = vshrq_n_u32(a, 20); // right-shift each lane of a by 20
    uint32x4_t shifted_b = vshrq_n_u32(b, 20); // right-shift each lane of b by 20
    uint32x4x2_t combined = vzipq_u32(shifted_a, shifted_b); // interleave the lanes of a and b into a two-vector structure
    uint16x8_t result = vreinterpretq_u16_u32(combined.val[0]); // reinterpret the first interleaved vector as 16-bit lanes
    return result;
}

SamTebbs33 commented 1 year ago

The idea is for the compiler to emit the instructions in the comment above bar when given the function foo.

ayushi-8102 commented 1 year ago

@SamTebbs33 I think it can be done like this:

#include <arm_neon.h>

uint16x8_t foo(uint32x4_t a, uint32x4_t b) {
    uint16x8_t r = vcombine_u16(vshrn_n_u32(a, 16), vshrn_n_u32(b, 16));
    return vshrq_n_u16(r, 4);
}

It will emit the following instructions:

Is it correct?

ayushi-8102 commented 1 year ago

@SamTebbs33 Please assign this issue to me

Unique-Usman commented 1 year ago

@ayushi-8102, can we also work on this together?

ayushi-8102 commented 1 year ago

@Unique-Usman Actually, if my approach is correct, there is no need to work together, as it is nearly resolved.

SamTebbs33 commented 1 year ago

The idea is for the compiler to emit the same instructions for foo as it does for bar, not for us to rewrite the functions, so this is still unresolved.

Madhupatel08 commented 4 months ago

Hi @SamTebbs33, can we discuss this problem? First, let's break the question down:

foo performs a right shift by 20 bits on each 32-bit element of the input vectors, then truncates each element to 16 bits and combines the results into a single vector of 16-bit integers.
bar, in contrast, shifts each element right by 16 bits while narrowing it to 16 bits, combines the results, and then performs a right shift by 4 bits on each element of the combined vector.

The ask here is to generate the same output/instructions. What is meant by that?