Duplicate constants with arm neon mull2

uncleasm commented 2 years ago

When the same constant is needed both for low and high multiplications (or other low/high arithmetical operations?) the constant is suboptimally allocated both as an 8-byte and 16-byte variant, when the 16-byte variant would be enough.

uint8x16_t compare_ok(uint8x16_t x, uint8x16_t y, uint8x16_t X, uint8x16_t Y) {
    auto xx = vmull_u8(vget_low_u8(x), vget_low_u8(X));
    xx = vmlsl_u8(xx, vget_low_u8(y), vget_low_u8(Y));

    auto XX = vmull_u8(vget_high_u8(x), vget_high_u8(X));
    XX = vmlsl_u8(XX, vget_high_u8(y), vget_high_u8(Y));

    auto top_byte = vuzpq_u8(vreinterpretq_u8_u16(xx), vreinterpretq_u8_u16(XX)).val[1];
    return vshrq_n_u8(top_byte, 7);
}
        // optimal usage of registers
        umull   v4.8h, v0.8b, v2.8b
        umull2  v0.8h, v0.16b, v2.16b
        umlsl   v4.8h, v1.8b, v3.8b
        umlsl2  v0.8h, v1.16b, v3.16b
        uzp2    v0.16b, v4.16b, v0.16b
        ushr    v0.16b, v0.16b, #7

uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y) {
    return compare_ok(x,y, vdupq_n_u8(33), vdupq_n_u8(119));
}
        // same constant is allocated both to 8-byte and 16-byte registers
        movi    v2.8b, #33
        movi    v5.8b, #119
        movi    v3.16b, #33
        movi    v4.16b, #119
        umull   v2.8h, v0.8b, v2.8b
        umull2  v0.8h, v0.16b, v3.16b
        umlsl   v2.8h, v1.8b, v5.8b
        umlsl2  v0.8h, v1.16b, v4.16b
        uzp2    v0.16b, v2.16b, v0.16b
        ushr    v0.16b, v0.16b, #7

uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y, uint8x8_t coeffs) {
    return compare_ok(x,y, vdupq_lane_u8(coeffs, 0), vdupq_lane_u8(coeffs, 1));
}
        // same constant is allocated both to 8-byte and 16-byte registers
        dup     v3.8b, v2.b[0]
        dup     v4.16b, v2.b[0]
        dup     v5.8b, v2.b[1]
        dup     v2.16b, v2.b[1]
        umull   v3.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v4.16b
        umlsl   v3.8h, v1.8b, v5.8b
        umlsl2  v0.8h, v1.16b, v2.16b
        uzp2    v0.16b, v3.16b, v0.16b
        ushr    v0.16b, v0.16b, #7

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-aarch64

sunho commented 2 years ago

I want to try fixing this.

sunho commented 2 years ago

This is happening because vdup_16(x) is folded to vdup_8(x) by clang. And, for umull2, vdup_8 is expanded to vdup_16 by tryCombineLongOpWithDup so that umull2 pattern can be used, but after it's expanded, existing vdup_8 is in tact.

This can be solved by expanding vdup_8 in umull_low to get_low(vdup_16(x)) when tryCombineLongOpWithDup is applied. But, I'm not sure how can I fit this into combine pass. I need to somehow access "sibling node."

sunho commented 2 years ago

Here's my patch:

https://reviews.llvm.org/D118563

llvm / llvm-project

Duplicate constants with arm neon mull2 #53261