llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.32k stars 11.69k forks source link

[AArch64] Improve vector multiply by constant #54651

Open ilinpv opened 2 years ago

ilinpv commented 2 years ago

Multiplications by a constant can be synthesised using addition, subtraction and shifts to avoid expensive multiplication instructions. This can also be done on the vector side. AArch64 does not have a v2i64 vector multiply instruction, so cannot naively vectorizer this code:

#define N 256
void foo (unsigned long long *arr)
{
  for (int i = 0; i < N; i++)
    arr[i] *= 5;
}

GCC at -O3 produces vectorized code, by expanding the mul into shift+add:

foo(unsigned long*):
        add     x1, x0, 2048
.L2:
        ldr     q1, [x0]
        shl     v0.2d, v1.2d, 2
        add     v0.2d, v0.2d, v1.2d
        str     q0, [x0], 16
        cmp     x1, x0
        bne     .L2
        ret

LLVM currently does not vectorize (in the past it would vectorize and then scalarize, making a mess of the codegen):

foo:                                    // @foo
        mov     x8, xzr
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        add     x9, x0, x8
        add     x8, x8, #16
        cmp     x8, #2048
        ldp     x10, x11, [x9]
        add     x10, x10, x10, lsl #2
        add     x11, x11, x11, lsl #2
        stp     x10, x11, [x9]
        b.ne    .LBB0_1
        ret

https://godbolt.org/z/ssTvxKoGa

There are two parts to this ticket. The first is to teach the backend that a v2i64 multiple by a constant will be better expanded to a shift + mul than scalarizing it (https://godbolt.org/z/KMK71j5bh). decomposeMulByConstant should be able to help here.

The other part is to teach the cost model that v2i64 multiplies by certain constants are cheaper than the default cost of scalarizing the instruction. These costs should reflect the number of instructions the original multiply will be expanded into, based on the constant being multiplied.

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-aarch64

llvmbot commented 2 years ago

@llvm/issue-subscribers-good-first-issue

EugeneZelenko commented 2 years ago

@ps-19: See Label menu on Issues page.

ps-19 commented 2 years ago

occurrences of multiplication like "*=5 ", "*=10" in whole repository. Github & VScode searching is not giving accurate answer i tried to write an script but everything just messed up and doing manully it is almost impossible.

ilinpv commented 2 years ago

@ps-19 what is the purpose of searching this? It is just an example to reproduce the issue.

bcl5980 commented 2 years ago

It looks latest clang will full unroll the loop when N=256 I try to modify N to 512, it get

    movi    v0.4s, #5
    mov x8, xzr
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
    add x9, x0, x8
    add x8, x8, #32
    cmp x8, #2048
    ldp q1, q2, [x9]
    mul v1.4s, v1.4s, v0.4s
    mul v2.4s, v2.4s, v0.4s
    stp q1, q2, [x9]
    b.ne    .LBB0_1

Much better but I'm not sure if we need to fold mul * 5 to shift+add or not

ilinpv commented 2 years ago

Much better but I'm not sure if we need to fold mul * 5 to shift+add or not

I think so, in general shift+add are less expensive.

EugeneZelenko commented 2 years ago

See https://llvm.org/docs/Phabricator.html for patch review process.

davidbolvansky commented 2 years ago

What you want to search in repo?

This issue is about improving codegen, not about replacing multiplications with adds and shifts in codebase.

You can start with aarch64 backend. But I am not sure if this is really beginner griendly task.

bcl5980 commented 2 years ago

Enable decomposeMulByConstant can do what you want. But I'm not sure if we can general enable this option or not.

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 47f922f33249..2a55b49d4934 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -1137,6 +1137,11 @@ private:

   bool isConstantUnsignedBitfieldExtractLegal(unsigned Opc, LLT Ty1,
                                               LLT Ty2) const override;
+
+  bool decomposeMulByConstant(LLVMContext &Context, EVT VT,
+                              SDValue C) const override {
+    return true;
+  }
 };

 namespace AArch64 {
davemgreen commented 2 years ago

It looks latest clang will full unroll the loop when N=256 I try to modify N to 512, it get

  movi    v0.4s, #5
  mov x8, xzr
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
  add x9, x0, x8
  add x8, x8, #32
  cmp x8, #2048
  ldp q1, q2, [x9]
  mul v1.4s, v1.4s, v0.4s
  mul v2.4s, v2.4s, v0.4s
  stp q1, q2, [x9]
  b.ne    .LBB0_1

Much better but I'm not sure if we need to fold mul * 5 to shift+add or not

The type of a long changes between operating systems. On windows (LLP64) it will be an i32 - it will produce different code to linux (LP64) where it is a i64. https://godbolt.org/z/h1Y4nM9xa. In the original example the v2i64 that the vectorizer produces is scalarized as there is no 64bit multiply for NEON.

The cost is a little too low for the v2i64 mul - it shouldn't really be vectorizing (see D123007). But the point remains that if we do have a v2i64 multiply, using a shift/add will always be better than scalarizing into GPR registers.

Whether it was better or not for other types would be dependant on the relative performance of the various instructions, which might be cpu dependant.

MoZeWei commented 2 years ago

I am a little confused about the meaning of "trunk" in https://godbolt.org/z/axb1zrq5a. It seems that the "trunk" clang has fixed this but clang 11 hasn't. So if I want to start to learn llvm with this issue, I might better use llvm@11.0.0, right?

davemgreen commented 2 years ago

You can use a pragma to force vectorization in clang trunk: https://godbolt.org/z/3G3cf51be

MoZeWei commented 2 years ago

You can use a pragma to force vectorization in clang trunk: https://godbolt.org/z/3G3cf51be

Oh, fabulous! Thank you! I will try to dig in. Keep me updated.

simideveloper commented 1 year ago

hey is someone working on this issue or can I start working on it.

ilinpv commented 1 year ago

hey is someone working on this issue or can I start working on it.

Looks like no one, feel free to take it over

ipriyanshi1708 commented 1 year ago

Have someone fixed this issue? Or Can I work on this issue?

ilinpv commented 1 year ago

Have someone fixed this issue? Or Can I work on this issue?

The issue is reproduced on latest llvm trunk. Seems it is still open to work on.

ipriyanshi1708 commented 1 year ago

Have someone fixed this issue? Or Can I work on this issue?

The issue is reproduced on latest llvm trunk. Seems it is still open to work on.

Okay! Thanks

simideveloper commented 1 year ago

is it no more a beginner-friendly issue?

ilinpv commented 1 year ago

is it no more a beginner-friendly issue?

Beginner-friendly issues usually marked by label:"good first issue"

davemgreen commented 1 year ago

I've tried to update the description to help explain what is needed from this ticket. Some of the original codegen changed since it was first created, thanks to improvements in the costmodel.

ipriyanshi1708 commented 1 year ago

Hello! I have done some work on this issue but it is still not completed. Here it is: https://reviews.llvm.org/D159278