
Poor RV64 codegen for a sequence of additions with compile-time constants #69586

Open · michalt opened this issue 1 year ago

michalt commented 1 year ago

We recently ran into an issue with poor scalar RV64 codegen (https://godbolt.org/z/fGjGn7rr1). Interestingly, the issue disappears when the offsets are not compile-time constants.

EDIT: See https://github.com/llvm/llvm-project/issues/69586#issuecomment-1773773510 for a better repro.

The repro uses SiFive's VCIX instructions, because that's how we noticed the issue, but I don't think the VCIX instructions are actually particularly important here. (Although they might act as "barriers" to other optimizations/transformations, which might contribute to surfacing the actual problem.)
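For context, here is a minimal hypothetical sketch of the pattern (my own illustration, not the godbolt code, with plain loads standing in for the VCIX intrinsics):

```
define i32 @const_offsets(ptr %p) {
  ; Each access uses a large compile-time-constant element offset.
  %a = getelementptr inbounds i32, ptr %p, i64 2147483648
  %x = load i32, ptr %a
  %b = getelementptr inbounds i32, ptr %p, i64 2147483776
  %y = load i32, ptr %b
  %s = add i32 %x, %y
  ret i32 %s
}
```

Replacing the constants with a runtime `i64 %i` keeps the codegen tight, which matches the observation above.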

@topperc had an initial look at this:

I don’t think I’ve seen this specific issue before. At least not to this extreme.

The issue appears to be that the middle end aggressively constant folded the getelementptr instructions in IR into independent operations like this:

```
%6 = getelementptr inbounds i32, ptr %0, i64 128
%7 = tail call <vscale x 4 x i32> @llvm.riscv.vle.nxv4i32.i64(<vscale x 4 x i32> poison, ptr nonnull %6, i64 %4)
%8 = getelementptr inbounds i32, ptr %0, i64 256
```

They are no longer in a chain; each is independent, and some of them have large constants. It looks like at least some of the constants appear twice.
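To illustrate the folding, a hypothetical before/after (not taken from the actual repro):

```
; Chained form, as a front end might emit it:
%a = getelementptr inbounds i32, ptr %0, i64 128
%b = getelementptr inbounds i32, ptr %a, i64 128    ; 256 elements from %0

; After constant folding, each pointer addresses off the base directly:
%a.folded = getelementptr inbounds i32, ptr %0, i64 128
%b.folded = getelementptr inbounds i32, ptr %0, i64 256
```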

Each one of these getelementptrs is independently codegened to an add. If the constant is small enough, we’ll use an ADDI. Otherwise we’ll use an ADD and create the constant separately. Any constant that is used twice is selected to a sequence that materializes that constant, and is shared by all uses.
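A sketch of how that selection plays out (my own illustration, with offsets picked to show both cases; the i32 element index is scaled by 4, and the resulting byte offset decides the lowering):

```
%small = getelementptr inbounds i32, ptr %0, i64 128
; byte offset 512 fits a 12-bit immediate:  addi a1, a0, 512
%large = getelementptr inbounds i32, ptr %0, i64 2147483648
; byte offset 2^33 needs a materialized constant:
;   li   a2, 1
;   slli a2, a2, 33
;   add  a2, a0, a2
```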

I don’t think scheduling has any real effect here.

At register allocation, we figure out we don’t have enough registers for all the large constants, since each is used twice. Register allocation can try to rematerialize the constant-creation instructions to avoid the spill, but that currently only works if the constant can be created by a single instruction with no other inputs. All of these constants take two instructions, an LI+SLLI, so they fail to rematerialize.

One easyish fix would be to create a pseudo instruction that represents LI+SLLI as a single instruction until after register allocation. That would allow register allocation to rematerialize it as a single "instruction". Then we could expand it to LI+SLLI after regalloc and use the post-regalloc scheduler to move the LI and SLLI away from each other, allowing the scalar pipe to dual-issue them with other instructions. I think this is roughly how ARM handles constant materialization.

Another option, better for code size but more work to implement, is to write a new optimization pass that reconstructs a chain of ADDs so we can use a series of ADDI 512 instructions to construct each constant relative to the previous one. I thought the StraightLineStrengthReduce pass in LLVM might be able to do this. It’s not enabled by default, and my quick test didn’t show any improvement from using it.

Feel free to file a GitHub issue.

~Craig
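To make the second idea concrete, a hypothetical sketch (my own, not from the thread) of what a chain-reconstruction pass could produce:

```
; Each pointer is defined relative to the previous one (i32 elements,
; so index 128 = 512 bytes); every step selects to a single ADDI:
%p1 = getelementptr inbounds i32, ptr %p0, i64 128   ; addi a1, a0, 512
%p2 = getelementptr inbounds i32, ptr %p1, i64 128   ; addi a2, a1, 512
%p3 = getelementptr inbounds i32, ptr %p2, i64 128   ; addi a3, a2, 512
```

No large constant is ever materialized, so there is nothing to spill or rematerialize.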

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-risc-v

Author: Michal Terepeta (michalt)

dzaima commented 1 year ago

Simpler repro, also happening on armv8: https://godbolt.org/z/5aGa5zYj4 (and, with larger constants, on x86_64 too: https://godbolt.org/z/eva138MGn, which gets especially weird: it actually factors out the addition of 10000000000, but then still spills and reloads the small integer additions to it)
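I can't reproduce the linked code here, but the x86_64 behavior being described is roughly this shape (hypothetical sketch):

```
; The large constant is factored out into %base, but the small adds
; on top of it still get spilled and reloaded under register pressure.
%base = add i64 %x, 10000000000
%a = add i64 %base, 8
%b = add i64 %base, 16
%c = add i64 %base, 24
```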

topperc commented 1 year ago

I put up a patch (https://github.com/llvm/llvm-project/pull/69983) to make the constant expansion rematerializable and avoid the spills. It's not enabled by default in the patch, but it can be enabled via the command line for testing.

michalt commented 1 year ago

@topperc Cool, thanks! This does get rid of the spills/reloads, but AFAICS the generated code is still not nearly as good as when we avoid the compile-time constant.

Would this be something that would be taken care of by some subsequent pass? Or is your second suggestion, something similar to StraightLineStrengthReduce, our best bet for approaching the codegen quality we get with a runtime value?