RISC-V, clang, optimization: address calculation subtracts and adds the same value

Febbe commented 8 months ago

Having the code:

template<class To>
constexpr To copy_from_alligned_addr(uintptr_t src_addr) {
    return std::bit_cast<To>(*reinterpret_cast<To*>(src_addr)); // Relies on UB check binary output! clang and gnu may differ
}

constexpr void* builtin_stack_address(void* local = nullptr) noexcept { // Missing in clang :/
    auto sp = std::bit_cast<uintptr_t>(&local);  // Relies on UB check binary output! clang and gnu may differ
    sp = sp - 12;
    return std::bit_cast<void*>(sp);
}

static uintptr_t base = 0x50000; // Prevent consecutive calls of hack_me, to read unchanged values from the cache.
extern "C" void hack_me() noexcept {

    auto mybase = base;
    base += 0x100;
    auto active_ptr = mybase; 
    if (!copy_from_alligned_addr<bool>(active_ptr)) return;

    auto offset_addr = mybase + REGISTER_SIZE;
    auto data_addr = mybase + 2 * REGISTER_SIZE;
    auto sp_addr = std::bit_cast<uintptr_t>(builtin_stack_address());
    auto ra_p = uintptr_t(intptr_t(sp_addr) + copy_from_alligned_addr<intptr_t>(offset_addr));
    auto data = copy_from_alligned_addr<void*>(data_addr);
    *reinterpret_cast<void**>(ra_p) = data; // UB, since no compiler object is created @ra_p, optimisation to prevent 4-8 sb writes, assumes same alignment 
}

produces

00000148 <hack_me>:
 148:   ff010113                add     sp,sp,-16
 14c:   000205b7                lui     a1,0x20
 150:   0005a503                lw      a0,0(a1) # 20000 <_ZL4base>
 154:   10050613                add     a2,a0,256
 158:   00c5a023                sw      a2,0(a1)
 15c:   00054583                lbu     a1,0(a0)
 160:   0015f593                and     a1,a1,1
 164:   00058c63                beqz    a1,17c <hack_me+0x34>
 168:   00452583                lw      a1,4(a0)
 16c:   00852503                lw      a0,8(a0)
 170:   00c10613                add     a2,sp,12
 174:   00b605b3                add     a1,a2,a1
 178:   fea5aa23                sw      a0,-12(a1)
 17c:   01010113                add     sp,sp,16
 180:   00008067                ret

At address 0x170 12 is added to sp and stored in a2 It is subtracted in 0x178 again. A whole instruction could be saved here:

170: 00c10613 add a2,sp,a1

I assume this can happen more often in many other cases.

llvmbot commented 8 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Fabian Keßler (Febbe)

Having the code: ``` template<class To> constexpr To copy_from_alligned_addr(uintptr_t src_addr) { return std::bit_cast<To>(*reinterpret_cast<To*>(src_addr)); // Relies on UB check binary output! clang and gnu may differ } constexpr void* builtin_stack_address(void* local = nullptr) noexcept { // Missing in clang :/ auto sp = std::bit_cast<uintptr_t>(&local); // Relies on UB check binary output! clang and gnu may differ sp = sp - 12; return std::bit_cast<void*>(sp); } static uintptr_t base = 0x50000; // Prevent consecutive calls of hack_me, to read unchanged values from the cache. extern "C" void hack_me() noexcept { auto mybase = base; base += 0x100; auto active_ptr = mybase; if (!copy_from_alligned_addr<bool>(active_ptr)) return; auto offset_addr = mybase + REGISTER_SIZE; auto data_addr = mybase + 2 * REGISTER_SIZE; auto sp_addr = std::bit_cast<uintptr_t>(builtin_stack_address()); auto ra_p = uintptr_t(intptr_t(sp_addr) + copy_from_alligned_addr<intptr_t>(offset_addr)); auto data = copy_from_alligned_addr<void*>(data_addr); *reinterpret_cast<void**>(ra_p) = data; // UB, since no compiler object is created @ra_p, optimisation to prevent 4-8 sb writes, assumes same alignment } ``` produces ``` 00000148 <hack_me>: 148: ff010113 add sp,sp,-16 14c: 000205b7 lui a1,0x20 150: 0005a503 lw a0,0(a1) # 20000 <_ZL4base> 154: 10050613 add a2,a0,256 158: 00c5a023 sw a2,0(a1) 15c: 00054583 lbu a1,0(a0) 160: 0015f593 and a1,a1,1 164: 00058c63 beqz a1,17c <hack_me+0x34> 168: 00452583 lw a1,4(a0) 16c: 00852503 lw a0,8(a0) 170: 00c10613 add a2,sp,12 174: 00b605b3 add a1,a2,a1 178: fea5aa23 sw a0,-12(a1) 17c: 01010113 add sp,sp,16 180: 00008067 ret ``` At address `0x170` `12` is added to `sp` and stored in a2 It is subtracted in `0x178` again. A whole instruction could be saved here: `170: 00c10613 add a2,sp,a1` I assume this can happen more often in many other cases.

topperc commented 8 months ago

I suspect the use of uintptr_t, bitcasts, and the UB prevented us from reasoning about the pointer arithmetic in a straightforward way. The 12 for the add a2,sp,12 was only known after register allocation and stack framelayout. There are no optimization passes that can remove redundant computation after that point.

Febbe commented 8 months ago

I suspect the use of uintptr_t, bitcasts, and the UB prevented us from reasoning about the pointer arithmetic in a straightforward way. The 12 for the add a2,sp,12 was only known after register allocation and stack framelayout. There are no optimization passes that can remove redundant computation after that point.

Isn't that a thing that should be done at LLVM IR level?

jrtc27 commented 8 months ago

I suspect the use of uintptr_t, bitcasts, and the UB prevented us from reasoning about the pointer arithmetic in a straightforward way. The 12 for the add a2,sp,12 was only known after register allocation and stack framelayout. There are no optimization passes that can remove redundant computation after that point.

Isn't that a thing that should be done at LLVM IR level?

Stack frame layout happens after (most of) instruction selection, LLVM IR is long gone.

(I say "most of" because it's a bit fuzzy; it happens after what LLVM calls ISel, but given pseudos and things we haven't quite decided on the exact instructions yet, so depends what definition you use)

topperc commented 8 months ago

This is what we had immediatley after isel. Note this was riscv64 because that's the toolchain I had available.

*** MachineFunction at end of ISel ***
# Machine code for function hack_me: IsSSA, TracksLiveness
Frame Objects:
  fi#0: size=8, align=8, at location [SP]

bb.0.entry:
  successors: %bb.2(0x30000000), %bb.1(0x50000000); %bb.2(37.50%), %bb.1(62.50%)

  %1:gpr = PseudoLLA @_ZL4base
  %0:gpr = LD %1:gpr, 0 :: (dereferenceable load (s64) from @_ZL4base, !tbaa !9) 
  %2:gpr = ADDI %0:gpr, 256
  SD killed %2:gpr, %1:gpr, 0 :: (store (s64) into @_ZL4base, !tbaa !9)
  %3:gpr = LBU %0:gpr, 0 :: (load (s8) from %ir.1, !tbaa !13, !range !14)
  BEQ killed %3:gpr, $x0, %bb.2
  PseudoBR %bb.1

bb.1.if.end:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)

  LIFETIME_START %stack.0.local.addr.i
  LIFETIME_END %stack.0.local.addr.i
  %4:gpr = LD %0:gpr, 8 :: (load (s64) from %ir.4, !tbaa !13)
  %5:gpr = ADDI %stack.0.local.addr.i, 0
  %6:gpr = ADD killed %5:gpr, killed %4:gpr
  %7:gpr = LD %0:gpr, 16 :: (load (s64) from %ir.6, !tbaa !13)
  SD killed %7:gpr, killed %6:gpr, -12 :: (store (s64) into %ir.8, !tbaa !16)

bb.2.cleanup:                                                                    
; predecessors: %bb.0, %bb.1                                                     

  PseudoRET                                                                      

# End machine code for function hack_me

The %5:gpr = ADDI %stack.0.local.addr.i, 0 means replace this with the computation of the address of %stack.0.local.addr.i when we eventually know it. And we can't know it until after register allocation after any spill slots are created.

Febbe commented 8 months ago

Very interesting, is there even any benefit of adding an optimization pass after the stack frame layout / register allocation?

Btw. I found a solution I dismissed 6h ago for a now unknown reason, and it does not mess around with register calculation:

static void* asm_builtin_stack_address() {
    void* res;
    asm("mv %0,sp":"=r" (res):);
    return res;
}

I think it was not inlined for some reason and produced the sp of asm_builtin_stack_address (sp-16).

llvm / llvm-project

RISC-V, clang, optimization: address calculation subtracts and adds the same value #82691