rust-lang / rust


Rust does not optimize register save/restore for early exit (slower than nim) #73849

Open nokola opened 4 years ago

nokola commented 4 years ago

Note: This is a synthetic benchmark, but I believe fixing it has the potential to improve runtime performance across the board for function calls, because the push/pop register optimization in the "early exit" case seems like low-hanging fruit. See the assembly analysis below.

I used the following code:

Rust 1.44.1: compiler options: -C opt-level=3 -C lto=fat

```rust
pub fn fib(n: u64) -> u64 {
  if n <= 1 { return 1 }
  fib(n - 1) + fib(n - 2)
}

fn main() {
  println!("{}", fib(48));
}
```

TotalSeconds : 12.3482073, TotalMilliseconds : 12348.2073 (about 5% slower than Nim)

Nim 1.2.0: compiler options: cpp -d:release --passC:-fno-inline-small-functions

```nim
proc fib(n: uint64): uint64 =
    if n <= 1: return 1
    return fib(n - 1) + fib(n - 2)

echo fib(48)
```

TotalSeconds : 11.7725467, TotalMilliseconds : 11772.5467

I expected to see this happen: the Rust-generated assembly (see below) should only save and restore registers when needed.

Instead, this happened: the Rust-generated assembly saves registers even in the "early exit" case, when n <= 1. In Nim's assembly, by contrast, registers are not saved/restored on the early-exit path.

Note: Given the exponential number of "early exit" calls in this benchmark, this makes it perform significantly slower. However, I suspect real-world code would also benefit from an early-exit optimization.
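To see why the early-exit path dominates, note that every call with n <= 1 contributes exactly 1 to the sum, so the return value of fib(n) also counts the early exits. A small sketch illustrating this (the counter is my addition, not part of the original benchmark):

```rust
// Count how often the early-exit path is taken: the result of fib(n)
// and the number of calls with n <= 1 are the same number, so for
// fib(48) the prologue/epilogue runs roughly 7.8 billion times.
fn fib_with_count(n: u64, early_exits: &mut u64) -> u64 {
    if n <= 1 {
        *early_exits += 1;
        return 1;
    }
    fib_with_count(n - 1, early_exits) + fib_with_count(n - 2, early_exits)
}

fn main() {
    let mut exits = 0;
    let result = fib_with_count(20, &mut exits);
    // Both are the number of leaves in the call tree.
    println!("fib(20) = {}, early exits = {}", result, exits);
}
```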

Meta

rustc --version --verbose:

```
rustc 1.44.1 (c7087fe00 2020-06-17)
binary: rustc
commit-hash: c7087fe00d2ba919df1d813c040a5d47e43b0fe7
commit-date: 2020-06-17
host: x86_64-pc-windows-msvc
release: 1.44.1
LLVM version: 9.0
```

Analysis: generated assembly and likely reason for slowdown

For the assembly analysis I used https://rust.godbolt.org/ with the sample code and options above. Here's the relevant assembly and source. Rust:


```asm
example::fib:
        push    r15
        push    r14
        push    rbx
        mov     r14d, 1    # function init

        cmp     rdi, 2      # if n <= 1
        jb      .LBB0_3     # return 1
        mov     rbx, rdi
        mov     r14d, 1
        mov     r15, qword ptr [rip + example::fib@GOTPCREL]
.LBB0_2:
        lea     rdi, [rbx - 1]
        call    r15
        add     rbx, -2
        add     r14, rax
        cmp     rbx, 1
        ja      .LBB0_2
.LBB0_3:
        mov     rax, r14
        pop     rbx
        pop     r14
        pop     r15
        ret
```

Nim:


```nim
proc fib(n: uint64): uint64 =
    if n <= 1: return 1
    return fib(n - 1) + fib(n - 2)

echo fib(48)
```
```asm
fib__Q9chlIZts06QqO9cyt9cUDmlg(unsigned long):
 cmp    rdi,0x1
 ja     20 <fib__Q9chlIZts06QqO9cyt9cUDmlg(unsigned long)+0x10>
 mov    eax,0x1
 ret
 nop    DWORD PTR [rax+0x0]
 push   r12
 lea    r12,[rdi-0x3]
 push   rbp
 lea    rbp,[rdi-0x1]
 sub    rdi,0x2
 push   rbx
 and    rdi,0xfffffffffffffffe
 xor    ebx,ebx
 sub    r12,rdi
 mov    rdi,rbp
 sub    rbp,0x2
 call   45 <fib__Q9chlIZts06QqO9cyt9cUDmlg(unsigned long)+0x35>
 add    rbx,rax
 cmp    rbp,r12
 jne    39 <fib__Q9chlIZts06QqO9cyt9cUDmlg(unsigned long)+0x29>
 lea    rax,[rbx+0x1]
 pop    rbx
 pop    rbp
 pop    r12
 ret
 nop    WORD PTR cs:[rax+rax*1+0x0]
```

Analysis:

  1. Rust always saves registers (push on function entry, pop on exit), even in cases that don't need them. In contrast, Nim's generated code does not.
  2. This may or may not be significant in real-life scenarios, but I suspect the optimization is important.
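As a point of comparison, the effect can be approximated by hand today by splitting the function so the early exit sits in a small always-inlined wrapper, keeping the register saves on the cold recursive path. This is only a sketch of a workaround, not a proposed fix, and fib_slow is a hypothetical helper name:

```rust
// Hypothetical manual "shrink-wrapping": the cheap check lives in an
// always-inlined wrapper, so the prologue of fib_slow (the register
// pushes) only runs when the recursive path is actually taken.
#[inline(always)]
pub fn fib(n: u64) -> u64 {
    if n <= 1 {
        return 1;
    }
    fib_slow(n)
}

#[inline(never)]
fn fib_slow(n: u64) -> u64 {
    fib(n - 1) + fib(n - 2)
}

fn main() {
    println!("{}", fib(10)); // prints 89
}
```

Whether this actually helps depends on the optimizer; ideally the compiler would perform the split itself.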
nbdd0121 commented 4 years ago

Seems related to #73701

nokola commented 4 years ago

> Seems related to #73701

Yes, potentially! Also, copy-pasting another example from reddit, showing a more common scenario than the one above: Vec::push.

One realistic function that has a similar pattern is Vec::push: https://godbolt.org/z/gRHihN

The fast path executed in the common case where size < capacity is the following:

```asm
    push    rbp
    push    r14
    push    rbx
    mov     r14d, esi
    mov     rbx, rdi
    mov     rcx, qword ptr [rdi + 16]
    cmp     rcx, qword ptr [rdi + 8]
    jne     .LBB0_17
    ...
```

```asm
.LBB0_17:
    mov     rax, qword ptr [rbx]
    mov     dword ptr [rax + 4*rcx], r14d
    inc     qword ptr [rbx + 16]
    pop     rbx
    pop     r14
    pop     rbp
    ret
```

That is, the registers needed for reallocating the vector are pushed/popped even when the function only needed to check that there is sufficient capacity, write the value to the new slot, and increment the size.

You would hope that most calls to Vec::push would be inlined, with the (unlikely) reallocation left as a function call, but the function bar in my example demonstrates this is not the case: it calls push in a loop, and that call is not inlined.
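For reference, a minimal version of the pattern described above (the function name bar follows the godbolt example; the exact body there may differ):

```rust
// A push loop where the common case (len < capacity) is hot and the
// reallocation path is cold, yet the single prologue saves registers
// for both paths. #[inline(never)] stands in for the non-inlined call
// observed in the godbolt output.
#[inline(never)]
pub fn bar(v: &mut Vec<u32>, n: u32) {
    for i in 0..n {
        v.push(i);
    }
}

fn main() {
    let mut v = Vec::new();
    bar(&mut v, 5);
    assert_eq!(v, vec![0, 1, 2, 3, 4]);
}
```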

andjo403 commented 4 years ago

This should probably be reported to LLVM, as the optimization is not done for the C version of this function compiled with Clang either; see this godbolt link. One interesting thing is that Clang does perform it at -O1 but not at higher optimization levels; maybe there is some conflict when the function is tail-call optimized. GCC does perform the optimization.

nokola commented 4 years ago

Nice find @andjo403! Do you have an LLVM bug account, and are you interested in submitting the bug? I went to https://bugs.llvm.org/enter_bug.cgi, but I'm waiting for my account to be opened by staff. If anyone can submit the LLVM bug before me, please do!

andjo403 commented 4 years ago

I have no LLVM bug account.