Open nokola opened 4 years ago
Seems related to #73701
Yes, potentially!
Also, copy-pasting another example from reddit, for a more common scenario than the one above: Vec::push.
One realistic function that has a similar pattern is Vec::push: https://godbolt.org/z/gRHihN
The fast path executed in the common case where size < capacity is the following:
```asm
push rbp
push r14
push rbx
mov r14d, esi
mov rbx, rdi
mov rcx, qword ptr [rdi + 16]
cmp rcx, qword ptr [rdi + 8]
jne .LBB0_17
...

.LBB0_17:
mov rax, qword ptr [rbx]
mov dword ptr [rax + 4*rcx], r14d
inc qword ptr [rbx + 16]
pop rbx
pop r14
pop rbp
ret
```

That is, the registers needed for reallocating the vector are pushed and popped even when the function only needs to check that there is sufficient capacity, write the value to the new slot, and increment the size.
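One way to see why this matters is to manually outline the rare path. The sketch below is my own illustration of the idea (not the standard-library source): the capacity check stays inline, while the reallocation, and the register saves it requires, is pushed into a separate `#[cold]` function.

```rust
// Sketch: split push into an inlined capacity check and an outlined,
// cold grow path, so the common case can avoid callee-saved registers.
// This is an illustration of the technique, not the std implementation.
#[inline]
pub fn push_fast(v: &mut Vec<u32>, value: u32) {
    if v.len() == v.capacity() {
        // Unlikely: the reallocation machinery (and its register
        // spills) lives behind this call.
        grow(v);
    }
    // Capacity is now guaranteed, so this push takes its fast path.
    v.push(value);
}

#[cold]
#[inline(never)]
fn grow(v: &mut Vec<u32>) {
    // Reserve space for at least one more element; the prologue/epilogue
    // cost is confined to this rarely-taken function.
    v.reserve(1);
}
```

Ideally the compiler would perform this split (shrink-wrapping) automatically, which is what the issue is asking for.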
You would hope that most cases of Vec::push would be inlined, and the (unlikely) reallocation might be a function call, but the function bar in my example demonstrates this is not the case. It calls push in a loop, and that call is not inlined.
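For readers who don't follow the godbolt link, the looping caller has roughly this shape; the body below is my reconstruction of `bar`, not the original source:

```rust
// Hypothetical reconstruction of the looping caller discussed above;
// the original `bar` is only available via the godbolt link.
pub fn bar(v: &mut Vec<u32>, n: u32) {
    for i in 0..n {
        // When this call is not inlined, every iteration pays the full
        // prologue/epilogue of push, even though reallocation is rare.
        v.push(i);
    }
}
```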
This should probably be reported to LLVM, as this optimization is not done for the C version of this function compiled with Clang either; see this godbolt. One interesting thing is that it is done by Clang at -O1 but not at higher optimization levels; maybe there is some conflict when the function is tail-call optimized. The optimization is done by GCC.
Nice find @andjo403! Do you have an LLVM bug account, and are you interested in submitting the bug? I went to https://bugs.llvm.org/enter_bug.cgi, but I'm still waiting for my account to be opened by staff. If anyone can submit the LLVM bug before me, please do!
I have no LLVM bug account.
Note: this is a synthetic benchmark; however, I believe fixing this has the potential to improve runtime performance across the board for function calls, because the push/pop-register optimization in the "early exit" case seems like low-hanging fruit. See the assembly analysis below.
I used the following code:
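The snippet itself is not preserved in this excerpt. Based on the description below (an early exit taken when n <= 1, hit exponentially often), a minimal Rust benchmark in that shape might look like the following; the function name and body are my assumptions, not the original code:

```rust
// Hypothetical benchmark in the shape described in this issue: a recursive
// function whose `n <= 1` early exit is taken exponentially often.
#[inline(never)]
fn fib(n: u64) -> u64 {
    if n <= 1 {
        // Early exit: ideally no callee-saved registers are
        // pushed/popped on this path.
        return n;
    }
    fib(n - 1) + fib(n - 2)
}

fn main() {
    // A large n makes the early-exit case dominate the call count.
    println!("{}", fib(30));
}
```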
Rust 1.44.1: compiler options: -C opt-level=3 -C lto=fat
TotalSeconds : 12.3482073 TotalMilliseconds : 12348.2073 (about 5% slower than Nim)
Nim 1.2.0: compiler options: cpp -d:release --passC:-fno-inline-small-functions
TotalSeconds : 11.7725467 TotalMilliseconds : 11772.5467
I expected to see this happen: the Rust-generated assembly (see below) would only save and restore registers when needed.
Instead, this happened: the Rust-generated assembly saves registers even in the "early exit" case, when `n <= 1`. Compared to Nim's assembly, we see that the registers are not saved/restored for the early-exit case. Note: given the exponential count of "early exit" calls, this makes the benchmark perform significantly slower. However, I suspect that even in real life we would benefit from an early-exit optimization.
Meta
rustc --version --verbose:
Analysis: generated assembly and likely reason for slowdown
For the assembly analysis I used https://rust.godbolt.org/ with the sample code and options above. Here's the relevant assembly and source: Rust:
Nim:
Analysis: