Loop unrolled on `x86_64` and `ppc64el` but not on `aarch64`, leading to poor performance

I tried this code:

#[inline(never)] // so that we can easily view the assembly
fn fibonacci_vec(length: usize) -> Vec<u64> {
    let mut fib = vec![0; length];
    {
        // Unlike a `Vec`, a slice is not resizable
        let fib = fib.as_mut_slice();
        if length > 1 {
            fib[1] = 1;
        }
        if length > 2 {
            // The compiler now knows that `fib` is fixed-size
            // and we are iterating exactly up to its length
            for i in 2..fib.len() {
                fib[i] = fib[i-1] + fib[i-2]; // no more bounds checks!
            }
        }
    }

    fib
}

pub fn main() {
    // read the length at runtime so that the compiler can't just precompute Fibonacci
    let arg = std::env::args().nth(1).expect("Please specify length");
    let length: usize = arg.parse().expect("That's not a number!");
    // actually call the function we care about
    let fibonacci = fibonacci_vec(length);
    // and print the result so that the compiler doesn't remove it as dead code
    println!("{:?}", fibonacci.last());
}

I expected loop unrolling to work uniformly on all 64-bit little-endian targets.

But on aarch64 the loop with the line fib[i] = fib[i-1] + fib[i-2]; is not getting unrolled, leading to significantly worse performance.

Assembly

I've inspected the assembly attributed to the fib[i] = fib[i-1] + fib[i-2]; line with https://github.com/pacak/cargo-show-asm to confirm. You can reproduce the results using:

cargo install cargo-show-asm
git clone https://github.com/Shnatsel/bounds-check-cookbook
cd bounds-check-cookbook
cargo asm --rust --bin fibvec_clever_indexing fibonacci_vec

Results on rustc 1.66 stable:

x86_64: https://gist.github.com/Shnatsel/5ae855298d8f35e9a9a513b044d8e294
ppc64el: https://gist.github.com/Shnatsel/862cbed11b3d88fcea31db1633433b05
aarch64 (not unrolled): https://gist.github.com/Shnatsel/9fcd5b220b8852531e3b0298c29912f4

Benchmarks

I've benchmarked each implementation against a naive version with bounds checks. To reproduce:

cargo install hyperfine
git clone https://github.com/Shnatsel/bounds-check-cookbook
cd bounds-check-cookbook
cargo build --release
hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000'

x86_64

measured on a Zen+ desktop CPU

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000'

Benchmark 1: target/release/fibvec_naive_indexing 1000000000
  Time (mean ± σ):      3.612 s ±  0.040 s    [User: 1.435 s, System: 2.132 s]
  Range (min … max):    3.546 s …  3.693 s    10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
  Time (mean ± σ):      3.133 s ±  0.019 s    [User: 0.995 s, System: 2.103 s]
  Range (min … max):    3.106 s …  3.163 s    10 runs

Summary
  'target/release/fibvec_clever_indexing 1000000000' ran
    1.15 ± 0.01 times faster than 'target/release/fibvec_naive_indexing 1000000000'

ppc64el

measured on a POWER9 system from IntegriCloud

massive improvement, correlated with unrolling the loop more than on x86_64

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000'
Benchmark 1: target/release/fibvec_naive_indexing 1000000000
  Time (mean ± σ):      2.853 s ±  0.052 s    [User: 2.149 s, System: 0.694 s]
  Range (min … max):    2.810 s …  2.980 s    10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
  Time (mean ± σ):      1.604 s ±  0.017 s    [User: 0.919 s, System: 0.681 s]
  Range (min … max):    1.586 s …  1.636 s    10 runs

Summary
  'target/release/fibvec_clever_indexing 1000000000' ran
    1.78 ± 0.04 times faster than 'target/release/fibvec_naive_indexing 1000000000'

Aarch64

Measured on an Ampere Altra VM from Google Cloud

very small improvement due to lack of loop unrolling

$ hyperfine 'target/release/fibvec_naive_indexing 1000000000' 'target/release/fibvec_clever_indexing 1000000000' 
Benchmark 1: target/release/fibvec_naive_indexing 1000000000
  Time (mean ± σ):      3.320 s ±  0.024 s    [User: 1.131 s, System: 2.179 s]
  Range (min … max):    3.263 s …  3.346 s    10 runs

Benchmark 2: target/release/fibvec_clever_indexing 1000000000
  Time (mean ± σ):      3.226 s ±  0.019 s    [User: 1.092 s, System: 2.127 s]
  Range (min … max):    3.209 s …  3.263 s    10 runs

Summary
  'target/release/fibvec_clever_indexing 1000000000' ran
    1.03 ± 0.01 times faster than 'target/release/fibvec_naive_indexing 1000000000'

Meta

Tested on rustc 1.66 stable installed from Rustup. All code built on the target natively, no cross-compilation.

The exact targets used are aarch64-unknown-linux-gnu, x86_64-unknown-linux-gnu and powerpc64le-unknown-linux-gnu.

To briefly summarise my current findings: the current aarch64 may tend to prevent loop unroll. This summary may not be accurate, so please allow me to elaborate further.

Corresponding changes on LLVM are: D97947, key changes are AArch64TargetTransformInfo.cpp.

Based on this change, we can perform loop unroll by adding a -mcpu, for example: rustc -C opt-level=3 -C target-cpu=cortex-a53 --target=aarch64-unknown-linux-gnu foo.rs.

A more visual example on godbolt.

No loop unroll: https://rust.godbolt.org/z/jYKxWfW51
loop unroll: https://rust.godbolt.org/z/8Pa561PTx

I am not an assembly expert and it may be easier for me to understand LLVM IR, later I use LLVM IR to show why it is relevant to the changes in D97947.

Here is a simpler example, the file fibonacci_vec.rs gives a similar result.

pub fn fibonacci_vec(fib: &mut [u64]) {
    // The compiler now knows that `fib` is fixed-size
    // and we are iterating exactly up to its length
    for i in 2..fib.len() {
        fib[i] = fib[i-1] + fib[i-2]; // no more bounds checks!
    }
}

The unoptimised IR can be obtained using the following command:

rustc --crate-type=lib --emit=llvm-ir -C opt-level=3 -C no-prepopulate-passes --target=aarch64-unknown-linux-gnu fibonacci_vec.rs

Next we can use opt alone to debug, using -opt-bisect-limit (https://rustc-dev-guide.rust-lang.org/backend/debugging.html) to get the IR before Loop Unroll.

; ModuleID = 'fibonacci_vec.ll'
source_filename = "fibonacci_vec.f1b28543-cgu.0"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

; Function Attrs: nonlazybind uwtable
define void @_ZN13fibonacci_vec13fibonacci_vec17hb6b1caf079e597beE(ptr noalias nocapture noundef nonnull align 8 %fib.0, i64 %fib.1) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %0 = icmp ugt i64 %fib.1, 2
  br i1 %0, label %bb7.preheader, label %bb4

bb7.preheader:                                    ; preds = %start
  %uglygep = getelementptr i8, ptr %fib.0, i64 8
  %load_initial = load i64, ptr %uglygep, align 8
  br label %bb7

bb4:                                              ; preds = %bb7, %start
  ret void

bb7:                                              ; preds = %bb7.preheader, %bb7
  %store_forwarded = phi i64 [ %load_initial, %bb7.preheader ], [ %4, %bb7 ]
  %iter.sroa.0.05 = phi i64 [ 2, %bb7.preheader ], [ %1, %bb7 ]
  %1 = add nuw i64 %iter.sroa.0.05, 1
  %_18 = add i64 %iter.sroa.0.05, -2
  %2 = getelementptr inbounds [0 x i64], ptr %fib.0, i64 0, i64 %_18
  %_17 = load i64, ptr %2, align 8
  %3 = getelementptr inbounds [0 x i64], ptr %fib.0, i64 0, i64 %iter.sroa.0.05
  %4 = add i64 %_17, %store_forwarded
  store i64 %4, ptr %3, align 8
  %exitcond.not = icmp eq i64 %1, %fib.1
  br i1 %exitcond.not, label %bb4, label %bb7
}

; Function Attrs: nonlazybind uwtable
declare noundef i32 @rust_eh_personality(i32, i32 noundef, i64, ptr, ptr) unnamed_addr #0

attributes #0 = { nonlazybind uwtable "target-cpu"="generic" "target-features"="+outline-atomics,+v8a" }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}

My local command is: opt -S -O3 -opt-bisect-limit=334 fibonacci_vec.ll -o fibonacci_vec-before-LoopUnrollPass.ll.

Using the Debug version of opt you can get the following log:

$ opt -S -debug-only=loop-unroll,aarch64tti -passes=loop-unroll --disable-output fibonacci_vec-before-LoopUnrollPass.ll 
Loop Unroll: F[_ZN13fibonacci_vec13fibonacci_vec17hb6b1caf079e597beE] Loop %bb7
getProcFamily'
 isOutOfOrder0
  Loop Size = 7
  will not try to unroll loop with runtime trip count -unroll-runtime not given

If it is x86_64 you get:

Loop Unroll: F[_ZN20fibonacci_vec_common13fibonacci_vec17h1c0a61e13e879c1bE] Loop %bb7
  Loop Size = 8
  runtime unrolling with count: 4
  Exiting block %bb7: TripCount=0, TripMultiple=1, BreakoutTrip=1
Trying runtime unrolling on Loop: 
Loop at depth 1 containing: %bb7<header><latch><exiting>
Using epilog remainder.
UNROLLING loop %bb7 by 4 with run-time trip count!

Adding an mcpu for aarch64：

attributes #0 = { nonlazybind uwtable "target-cpu"="cortex-a53" "target-features"="+outline-atomics,+v8a" }

The log is:

Loop Unroll: F[_ZN13fibonacci_vec13fibonacci_vec17hb6b1caf079e597beE] Loop %bb7
getProcFamily'
 isOutOfOrder0
  Loop Size = 7
  runtime unrolling with count: 4
  Exiting block %bb7: TripCount=0, TripMultiple=1, BreakoutTrip=1
Trying runtime unrolling on Loop: 
Loop at depth 1 containing: %bb7<header><latch><exiting>
Using epilog remainder.
Unrolling remainder loop
  Exiting block %bb7.epil: TripCount=0, TripMultiple=1, BreakoutTrip=1
COMPLETELY UNROLLING loop %bb7.epil with trip count 3!
UNROLLING loop %bb7 by 4 with run-time trip count!

You can be sure that Loop Unroll was executed.

By comparing this with *TargetTransformInfo.cpp of other architectures, it can be determined that aarch64 has some additional judgements.

if (ST->getProcFamily() != AArch64Subtarget::Others &&
  !ST->getSchedModel().isOutOfOrder()) {
  UP.Runtime = true;
  // ...
}

There is more discussion of this branch condition in D97947, and I am not familiar with the specific optimizations possible for aarch64. This may not be an issue.

As it stands, we can do Loop Unroll via mcpu in addition to passing the following parameters.

rustc --crate-type=lib --emit=llvm-ir -C opt-level=3 -C llvm-args=-unroll-runtime --target=aarch64-unknown-linux-gnu fibonacci_vec.rs -o fibonacci_vec.unroll-runtime.ll

Example: https://rust.godbolt.org/z/hqb8PhMrq.

rust-lang / rust