roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0

Low level "bit-twiddling" performance... #6425

Closed: GordonBGood closed this issue 1 month ago

GordonBGood commented 7 months ago

Description

Some published performance benchmarks show that Roc holds its own running a simple QuickSort over various amounts of random data, in spite of being purely functional with persistent data structures, because it mutates in place whenever the mutation cannot be observed "from the outside". However, sorting random data doesn't really push low-level speed, as much of the time is taken by missed branch predictions, which can be seen from the fastest compared language (C++) being less than twice as fast as the slowest (JavaScript).

To properly evaluate this, a benchmark that depends only on raw low-level CPU computation speed is more appropriate. For this I use a version of the odds-only Sieve of Eratosthenes over a bit buffer no larger than the CPU L1 cache; in C/C++ this benchmark runs at about 1.5 CPU clock cycles per culling operation on a modern CPU. My best attempt at it in Roc is about 20 times slower, at roughly 30 CPU clock cycles per operation.

Roc doesn't expose anything called a contiguous array, but I checked the implementation of Roc Lists and that is what they actually are, as implied in some of the documentation. I provide both my Roc code and the C code for this benchmark below so it can be seen that they both do the same thing. Some will immediately suggest profiling the program to identify the "hot spot", but it is obvious that this program spends almost all of its time in the innermost simple culling loop. Memory management is also not an issue here, as the benchmark allocates just one List and then runs the culling over it for at least hundreds of milliseconds.

I am new to Roc, and it is possible that I am missing some optimization techniques or compiler settings that are needed for performance.

My Testing Environment

Fedora 39 Linux x86_64 with the prerequisites as per the installation instructions.

The nightly build of Roc as of 21 Jan 2024 decompressed with the File extract tool.

The benchmark example is put into the Examples folder inside the extracted folder, and the Roc compiler is run inside that folder with the following command: "../roc build --optimize prime-bench.roc" in a terminal.

The program is then run from a terminal as follows: "./prime-bench".

For C, the program is built using GCC with the following command in a terminal: "gcc -O3 -o prime-bench prime-bench.c"

The C-built program is then run in the same way as the Roc-built one is.

The Benchmark Codes

Note that for both codes, the culling is run 1000 times for timing accuracy:

Roc (edited to clean up the code):

app "prime-bench"
    packages { pf: "https://github.com/roc-lang/basic-cli/releases/download/0.7.1/Icc3xJoIixF3hCcfXrDwLCu4wQHtNdPyoJkEbkgIElA.tar.br" }
    imports [pf.Stdout, pf.Utc.{ now, deltaAsMillis }, pf.Task.{ await }]
    provides [main] to pf

primeBench : {} -> I32
primeBench = \ _ ->
    loop : I32, List U8 -> I32
    loop = \ n, cmpsts ->
        count : Nat, I32 -> I32
        count = \ i, sum ->
            if i >= 131072 then sum else
            elem = List.get cmpsts (i |> Num.shiftRightBy 3) |> Result.withDefault 0
            cnt = elem |> Num.shiftRightBy (i |> Num.bitwiseAnd 7 |> Num.intCast)
            count (i + 1) (sum - (cnt |> Num.bitwiseAnd 1 |> Num.intCast))
        if n == 0 then count 0 131073 else
        loopi = \ i, cmps ->
            s0 = (i + i) * (i + 3) + 3
            if s0 >= 131072 then cmps else
            elem = List.get cmps (i |> Num.shiftRightBy 3) |> Result.withDefault 0
            msk = 1 |> Num.shiftLeftBy (i |> Num.bitwiseAnd 7 |> Num.intCast)
            if (elem |> Num.bitwiseAnd msk) != 0 then loopi (i + 1) cmps else
            bp = i + i + 3
            slmt = s0 + bp * 8
            loops = \ s, cs ->
                if s >= slmt then cs else
                smsk = 1 |> Num.shiftLeftBy (s |> Num.bitwiseAnd 7 |> Num.intCast)
                updtf = \ a -> a |> Num.bitwiseOr smsk
                cull = \ c, cp ->
                    if c >= 16384 then cp else
                    cull (c + bp) (List.update cp c updtf)
                loops (s + bp) (cull (s |> Num.shiftRightBy 3) cs)
            loopi (i + 1) (loops s0 cmps)
        loop (n - 1) (loopi 0 cmpsts)
    loop 1000 (List.repeat 0u8 16384)

main : Task.Task {} e
main =
    start <- now |> await
    answr = primeBench {}
    stop <- now |> await
    elpsdStr = deltaAsMillis start stop |> Num.toStr
    "Found \(answr |> Num.toStr) primes to 262146 in \(elpsdStr) milliseconds."
      |> Stdout.line

The equivalent C code:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int primesBench() {
    unsigned char *cmpsts = malloc(16384);
    for (int i = 0; i < 16384; ++i) cmpsts[i] = 0;
    for (int n = 1; n <= 1000; ++n) {
        for (int i = 0; ; ++i) {
            int s0 = (i + i) * (i + 3) + 3;
            if (s0 >= 131072) break;
            if (cmpsts[i >> 3] & (1 << (i & 7))) continue;
            int bp = i + i + 3;
            int slmt = s0 + 8 * bp;
            for (int s = s0; s < slmt; s += bp) {
                unsigned char msk = 1 << (s & 7);
                for (int c = s >> 3; c < 16384; c += bp) cmpsts[c] |= msk;
            }
        }
    }
    int cnt = 131073;
    for (int i = 0; i < 131072; ++i) cnt -= cmpsts[i >> 3] >> (i & 7) & 1;
    free(cmpsts);
    return cnt;
}

int main() {
    clock_t start = clock();
    int answr = primesBench();
    double elpsd = (double)(clock() - start) / CLOCKS_PER_SEC * 1000;
    printf("Found %d primes up to 262146 in %f milliseconds.\r\n", answr, elpsd);
    return 0;
}
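
A note on the indexing shared by both programs, which may help in reading them: bit i of the buffer represents the odd number 2 * i + 3, so bit index i corresponds to base prime bp = 2 * i + 3, and the first composite that still needs culling (bp squared) sits at bit index (bp * bp - 3) / 2 = (i + i) * (i + 3) + 3, which is the s0 expression in both listings. The 131072 bits thus cover the odd numbers 3 through 262145, and the final count starts at 131073 (one extra for the prime 2) and subtracts one per set bit. A minimal standalone C check of that index arithmetic (not part of either benchmark):

#include <assert.h>
#include <stdio.h>

int main() {
    for (int i = 0; i < 256; ++i) {
        int bp = 2 * i + 3;                /* odd number represented by bit index i */
        int s0 = (i + i) * (i + 3) + 3;    /* start index used in both benchmarks   */
        /* bp * bp is the first composite not already culled by a smaller base
           prime; its bit index is (bp * bp - 3) / 2, which must equal s0. */
        assert(s0 == (bp * bp - 3) / 2);
    }
    printf("index mapping verified for the first 256 bit indices\n");
    return 0;
}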

The Results

For Roc: Found 23000 primes to 262146 in 1648 milliseconds.

For C: Found 23000 primes up to 262146 in 68.081000 milliseconds.

Currently, (Jan 21, 2024), Roc is over twenty times slower than the equivalent C code.

These results were as run on an AMD 7840HS CPU running at 5.1 GHz. As there are about 200,000 culling operations per culling pass, there are about 200 million culling operations in the benchmark, so it works out to about 1.73 CPU clock cycles per operation for the C version and about 42 CPU clock cycles per operation for the Roc version.
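
As a quick sanity check of those figures, cycles per operation is just elapsed seconds times clock frequency divided by the operation count. A tiny standalone C calculation using the numbers above (the ~200 million operation count and the 5.1 GHz clock are the estimates from this comment, not measured values):

#include <stdio.h>

int main() {
    double clock_hz = 5.1e9;   /* assumed single-core clock of the 7840HS      */
    double ops = 200.0e6;      /* ~200,000 cullings per pass times 1000 passes */
    double c_ms = 68.081;
    double roc_ms = 1648.0;
    printf("C:   %.2f clock cycles per culling operation\n", c_ms / 1000.0 * clock_hz / ops);
    printf("Roc: %.2f clock cycles per culling operation\n", roc_ms / 1000.0 * clock_hz / ops);
    return 0;
}

which prints about 1.7 for C and about 42 for Roc.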

Expected Results

Roc benchmark times more in line with those of C.

Comments

This may be due to optimizations not yet enabled in Roc; for instance, if the call to the List.update function in the culling loop were not inlined, that would explain much of the slowdown for Roc. There may also be overflow and/or bounds checks that the compiler is not eliding as redundant, which could account for much of the remaining discrepancy.

It may be useful to add this benchmark to the test suites to check for performance progress and/or regressions.

GordonBGood commented 7 months ago

Adding another comparison: F# via Fable (an almost purely functional language, apart from optional mutability of bindings, always-mutable array contents, and uncontrolled IO side effects) can transpile F# code to Rust, with the following code implementing the same benchmark as the OP:

let cLOOPS = 1000

let primesBench() =
  let cmpsts = Array.zeroCreate 16384
  let rec loopn n =
    if n < cLOOPS then
      let rec loopi i =
        let s0 = (i + i) * (i + 3) + 3
        if s0 < 131072 then
          if cmpsts.[i >>> 3] &&& (1uy <<< (i &&& 7)) <> 0uy then loopi (i + 1) else
          let bp = i + i + 3
          let slmt = s0 + 8 * bp
          let rec loops s =
            if s < slmt then
              let msk = 1uy <<< (s &&& 7)
              let rec loopc c =
                if c < 16384 then
                  cmpsts.[c] <- cmpsts.[c] ||| msk
                  loopc (c + bp)
              loopc (s >>> 3); loops (s + bp)
          loops s0; loopi (i + 1)
      loopi 0; loopn (n + 1)      
  loopn 0
  {0 .. 131071} 
    |> Seq.fold
      (fun a i -> a - ((int cmpsts.[i >>> 3] >>> (i &&& 7)) &&& 1))
      131073

[<EntryPoint>]
let main args = 
    printfn "PrimesBench in F#"
    let start = System.DateTime.Now.Ticks
    let answr = primesBench()
    let elpsd = (System.DateTime.Now.Ticks - start) / 10000L
    printfn "Found %d primes to 262146 for %d loops in %d milliseconds." answr cLOOPS elpsd
    0

On the same computer as in the OP, this has the following run times:

  1. Compiled to JavaScript (optimized): 197 milliseconds.
  2. Compiled to DotNet IR (Release mode): 213 milliseconds.
  3. Compiled to Rust (release/optimized): 169 milliseconds.

The above Rust results, which are a little faster than running the JavaScript output on Node.js or the DotNet 7.0 IR output, are perhaps a better comparison to Roc, since Rust performs an automatic bounds check on every array access. That is one of the main reasons it is slower than the C version, along with it reloading the base array pointer on every inner culling loop iteration (it doesn't "lift" that load out of the loop as it should). Using Rust's unsafe mutable pointers is not currently an option from Fable, as it doesn't support the interop package that would allow that.

lukewilliamboswell commented 7 months ago

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast? On Linux I think the dev backend is the current default, which produces very much unoptimised code.

I would also suggest sharing this analysis on Zulip; it looks really interesting. I always enjoy the resulting discussion on performance topics. 😃

GordonBGood commented 7 months ago

@lukewilliamboswell,

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast?

Yes, the run times quoted in the OP were all from optimized builds; without optimization it is about four times slower than this.

I assumed that Roc must be using LLVM for the optimized builds, although the documentation didn't list LLVM as a prerequisite. Fedora 39 currently uses LLVM 17 as its distro standard, so that is what is installed.

bhansconnect commented 7 months ago

I am gonna make a guess that this has nothing to do with bit twiddling. Probably an issue with closures or some sort of recursion doing work it shouldn't.

Also, might just be a failure of opportunistic mutation.

bhansconnect commented 7 months ago

Can you pull the various loop functions into top level functions instead of nested functions and then test it? I am curious how that plays out.

GordonBGood commented 7 months ago

@bhansconnect,

I am gonna make a guess that this has nothing to do with bit twiddling.

Yes, I agree that it likely doesn't have to do with the bit twiddling itself; it's just that the bit twiddling takes so very little time that it really shows up the overheads.

Probably an issue with closures or some sort of recursion doing work it shouldn't.

Well, all the work is done in the innermost tiny loop closure, and that is barely a closure, only capturing the update function from outside its scope. I tried passing that update function in as an argument so the cull function isn't a closure, but no improvement...

Also, might just be a failure of opportunistic mutation.

Well, that would be your issue then, as the way the benchmark is constructed should produce in place mutation. I don't think it can be copying the whole array on every mutation, although I suppose it could be copying the array pointer more often than it should (as in never)...

If I remove the update function, it is fast although the answer is incorrect (of course)...

Is there an easy way to view the assembly emitted by LLVM?

GordonBGood commented 7 months ago

@bhansconnect,

Can you pull the various loop functions into top level functions instead of nested functions and then test it?

There isn't much chance that that is the problem, as it is pretty fast when I just remove the List.update from the inner "loop". But just to be sure, I added bp as an argument to really make it not a closure and moved it outside the benchmark function, with no change; I even added a type signature, to no effect...

bhansconnect commented 7 months ago

Adding this to the issue. I think I have figured out what is going on.


Ok I have been digging into the IR and here are the two main problems from what I can see:

  1. List.get is a regular function, not a low-level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and inside the function we decrement it (checking each time whether we need to deallocate). A rough C analogue of this per-call overhead is sketched after the timings below.
  2. The same root cause as #6213: having allocas in locations other than the entry block stops LLVM from being able to inline List.get. LLVM is smart enough to realize that inlining a function with allocas in the wrong place could lead to a stack space explosion, so it refuses to inline the function unless the allocas are fixed.

So I did some testing to verify. First, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the LLVM IR to fix the second issue. The result: Found 23000 primes to 262146 in 103 milliseconds.

Note: both changes are needed for the perf.

No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
Both: 100ms
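
To get a feel for why those two issues dominate a loop whose useful work is a single OR into a byte, here is a rough C analogue (purely illustrative, not Roc's actual code generation): it wraps each one-byte update in a non-inlined call that does a refcount increment/decrement pair and a deallocation check, and compares that against the direct store the optimizer would otherwise emit. The FakeList type and fake_list_update_or function are hypothetical stand-ins, not anything from Roc's runtime:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative stand-in for a non-borrowed, non-inlined list update:
   bump a refcount on entry, do the one-byte update, then drop the
   refcount and check whether deallocation is needed on exit. */
typedef struct { int64_t refcount; uint8_t *bytes; } FakeList;

__attribute__((noinline))
static void fake_list_update_or(FakeList *list, int index, uint8_t mask) {
    list->refcount += 1;                         /* increment before the access */
    list->bytes[index] |= mask;                  /* the actual useful work      */
    list->refcount -= 1;                         /* decrement afterwards        */
    if (list->refcount == 0) free(list->bytes);  /* deallocation check          */
}

int main() {
    FakeList list = { 1, calloc(16384, 1) };

    clock_t start = clock();
    for (int n = 0; n < 1000; ++n)
        for (int c = 0; c < 16384; ++c)
            fake_list_update_or(&list, c, 1);    /* call + refcount per update */
    double with_call = (double)(clock() - start) / CLOCKS_PER_SEC * 1000;

    start = clock();
    for (int n = 0; n < 1000; ++n)
        for (int c = 0; c < 16384; ++c)
            list.bytes[c] |= 1;                  /* inlined direct store */
    double direct = (double)(clock() - start) / CLOCKS_PER_SEC * 1000;

    printf("call + refcount per update: %.1f ms, direct store: %.1f ms\n", with_call, direct);
    free(list.bytes);
    return 0;
}

The absolute numbers will vary by machine and compiler, but the gap between the two loops gives a sense of how much a per-element call plus refcount traffic costs relative to the culling work itself.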

bhansconnect commented 1 month ago

I was just looking at this again.

On my machine, C takes 70ms and Roc takes 98ms. I think this is close enough that we can close this issue.

Benchmark 1: /tmp/tmp-roc
  Time (mean ± σ):      97.8 ms ±   3.5 ms    [User: 95.4 ms, System: 1.1 ms]
  Range (min … max):    96.5 ms … 131.5 ms    100 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: /tmp/tmp-c
  Time (mean ± σ):      70.3 ms ±   0.4 ms    [User: 69.2 ms, System: 0.5 ms]
  Range (min … max):    69.9 ms …  72.3 ms    100 runs

bhansconnect commented 1 month ago

Oh wow, on my Linux x86 machine it is even closer:

Benchmark 1: /tmp/tmp-c
  Time (mean ± σ):      74.8 ms ±   0.7 ms    [User: 74.1 ms, System: 0.9 ms]
  Range (min … max):    73.7 ms …  77.4 ms    100 runs

Benchmark 2: /tmp/tmp-gcc
  Time (mean ± σ):      81.9 ms ±   0.6 ms    [User: 81.2 ms, System: 0.9 ms]
  Range (min … max):    80.9 ms …  84.4 ms    100 runs

Benchmark 3: /tmp/tmp-roc
  Time (mean ± σ):      83.6 ms ±   0.7 ms    [User: 82.4 ms, System: 1.3 ms]
  Range (min … max):    82.5 ms …  86.3 ms    100 runs

Summary
  /tmp/tmp-c ran
    1.09 ± 0.01 times faster than /tmp/tmp-gcc
    1.12 ± 0.01 times faster than /tmp/tmp-roc

GordonBGood commented 1 month ago

Hi Brendan (@bhansconnect):

Oh wow, on my linux x86 machine it is even closer...

Yes, that looks "Good Enough For Me", with the small difference perhaps related to not yet eliding bounds checks?

Will this be immediately available to me to check on Roc nightlies?

Anton-4 commented 1 month ago

Will this be immediately available to me to check on Roc nightlies?

It's now available in a new TESTING release (see assets here). If you want to use e.g. basic-cli with the TESTING release, you'll need to clone the basic-cli repo and reference the platform locally. Note that platform building is different for these TESTING releases.

GordonBGood commented 1 month ago

It's now available in a new TESTING release...

Thanks @Anton-4, it works as advertised. On my machine (AMD 7840HS at 5.1 GHz when single-threaded), the Roc version runs at about the same speed as the C version compiled with GCC, and even a little faster than the C version compiled with Clang - right around 70 milliseconds for all...