odin-lang / Odin

Odin Programming Language
https://odin-lang.org
BSD 3-Clause "New" or "Revised" License
6.55k stars · 570 forks

Straight while loop performs worse than javascript #309

Closed hasenj closed 4 years ago

hasenj commented 5 years ago

I'm filing this because I found this post on reddit where someone compared a simple function in both javascript and rust, and found the javascript version to run twice as fast as the rust version.

I decided to check how Odin fares here, and indeed Odin took twice as long as the javascript version running in Chrome's console.

I also tried to write it in C, and the C version was the fastest.

I think the Odin version should be performing on par with C.

Here's the javascript version:

function find_prime_iterative(initial) {
    let prime = 1; let curr = 1; let prime_count = 0;

    while (prime_count < initial) {
        for (let denom = 2; denom < curr; denom++) {
            if (curr % denom === 0) {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

let out = find_prime_iterative(20000)
console.log("result is:", out)

Here's output:

$ time node prim.js
result is: 112429

real    0m5.018s
user    0m4.969s
sys     0m0.028s

Here's the C version:

#include <stdio.h>

int find_prime_iterative (int initial) {
    int prime = 1;
    int curr = 1;
    int prime_count = 0;

    while (prime_count < initial) {
        for (int denom = 2; denom < curr; denom++) {
            if (curr % denom == 0) {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

int main() {
    int out = find_prime_iterative(20000);
    printf("result: %d\n", out);
    return 0;
}

Here's the timing:

$ cc prim.c

$ time ./a.out
result: 112429

real    0m3.588s
user    0m3.571s
sys     0m0.008s

Here's the odin version:

package main

import "core:fmt"

find_prime_iterative :: proc(initial: int) -> int {
    prime := 1;
    curr := 1;
    prime_count := 0;

    for prime_count < initial {
        for denom in 2..(curr - 1) {
            if curr % denom == 0 {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

main :: proc() {
    out := find_prime_iterative(20000);
    fmt.println("result:", out);
}

Here's the timing:

$ ~/code/Odin/odin build prim.odin

$ time ./prim
result: 112429

real    0m10.527s
user    0m10.420s
sys     0m0.033s

In summary

C version: ~3.5 seconds
JS version: ~5.0 seconds
Odin version: ~10.5 seconds

dotbmp commented 5 years ago

A couple of notes: JavaScript always has all optimizations on, whereas Odin, Rust and C need them turned on as options, so comparing with optimizations off may be an uneven comparison. C would gain some speed, and Odin might gain an even bigger bonus from optimizing out any potentially naive LLVM IR (no offense @gingerBill 😛).

You're also using cc here which is most likely gcc, while Odin and Rust use LLVM as a backend. Of course having faster compilers is a legitimate advantage in C's corner, but comparing against Clang might even the playing field somewhat.

JavaScript itself could be benefiting somewhat from JIT optimizations like hot/colding the inner conditional, though generally speaking in real-world programs those benefits are small.

Also, referencing this comment thread on the original reddit post, the original code is actually written wrong and Rust beat out JavaScript when the issue was corrected - by removing the ranged for. Odin might be suffering a similar disadvantage here, and a similar correction might resolve the performance discrepancy in part or in whole.

If Odin is still doing poorly in comparison to C or, god forbid, JS with these things taken into consideration, then there's a ripe opportunity for optimization.

hasenj commented 5 years ago

I know the code here is bad. That doesn't matter. This issue is not about how to find primes in Odin.

What matters is you have a kind of loop written in such a way as to take a very long time, and when you run it in Javascript it runs a lot faster than Odin.

Notice that there's nothing special in the loop. Almost everything in this function maps directly to an assembly instruction: increment a number, take the mod of two numbers, compare two numbers, loop.

No struct access, no memory allocations, no strings, no arrays. Nothing.

You're also using cc here which is most likely gcc

The 'cc' on my machine is LLVM I believe, because I'm on mac, and because:

$ cc --version
Apple LLVM version 10.0.0 (clang-1000.11.45.5)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Although I do think LLVM might be to blame here, because it does the same thing to both Odin and Rust, both which use the same LLVM backend.

JavaScript always has all optimizations on, whereas Odin, Rust and C need them turned on as options

Really? That's a weird statement to make. Javascript is pretty slow.

Your statement makes it sound like Javascript should be a really good choice of programming language for people who care about performance.

JavaScript itself could be benefiting somewhat from JIT optimizations like hot/colding the inner conditional, though generally speaking in real-world programs those benefits are small.

If JIT was such a great thing we should stop writing anything in a low level language and just ride the node.js wave.

Like I said, this code is pretty straightforward and easy to translate into very simple assembly instructions by hand.

What could a JIT possibly be doing here?

Also notice the C version is faster than the JS version, so, really, the JIT is not a good excuse.

Sorry if I sound angry but I really find it hard to buy the argument that "JIT" can make managed code faster than native code.

The issue here is clearly the output of the compiler is very suboptimal machine code.

Also note, adding -opt=3 to odin does NOT improve the result at all. It just makes the compiler take longer.

refi64 commented 5 years ago

What could a JIT possibly be doing here?

FWIW, both Node and PyPy have been observed to be faster than C on certain tight number loops, simply because the JIT can infer things about them that an AoT compiler can't.

gingerBill commented 5 years ago

Okay, it is probably not slower at all but how you are timing it is flawed. Odin does have a bit of a shit startup for the runtime (initializing the runtime type info).

Time the following program and determine the difference:

package main

import "core:fmt"

main :: proc() {
    out: int = 123;
    fmt.println("result:", out);
}
vassvik commented 5 years ago

Odin startup is relatively fast, so that's not it here. Somewhere around a hundred ms at most.

hasenj commented 5 years ago

@gingerBill here's the timing for the blank file:

$ ~/code/odin/odin build blank.odin

$ time ./blank 
result: 123

real    0m0.005s
user    0m0.002s
sys     0m0.002s

As you see, odin's startup time doesn't really contribute anything here.

Okay, it is probably not slower at all but how you are timing it is flawed.

I might be making a mistake somewhere, who knows :)

But, the code is small. You don't have to take my word for it. Run the C program and time it yourself, then translate it to Odin yourself and time that.

gingerBill commented 5 years ago

@hasenj Sorry to be pedantic but when you say blank, how blank? Did you use my example code above or just an empty main? They are not at all the same and the minimum dependency system will make the empty main a trivial program.

I am not doubting you regarding the speed of your tests. I would like to make one change and that is regarding the inner for loop. To make it a fair comparison, I recommend the following change:

package main

import "core:fmt"

find_prime_iterative :: proc(initial: int) -> int {
    prime := 1;
    curr := 1;
    prime_count := 0;

    for prime_count < initial {
        for denom := 2; denom < curr; denom += 1 {
            if curr % denom == 0 {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

main :: proc() {
    out := find_prime_iterative(20000);
    fmt.println("result:", out);
}

I want to see if the issue is with the for in or just loops in general.

hasenj commented 5 years ago

The blank is your example, see the output 'result: 123'

Regarding your suggested change:

for denom := 2; denom <= curr; denom += 1 {

I did it this way and there was no change in the result.

Please note though, it should be denom < curr; because denom <= curr will loop way longer (possibly infinite loop?).

So, it doesn't matter whether the inner loop is this one

for denom := 2; denom < curr; denom += 1

or this one

for denom in 2..(curr - 1)

There's no visible change in the performance characteristics.

gingerBill commented 5 years ago

<= was a typo. Also, it is good to know that the for in approach is not slower. So I wonder what is going on here.

Tetralux commented 5 years ago

Tested the C program on Windows x64 with both clang 6.0.1, and GCC 8.1.0.

With clang -O0, gcc -O0, and gcc -O3, the programs all ran in approximately 2 seconds.

In every other combination of the -O switch, with both clang and GCC, they all took the same amount of time as Odin; about 7 seconds.

I also tried Odin with -opt=0 ... no difference; 7 seconds.

Perhaps it's a bad optimization, or an interaction between two?

hasenj commented 5 years ago

For what it's worth I think I managed to extract the disassembly via lldb.

I have no idea how to read it, but I hope it can be useful to someone.

Here's the C version:


(lldb) disassemble
a.out`find_prime_iterative:
->  0x100000ec0 <+0>:   pushq  %rbp
    0x100000ec1 <+1>:   movq   %rsp, %rbp
    0x100000ec4 <+4>:   movl   %edi, -0x4(%rbp)
    0x100000ec7 <+7>:   movl   $0x1, -0x8(%rbp)
    0x100000ece <+14>:  movl   $0x1, -0xc(%rbp)
    0x100000ed5 <+21>:  movl   $0x0, -0x10(%rbp)
    0x100000edc <+28>:  movl   -0x10(%rbp), %eax
    0x100000edf <+31>:  cmpl   -0x4(%rbp), %eax
    0x100000ee2 <+34>:  jge    0x100000f49               ; <+137>
    0x100000ee8 <+40>:  movl   $0x2, -0x14(%rbp)
    0x100000eef <+47>:  movl   -0x14(%rbp), %eax
    0x100000ef2 <+50>:  cmpl   -0xc(%rbp), %eax
    0x100000ef5 <+53>:  jge    0x100000f2c               ; <+108>
    0x100000efb <+59>:  movl   -0xc(%rbp), %eax
    0x100000efe <+62>:  cltd   
    0x100000eff <+63>:  idivl  -0x14(%rbp)
    0x100000f02 <+66>:  cmpl   $0x0, %edx
    0x100000f05 <+69>:  jne    0x100000f19               ; <+89>
    0x100000f0b <+75>:  movl   -0xc(%rbp), %eax
    0x100000f0e <+78>:  addl   $0x1, %eax
    0x100000f11 <+81>:  movl   %eax, -0xc(%rbp)
    0x100000f14 <+84>:  jmp    0x100000f1e               ; <+94>
    0x100000f19 <+89>:  jmp    0x100000f1e               ; <+94>
    0x100000f1e <+94>:  movl   -0x14(%rbp), %eax
    0x100000f21 <+97>:  addl   $0x1, %eax
    0x100000f24 <+100>: movl   %eax, -0x14(%rbp)
    0x100000f27 <+103>: jmp    0x100000eef               ; <+47>
    0x100000f2c <+108>: movl   -0xc(%rbp), %eax
    0x100000f2f <+111>: movl   %eax, -0x8(%rbp)
    0x100000f32 <+114>: movl   -0xc(%rbp), %eax
    0x100000f35 <+117>: addl   $0x1, %eax
    0x100000f38 <+120>: movl   %eax, -0xc(%rbp)
    0x100000f3b <+123>: movl   -0x10(%rbp), %eax
    0x100000f3e <+126>: addl   $0x1, %eax
    0x100000f41 <+129>: movl   %eax, -0x10(%rbp)
    0x100000f44 <+132>: jmp    0x100000edc               ; <+28>
    0x100000f49 <+137>: movl   -0x8(%rbp), %eax
    0x100000f4c <+140>: popq   %rbp
    0x100000f4d <+141>: retq   
    0x100000f4e <+142>: nop    

Here's the Odin version:

(lldb) disassemble 
prim`main.find_prime_iterative:
->  0x8160 <+0>:   movq   %rdi, -0x8(%rsp)
    0x8165 <+5>:   movq   $0x1, -0x18(%rsp)
    0x816e <+14>:  movq   $0x1, -0x28(%rsp)
    0x8177 <+23>:  movq   $0x0, -0x38(%rsp)
    0x8180 <+32>:  movq   -0x38(%rsp), %rax
    0x8185 <+37>:  cmpq   -0x8(%rsp), %rax
    0x818a <+42>:  setl   %cl
    0x818d <+45>:  andb   $0x1, %cl
    0x8190 <+48>:  testb  $0x1, %cl
    0x8193 <+51>:  je     0x821f                    ; <+191>
    0x8199 <+57>:  movq   $0x2, -0x48(%rsp)
    0x81a2 <+66>:  movq   -0x48(%rsp), %rax
    0x81a7 <+71>:  cmpq   -0x28(%rsp), %rax
    0x81ac <+76>:  setl   %cl
    0x81af <+79>:  andb   $0x1, %cl
    0x81b2 <+82>:  testb  $0x1, %cl
    0x81b5 <+85>:  je     0x81f4                    ; <+148>
    0x81b7 <+87>:  movq   -0x28(%rsp), %rax
    0x81bc <+92>:  cqto   
    0x81be <+94>:  idivq  -0x48(%rsp)
    0x81c3 <+99>:  cmpq   $0x0, %rdx
    0x81c7 <+103>: sete   %cl
    0x81ca <+106>: andb   $0x1, %cl
    0x81cd <+109>: testb  $0x1, %cl
    0x81d0 <+112>: je     0x81e2                    ; <+130>
    0x81d2 <+114>: movq   -0x28(%rsp), %rax
    0x81d7 <+119>: addq   $0x1, %rax
    0x81db <+123>: movq   %rax, -0x28(%rsp)
    0x81e0 <+128>: jmp    0x81e4                    ; <+132>
    0x81e2 <+130>: jmp    0x81e4                    ; <+132>
    0x81e4 <+132>: movq   -0x48(%rsp), %rax
    0x81e9 <+137>: addq   $0x1, %rax
    0x81ed <+141>: movq   %rax, -0x48(%rsp)
    0x81f2 <+146>: jmp    0x81a2                    ; <+66>
    0x81f4 <+148>: movq   -0x28(%rsp), %rax
    0x81f9 <+153>: movq   %rax, -0x18(%rsp)
    0x81fe <+158>: movq   -0x28(%rsp), %rax
    0x8203 <+163>: addq   $0x1, %rax
    0x8207 <+167>: movq   %rax, -0x28(%rsp)
    0x820c <+172>: movq   -0x38(%rsp), %rax
    0x8211 <+177>: addq   $0x1, %rax
    0x8215 <+181>: movq   %rax, -0x38(%rsp)
    0x821a <+186>: jmp    0x8180                    ; <+32>
    0x821f <+191>: movq   -0x18(%rsp), %rax
    0x8224 <+196>: retq   
    0x8225 <+197>: nop    
    0x8226 <+198>: nop    
    0x8227 <+199>: nop    
    0x8228 <+200>: nop    
    0x8229 <+201>: nop    
    0x822a <+202>: nop    
    0x822b <+203>: nop    
    0x822c <+204>: nop    
    0x822d <+205>: nop    
    0x822e <+206>: nop    
    0x822f <+207>: nop  
gingerBill commented 5 years ago

It appears the Odin assembly actually has a bunch of redundant instructions (not the nops) but some of this is due to a hack I had to add because of LLVM.

I think the issue here is that LLVM uses i1 for its boolean type; however, i1 is 1 bit wide, and I wanted 8-bit-wide booleans for Odin. This means that whenever an i1 is needed, a conversion between bool (i8) and i1 is required.

gingerBill commented 5 years ago

@hasenj Could you please retry this? I have done some minor adjustments to the IR generation code and there have been some major speed increases, especially for for loops.

dotbmp commented 5 years ago

@hasenj I dislike JavaScript as much as anybody but you shouldn't let hate blind you to reason. The modern JS VMs are extremely well-engineered and make the best they can of a bad language, and JIT does objectively provide optimizations in some circumstances that native programming does not, since code paths can be monitored, profiled, and edited at a machine-code level at runtime to take advantage of how the code is actually behaving in practice. None of this makes JavaScript a good choice for code that needs to be performant but when you're comparing the performance of one tiny piece of code, which could easily be an edge case, it's useful to understand the full context of the situation.

As it turns out fixing the code actually caused Rust's performance to win out over JS in the original post, so it's quite relevant that the code was incorrectly written. It's also not a 1:1 equivalency as Rust/Odin are using ranged for while JavaScript is using C-style for; as it turns out ranged for is about 3x slower than C-style for in Odin.

Of course none of this means that Odin and Rust don't have potential performance problems and room for improvement, and I said as much. I don't think anything in this thread warrants a reactionary backlash against either language for potentially being slower than another language in certain circumstances, which it turns out they are not, after all (Rust, at least).

dotbmp commented 5 years ago

Also, int in Odin is word-sized; so 64-bit in x64. Correct equivalency with the C code would be i32, or for the C code to instead use long long int or ssize_t or an equivalent.

hasenj commented 5 years ago

@gingerBill Sorry for the late response.

I did try the latest compiler but I don't see any noticeable change. The program still takes 10 seconds to run on my machine.

@bpunsky like I said, the C version beats the super smart JIT optimizer so this point is moot. Also, changing from "ranged" for to straight c-style for loop has no impact on performance. If I understand correctly, the odin compiler translates the ranged for loop to a straight c-style for loop.

While this sample code might be dumb, what it's doing is not unreasonable: looping for a long time to look for something. It seems 3x slower than it should be (compared to the C version). Imagine if the body of the loop was doing something more important.

I don't have enough "real world" experience to judge whether this matters or not, but my instinct says it does.

hasenj commented 5 years ago

@bpunsky wow, your last comment was spot on.

@gingerBill performance improved when I changed all numbers to i32. This code runs as fast as the C version (about 3 seconds on my machine).

The only thing I changed is explicitly make all integers of type i32

find_prime_iterative :: proc (initial: i32) -> i32 {
    prime : i32 = 1;
    curr : i32 = 1;
    prime_count : i32 = 0;

    for prime_count < initial {
        for denom : i32 = 2; denom < curr; denom += 1 {
            if (curr % denom == 0) {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

And, indeed, changing the C version to use ssize_t instead of int makes it take 10 seconds to run.

#include <stdio.h>

ssize_t find_prime_iterative (ssize_t initial) {
    ssize_t prime = 1;
    ssize_t curr = 1;
    ssize_t prime_count = 0;

    while (prime_count < initial) {
        for (ssize_t denom = 2; denom < curr; denom++) {
            if (curr % denom == 0) {
                curr = curr + 1;
                continue;
            }
        }
        prime = curr;
        curr = curr + 1;
        prime_count = prime_count + 1;
    }

    return prime;
}

int main() {
    ssize_t out = find_prime_iterative(20000);
    printf("result: %ld\n", out);
    return 0;
}
gingerBill commented 5 years ago

@hasenj That's really interesting. So it seems the issue has nothing to do with Odin, just with the type you use for the index value. This looks like some weird optimization thing, as "register sized" integers should be as fast or faster than "non-register sized" ones.

hasenj commented 5 years ago

For completeness, here's the disassembly from the i32 odin version

(lldb) disassemble 
prim`main.find_prime_iterative:
->  0x74d0 <+0>:   movl   %edi, -0x8(%rsp)
    0x74d4 <+4>:   movl   $0x1, -0x18(%rsp)
    0x74dc <+12>:  movl   $0x1, -0x28(%rsp)
    0x74e4 <+20>:  movl   $0x0, -0x38(%rsp)
    0x74ec <+28>:  movl   -0x38(%rsp), %eax
    0x74f0 <+32>:  cmpl   -0x8(%rsp), %eax
    0x74f4 <+36>:  setl   %cl
    0x74f7 <+39>:  andb   $0x1, %cl
    0x74fa <+42>:  testb  $0x1, %cl
    0x74fd <+45>:  je     0x7570                    ; <+160>
    0x74ff <+47>:  movl   $0x2, -0x48(%rsp)
    0x7507 <+55>:  movl   -0x48(%rsp), %eax
    0x750b <+59>:  cmpl   -0x28(%rsp), %eax
    0x750f <+63>:  setl   %cl
    0x7512 <+66>:  andb   $0x1, %cl
    0x7515 <+69>:  testb  $0x1, %cl
    0x7518 <+72>:  je     0x754d                    ; <+125>
    0x751a <+74>:  movl   -0x28(%rsp), %eax
    0x751e <+78>:  cltd   
    0x751f <+79>:  idivl  -0x48(%rsp)
    0x7523 <+83>:  cmpl   $0x0, %edx
    0x7526 <+86>:  sete   %cl
    0x7529 <+89>:  andb   $0x1, %cl
    0x752c <+92>:  testb  $0x1, %cl
    0x752f <+95>:  je     0x753e                    ; <+110>
    0x7531 <+97>:  movl   -0x28(%rsp), %eax
    0x7535 <+101>: addl   $0x1, %eax
    0x7538 <+104>: movl   %eax, -0x28(%rsp)
    0x753c <+108>: jmp    0x7540                    ; <+112>
    0x753e <+110>: jmp    0x7540                    ; <+112>
    0x7540 <+112>: movl   -0x48(%rsp), %eax
    0x7544 <+116>: addl   $0x1, %eax
    0x7547 <+119>: movl   %eax, -0x48(%rsp)
    0x754b <+123>: jmp    0x7507                    ; <+55>
    0x754d <+125>: movl   -0x28(%rsp), %eax
    0x7551 <+129>: movl   %eax, -0x18(%rsp)
    0x7555 <+133>: movl   -0x28(%rsp), %eax
    0x7559 <+137>: addl   $0x1, %eax
    0x755c <+140>: movl   %eax, -0x28(%rsp)
    0x7560 <+144>: movl   -0x38(%rsp), %eax
    0x7564 <+148>: addl   $0x1, %eax
    0x7567 <+151>: movl   %eax, -0x38(%rsp)
    0x756b <+155>: jmp    0x74ec                    ; <+28>
    0x7570 <+160>: movl   -0x18(%rsp), %eax
    0x7574 <+164>: retq   
    0x7575 <+165>: nop    
    0x7576 <+166>: nop    
    0x7577 <+167>: nop    
    0x7578 <+168>: nop    
    0x7579 <+169>: nop    
    0x757a <+170>: nop    
    0x757b <+171>: nop    
    0x757c <+172>: nop    
    0x757d <+173>: nop    
    0x757e <+174>: nop    
    0x757f <+175>: nop  
Tetralux commented 5 years ago

I'd be interested in a follow-up on this to explain what was actually going on, assuming it's fixed, for future reference.