lcompilers / lpython

Python compiler
https://lpython.org/
Other
1.51k stars 164 forks source link

List append benchmark #857

Open certik opened 2 years ago

certik commented 2 years ago

Here is a simple benchmark for appending to a list in Python:

from ltypes import i32

def test_list():
    a: list[i32] = [0, 1, 2, 3, 4]
    n: i32 = 100000000
    i: i32
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

and C++:

#include <vector>
#include <iostream>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    std::cout << a[n] << std::endl;
    return 0;
}

Results on Apple M1 Max (I ran each benchmark many times, took the lowest numbers):

$ clang++ -std=c++17 -O3 a.cpp
$ time ./a.out
100000000
./a.out  0.07s user 0.09s system 94% cpu 0.170 total
$ g++ -O3 -march=native  a.cpp
$ time ./a.out
100000000
./a.out  0.06s user 0.09s system 93% cpu 0.164 total
$ lpython --fast a.py
$ time ./a.out
100000000
./a.out  0.05s user 0.03s system 93% cpu 0.089 total
$ time PYTHONPATH=../src/runtime/ltypes python a.py                 
100000000
PYTHONPATH=../src/runtime/ltypes python a.py  4.21s user 0.51s system 99% cpu 4.730 total
Compiler Time Relative
LPython 0.089 1.0
g++ 0.164 1.84
Clang++ 0.170 1.91
Python 4.730 53.15

Versions:

$ clang++ --version
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ g++ --version              
g++ (Spack GCC) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lpython --version
LPython version: 0.3.0-350-g6d29003ea
Platform: macOS ARM
Default target: arm64-apple-darwin21.3.0
$ python --version
Python 3.10.2

Thanks @czgdp1807 for implementing lists in our LLVM backend (https://github.com/lcompilers/lpython/pull/835)! This is just a first implementation, but I already like the results. :)

certik commented 2 years ago

I think we currently do not deallocate at the end, so one can also benchmark just the append cycle:

#include <vector>
#include <iostream>
#include <chrono>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << a[n] << std::endl;
    std::cout << "Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
    return 0;
}

and timing:

$ clang++ -std=c++17 -Ofast a.cpp
$ time ./a.out
100000000
Time: 153
./a.out  0.07s user 0.10s system 93% cpu 0.177 total

So I think our benchmark above is probably quite solid, LPython seems faster on my computer.

certik commented 2 years ago

I also tried Ubuntu 18.04 on Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz:

Compiler Time [s] Relative
LPython 0.3.0-350-g6d29003ea --fast 0.202 1.0
LPython 0.3.0-350-g6d29003ea 0.456 2.26
g++ 7.5.0 -O3 -march=native -funroll-loops 0.600 2.97
g++ 7.5.0 2.453 12.14
g++ 12.1.0 -O3 -march=native -funroll-loops 0.654 3.24
g++ 12.1.0 5.026 24.88
Clang++ 14.0.6 -O3 -march=native -funroll-loops 0.683 3.38
Python 3.10.2 7.443 36.85
czgdp1807 commented 2 years ago

Let's try without fast mode? May be that will show if the speed up is due to our implementation or LLVM's optimisation algorithms.

certik commented 2 years ago

I added results without --fast for most compilers. The main speedup seems to be from how we (you!) implemented lists.

czgdp1807 commented 2 years ago

Great. Makes sense. If we are beating other compilers without fast mode then we did the right thing for lists.

czgdp1807 commented 2 years ago

I also tried Ubuntu 18.04.6 on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:

Compiler Time [s] Relative
LPython (list05) 0.318 1.67
LPython 6d29003ea00e5dec4ee54 0.318 1.67
clang++ 6.0.0-1ubuntu2 1.755 9.23
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 1.800 9.47
LPython (list05) --fast 0.183 0.96
LPython 6d29003ea00e5dec4ee54 --fast 0.190 1.0
clang++ 6.0.0-1ubuntu2 -O3 -march=native -funroll-loops 0.523 2.75
g++ 7.5.0 -O3 -march=native -funroll-loops 0.513 2.7
Python 3.10.5 6.737 35.45

From Apple M1 Macbook Pro macOS Monterey 12.5

Compiler Time [s] Relative
LPython (list_ret) 0.297 2.63
LPython (list05) 0.302 2.67
LPython main 0.302 2.67
LPython (list04) 0.358 3.16
clang++ 11.1.0 arm64-apple-darwin21.6.0 3.021 26.73
LPython (list_ret) -- fast 0.106 0.94
LPython (list05) --fast 0.113 1.0
LPython main --fast 0.113 1.0
clang++ 11.1.0 arm64-apple-darwin21.6.0 -O3 -march=native -funroll-loops 0.183 1.62
Python 3.10.5 5.374 47.56
codon 0.15.5 (codon build -release -exe and time ./executable) 0.531 4.7

First 3 are normal mode compilation, second 3 are optimizations enabled and the last one is Python. So, all in all top is lpython --fast, second is lpython and then the rest.

Shaikh-Ubaid commented 2 years ago

Please, could someone possibly share how we are computing the Relative value? Like Relative with respect to?

czgdp1807 commented 2 years ago

Dividing with the smallest time from across all the results you have computed on your machine.

namannimmo10 commented 2 years ago

Benchmark on Apple M1 Air 2020 (Monterey):

Compiler Time [s] Relative
lpython --fast 0.074 1.0
lpython 0.284 3.837
g++ 2.786 37.648
clang++ 2.764 37.351
python 3.10.4 5.230 70.67
Shaikh-Ubaid commented 2 years ago

From the output of time a.out, which parameter from real, user, system do we need to consider/note?

czgdp1807 commented 2 years ago

real parameter.

Thirumalai-Shaktivel commented 2 years ago
Result on Intel® Core™ i5-8250U CPU @ 1.60GHz × 8 (OS: Ubuntu 22.04 LTS) Compiler Time [s] Relative
lpython --fast 0.184 1.0
lpython 0.356 1.934
g++ -O3 -march=native 0.574 3.119
clang++ -O3 -march=native 0.578 3.141
g++ 4.335 23.559
clang++ 2.079 11.298
python 3.10.4 11.506 62.53
Smit-create commented 2 years ago

Machine: Mac Air M1(2020)

Compiler Time [s] Relative Version
lpython --fast 0.08 1.0 0.3.0-233-g9ae282a29-dirty
g++ 0.08 1.0 11.3.0
clang++ 0.08 1.0 12.0.5
lpython 0.26 3.25 0.3.0-233-g9ae282a29-dirty
python 7.83 97.87 3.10.2
Shaikh-Ubaid commented 2 years ago
Result on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @1.600 GHz, Ubuntu 20.04.4 LTS Compiler Time [s] Relative
lpython --fast 0.194 1.00
lpython 0.427 2.20
g++ -O3 -march=native 0.481 2.48
clang++ -O3 -march=native 0.485 2.50
g++ 2.0403 10.52
clang++ 2.13 10.98
python 3.9.7 8.975 46.26
LPython version: 0.3.0-350-g6d29003ea
Platform: Linux
Default target: x86_64-unknown-linux-gnu

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Python 3.9.7
certik commented 2 years ago

Numba benchmark:

from timeit import default_timer as clock
from numba import njit

@njit(nogil=True, cache=True)
def test_list():
    a = [0, 1, 2, 3, 4]
    n = 100000000
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

test_list()

test_list()

t1 = clock()
test_list()
t2 = clock()

print(t2-t1)

On my computer:

$ python b.py
100000000
100000000
100000000
100000000
0.2843555830186233

So it takes 0.28s.