List append benchmark - Githubissues

certik commented 2 years ago

Here is a simple benchmark for appending to a list in Python:

from ltypes import i32

def test_list():
    a: list[i32] = [0, 1, 2, 3, 4]
    n: i32 = 100000000
    i: i32
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

and C++:

#include <vector>
#include <iostream>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    std::cout << a[n] << std::endl;
    return 0;
}

Results on Apple M1 Max (I ran each benchmark many times, took the lowest numbers):

$ clang++ -std=c++17 -O3 a.cpp
$ time ./a.out
100000000
./a.out  0.07s user 0.09s system 94% cpu 0.170 total
$ g++ -O3 -march=native  a.cpp
$ time ./a.out
100000000
./a.out  0.06s user 0.09s system 93% cpu 0.164 total
$ lpython --fast a.py
$ time ./a.out
100000000
./a.out  0.05s user 0.03s system 93% cpu 0.089 total
$ time PYTHONPATH=../src/runtime/ltypes python a.py                 
100000000
PYTHONPATH=../src/runtime/ltypes python a.py  4.21s user 0.51s system 99% cpu 4.730 total

Compiler	Time	Relative
LPython	0.089	1.0
g++	0.164	1.84
Clang++	0.170	1.91
Python	4.730	53.15

Versions:

$ clang++ --version
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: arm64-apple-darwin21.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
$ g++ --version              
g++ (Spack GCC) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lpython --version
LPython version: 0.3.0-350-g6d29003ea
Platform: macOS ARM
Default target: arm64-apple-darwin21.3.0
$ python --version
Python 3.10.2

Thanks @czgdp1807 for implementing lists in our LLVM backend (https://github.com/lcompilers/lpython/pull/835)! This is just a first implementation, but I already like the results. :)

certik commented 2 years ago

I think we currently do not deallocate at the end, so one can also benchmark just the append cycle:

#include <vector>
#include <iostream>
#include <chrono>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << a[n] << std::endl;
    std::cout << "Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
    return 0;
}

and timing:

$ clang++ -std=c++17 -Ofast a.cpp
$ time ./a.out
100000000
Time: 153
./a.out  0.07s user 0.10s system 93% cpu 0.177 total

So I think our benchmark above is probably quite solid, LPython seems faster on my computer.

certik commented 2 years ago

I also tried Ubuntu 18.04 on Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz:

Compiler	Time [s]	Relative
LPython 0.3.0-350-g6d29003ea --fast	0.202	1.0
LPython 0.3.0-350-g6d29003ea	0.456	2.26
g++ 7.5.0 -O3 -march=native -funroll-loops	0.600	2.97
g++ 7.5.0	2.453	12.14
g++ 12.1.0 -O3 -march=native -funroll-loops	0.654	3.24
g++ 12.1.0	5.026	24.88
Clang++ 14.0.6 -O3 -march=native -funroll-loops	0.683	3.38
Python 3.10.2	7.443	36.85

czgdp1807 commented 2 years ago

Let's try without fast mode? May be that will show if the speed up is due to our implementation or LLVM's optimisation algorithms.

certik commented 2 years ago

I added results without --fast for most compilers. The main speedup seems to be from how we (you!) implemented lists.

czgdp1807 commented 2 years ago

Great. Makes sense. If we are beating other compilers without fast mode then we did the right thing for lists.

czgdp1807 commented 2 years ago

I also tried Ubuntu 18.04.6 on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:

Compiler	Time [s]	Relative
LPython (list05)	0.318	1.67
LPython 6d29003ea00e5dec4ee54	0.318	1.67
clang++ 6.0.0-1ubuntu2	1.755	9.23
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0	1.800	9.47
LPython (list05) --fast	0.183	0.96
LPython 6d29003ea00e5dec4ee54 --fast	0.190	1.0
clang++ 6.0.0-1ubuntu2 -O3 -march=native -funroll-loops	0.523	2.75
g++ 7.5.0 -O3 -march=native -funroll-loops	0.513	2.7
Python 3.10.5	6.737	35.45

From Apple M1 Macbook Pro macOS Monterey 12.5

Compiler	Time [s]	Relative
LPython (list_ret)	0.297	2.63
LPython (list05)	0.302	2.67
LPython main	0.302	2.67
LPython (list04)	0.358	3.16
clang++ 11.1.0 arm64-apple-darwin21.6.0	3.021	26.73
LPython (list_ret) -- fast	0.106	0.94
LPython (list05) --fast	0.113	1.0
LPython main --fast	0.113	1.0
clang++ 11.1.0 arm64-apple-darwin21.6.0 -O3 -march=native -funroll-loops	0.183	1.62
Python 3.10.5	5.374	47.56
codon 0.15.5 (`codon build -release -exe` and `time ./executable`)	0.531	4.7

First 3 are normal mode compilation, second 3 are optimizations enabled and the last one is Python. So, all in all top is lpython --fast, second is lpython and then the rest.

Shaikh-Ubaid commented 2 years ago

Please, could someone possibly share how we are computing the Relative value? Like Relative with respect to?

czgdp1807 commented 2 years ago

Dividing with the smallest time from across all the results you have computed on your machine.

namannimmo10 commented 2 years ago

Benchmark on Apple M1 Air 2020 (Monterey):

Compiler	Time [s]	Relative
lpython --fast	0.074	1.0
lpython	0.284	3.837
g++	2.786	37.648
clang++	2.764	37.351
python 3.10.4	5.230	70.67

Shaikh-Ubaid commented 2 years ago

From the output of time a.out, which parameter from real, user, system do we need to consider/note?

czgdp1807 commented 2 years ago

real parameter.

Thirumalai-Shaktivel commented 2 years ago

Result on Intel® Core™ i5-8250U CPU @ 1.60GHz × 8 (OS: Ubuntu 22.04 LTS) Compiler	Time [s]	Relative
lpython --fast	0.184	1.0
lpython	0.356	1.934
g++ -O3 -march=native	0.574	3.119
clang++ -O3 -march=native	0.578	3.141
g++	4.335	23.559
clang++	2.079	11.298
python 3.10.4	11.506	62.53

Smit-create commented 2 years ago

Machine: Mac Air M1(2020)

Compiler	Time [s]	Relative	Version
lpython --fast	0.08	1.0	0.3.0-233-g9ae282a29-dirty
g++	0.08	1.0	11.3.0
clang++	0.08	1.0	12.0.5
lpython	0.26	3.25	0.3.0-233-g9ae282a29-dirty
python	7.83	97.87	3.10.2

Shaikh-Ubaid commented 2 years ago

Result on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @1.600 GHz, Ubuntu 20.04.4 LTS Compiler	Time [s]	Relative
lpython --fast	0.194	1.00
lpython	0.427	2.20
g++ -O3 -march=native	0.481	2.48
clang++ -O3 -march=native	0.485	2.50
g++	2.0403	10.52
clang++	2.13	10.98
python 3.9.7	8.975	46.26

LPython version: 0.3.0-350-g6d29003ea
Platform: Linux
Default target: x86_64-unknown-linux-gnu

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Python 3.9.7

certik commented 2 years ago

Numba benchmark:

from timeit import default_timer as clock
from numba import njit

@njit(nogil=True, cache=True)
def test_list():
    a = [0, 1, 2, 3, 4]
    n = 100000000
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()

test_list()

test_list()

t1 = clock()
test_list()
t2 = clock()

print(t2-t1)

On my computer:

$ python b.py
100000000
100000000
100000000
100000000
0.2843555830186233

So it takes 0.28s.

lcompilers / lpython

List append benchmark #857