certik opened this issue 2 years ago
I think we currently do not deallocate at the end, so one can also benchmark just the append cycle:
```cpp
#include <cstdint>
#include <vector>
#include <iostream>
#include <chrono>

int main() {
    std::vector<int32_t> a = {0, 1, 2, 3, 4};
    int32_t n = 100000000;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int32_t i = 0; i < n; i++) {
        a.push_back(i + 5);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << a[n] << std::endl;
    std::cout << "Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
    return 0;
}
```
and timing:
```
$ clang++ -std=c++17 -Ofast a.cpp
$ time ./a.out
100000000
Time: 153
./a.out 0.07s user 0.10s system 93% cpu 0.177 total
```
So I think our benchmark above is probably quite solid; LPython seems faster on my computer.
I also tried Ubuntu 18.04 on Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz:
Compiler | Time [s] | Relative |
---|---|---|
LPython 0.3.0-350-g6d29003ea --fast | 0.202 | 1.0 |
LPython 0.3.0-350-g6d29003ea | 0.456 | 2.26 |
g++ 7.5.0 -O3 -march=native -funroll-loops | 0.600 | 2.97 |
g++ 7.5.0 | 2.453 | 12.14 |
g++ 12.1.0 -O3 -march=native -funroll-loops | 0.654 | 3.24 |
g++ 12.1.0 | 5.026 | 24.88 |
Clang++ 14.0.6 -O3 -march=native -funroll-loops | 0.683 | 3.38 |
Python 3.10.2 | 7.443 | 36.85 |
Let's try without `--fast` mode? Maybe that will show whether the speedup is due to our implementation or to LLVM's optimisation algorithms.
I added results without `--fast` for most compilers. The main speedup seems to be from how we (you!) implemented lists.
Great. Makes sense. If we are beating other compilers without `--fast` mode, then we did the right thing for lists.
I also tried Ubuntu 18.04.6 on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:
Compiler | Time [s] | Relative |
---|---|---|
LPython (list05) | 0.318 | 1.67 |
LPython 6d29003ea00e5dec4ee54 | 0.318 | 1.67 |
clang++ 6.0.0-1ubuntu2 | 1.755 | 9.23 |
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 | 1.800 | 9.47 |
LPython (list05) --fast | 0.183 | 0.96 |
LPython 6d29003ea00e5dec4ee54 --fast | 0.190 | 1.0 |
clang++ 6.0.0-1ubuntu2 -O3 -march=native -funroll-loops | 0.523 | 2.75 |
g++ 7.5.0 -O3 -march=native -funroll-loops | 0.513 | 2.7 |
Python 3.10.5 | 6.737 | 35.45 |
From an Apple M1 MacBook Pro, macOS Monterey 12.5:
Compiler | Time [s] | Relative |
---|---|---|
LPython (list_ret) | 0.297 | 2.63 |
LPython (list05) | 0.302 | 2.67 |
LPython main | 0.302 | 2.67 |
LPython (list04) | 0.358 | 3.16 |
clang++ 11.1.0 arm64-apple-darwin21.6.0 | 3.021 | 26.73 |
LPython (list_ret) --fast | 0.106 | 0.94 |
LPython (list05) --fast | 0.113 | 1.0 |
LPython main --fast | 0.113 | 1.0 |
clang++ 11.1.0 arm64-apple-darwin21.6.0 -O3 -march=native -funroll-loops | 0.183 | 1.62 |
Python 3.10.5 | 5.374 | 47.56 |
codon 0.15.5 (`codon build -release -exe` and `time ./executable`) | 0.531 | 4.7 |
The first 3 are normal-mode compilation, the second 3 have optimizations enabled, and the last one is Python. So, all in all, the top is `lpython --fast`, second is `lpython`, and then the rest.
Please, could someone share how we are computing the `Relative` value? Relative with respect to what?
By dividing by the smallest time across all the results you have computed on your machine.
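For illustration, here is a minimal sketch of that calculation using the numbers from the Xeon table above (the `times` dictionary is just for this example, not part of any benchmark script):

```python
# Reproduce the "Relative" column: divide every time by the smallest measured time.
times = {
    "LPython 0.3.0-350-g6d29003ea --fast": 0.202,
    "LPython 0.3.0-350-g6d29003ea": 0.456,
    "g++ 7.5.0 -O3 -march=native -funroll-loops": 0.600,
    "Python 3.10.2": 7.443,
}

fastest = min(times.values())  # 0.202 s in this example
for compiler, t in times.items():
    print(f"{compiler} | {t:.3f} | {t / fastest:.2f}")
```

This prints relative values of 1.00, 2.26, 2.97, and 36.85, matching the table.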
Benchmark on Apple M1 Air 2020 (Monterey):
Compiler | Time [s] | Relative |
---|---|---|
lpython --fast | 0.074 | 1.0 |
lpython | 0.284 | 3.837 |
g++ | 2.786 | 37.648 |
clang++ | 2.764 | 37.351 |
python 3.10.4 | 5.230 | 70.67 |
From the output of `time a.out`, which of `real`, `user`, and `system` do we need to consider/note?
The `real` parameter (wall-clock time).
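For context, a small Python sketch (not from this thread) illustrating the difference: `real` is wall-clock time, while `user`/`sys` roughly correspond to CPU time spent by the process:

```python
import time

# Compare wall-clock time (what `real` reports) with process CPU time
# (roughly what `user` + `sys` report) for the same piece of work.
t_wall = time.perf_counter()
t_cpu = time.process_time()

total = sum(range(10_000_000))  # some CPU-bound work

print("wall clock (like `real`):", time.perf_counter() - t_wall)
print("CPU time (like `user`+`sys`):", time.process_time() - t_cpu)
```

For a single-threaded, CPU-bound benchmark the two are close; they diverge when the process sleeps, waits on I/O, or uses multiple threads.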
Result on Intel® Core™ i5-8250U CPU @ 1.60GHz × 8 (OS: Ubuntu 22.04 LTS):

Compiler | Time [s] | Relative |
---|---|---|
lpython --fast | 0.184 | 1.0 |
lpython | 0.356 | 1.934 |
g++ -O3 -march=native | 0.574 | 3.119 |
clang++ -O3 -march=native | 0.578 | 3.141 |
g++ | 4.335 | 23.559 |
clang++ | 2.079 | 11.298 |
python 3.10.4 | 11.506 | 62.53 |
Machine: MacBook Air M1 (2020)
Compiler | Time [s] | Relative | Version |
---|---|---|---|
lpython --fast | 0.08 | 1.0 | 0.3.0-233-g9ae282a29-dirty |
g++ | 0.08 | 1.0 | 11.3.0 |
clang++ | 0.08 | 1.0 | 12.0.5 |
lpython | 0.26 | 3.25 | 0.3.0-233-g9ae282a29-dirty |
python | 7.83 | 97.87 | 3.10.2 |
Result on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @ 1.600 GHz, Ubuntu 20.04.4 LTS:

Compiler | Time [s] | Relative |
---|---|---|
lpython --fast | 0.194 | 1.00 |
lpython | 0.427 | 2.20 |
g++ -O3 -march=native | 0.481 | 2.48 |
clang++ -O3 -march=native | 0.485 | 2.50 |
g++ | 2.0403 | 10.52 |
clang++ | 2.13 | 10.98 |
python 3.9.7 | 8.975 | 46.26 |
Versions:

```
LPython version: 0.3.0-350-g6d29003ea
Platform: Linux
Default target: x86_64-unknown-linux-gnu

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Python 3.9.7
```
Numba benchmark:
```python
from timeit import default_timer as clock
from numba import njit

@njit(nogil=True, cache=True)
def test_list():
    a = [0, 1, 2, 3, 4]
    n = 100000000
    for i in range(n):
        a.append(i + 5)
    print(a[n])

test_list()
test_list()
test_list()
t1 = clock()
test_list()
t2 = clock()
print(t2 - t1)
```
On my computer:
```
$ python b.py
100000000
100000000
100000000
100000000
0.2843555830186233
```
So it takes 0.28s.
Here is a simple benchmark for appending to a list in Python:
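The original snippet is not reproduced here, but based on the Numba variant above, a plain-Python version of the same append benchmark would look roughly like this (a sketch, not necessarily the exact code from the issue):

```python
from timeit import default_timer as clock

def test_list():
    # Start from a small list and append n more integers.
    a = [0, 1, 2, 3, 4]
    n = 100000000
    for i in range(n):
        a.append(i + 5)
    print(a[n])

t1 = clock()
test_list()
t2 = clock()
print("Time:", t2 - t1)
```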
and C++:
Results on Apple M1 Max (I ran each benchmark many times, took the lowest numbers):
Versions:
Thanks @czgdp1807 for implementing lists in our LLVM backend (https://github.com/lcompilers/lpython/pull/835)! This is just a first implementation, but I already like the results. :)