Open certik opened 2 years ago
But we can do better --- the modulo operation n % i
is not optimal, it is currently using our Python implementation. The LLVM backend should intercept it and provide an optimize implementation. We might want to possibly move it into ASR itself.
Here is a C implementation:
#include <stdint.h>
#include <stdio.h>
int32_t is_prime(int32_t n) {
int32_t factors = 0;
for (int32_t i=2; i < n; i++) {
if (n % i == 0) factors += 1;
}
return factors == 0;
}
int main() {
int32_t limit = 20000;
int32_t total = 0;
for (int32_t i=2; i < limit; i++) {
if (is_prime(i)) total += 1;
}
printf("%d\n", total);
}
On my computer:
$ clang -O3 a.c
$ time ./a.out
2262
./a.out 0.13s user 0.00s system 98% cpu 0.137 total
It's slower than LPython, so perhaps something is not quite right yet.
CPython 3.10.4:
$ time python examples/expr2.py
2262
python examples/expr2.py 21.42s user 0.02s system 99% cpu 21.457 total
LPython with --fast
(3e0e0eca49001139fa68f58d9d444412a208d464)
$ lpython --fast examples/expr2.py
$ time ./a.out
2262
./a.out 0.57s user 0.00s system 99% cpu 0.573 total
About 38x
speedup than python
LPython without --fast
$ lpython examples/expr2.py
$ time ./a.out
2262
./a.out 3.61s user 0.00s system 99% cpu 3.606 total
About 6x
speedup than python
Clang:
$ clang -O3 -march=native a.c
$ time ./a.out
2262
./a.out 0.92s user 0.00s system 99% cpu 0.929 total
Ok, so LPython is faster than Clang for you as well. I think the C code is probably not optimal, or somehow our hand written modulo is faster than the one Clang generates. We should look into the llvm to understand better.
But overall not a bad start.
Here are the results on my machine (Apple MAC M1): 1.
% time PYTHONPATH=~/repos/lpython/src/runtime/ltypes python expr.py
2262
PYTHONPATH=~/repos/lpython/src/runtime/ltypes python expr.py 11.28s user 0.04s system 99% cpu 11.332 total
2.
% lpython --fast expr.py
% time ./a.out
2262
./a.out 0.20s user 0.01s system 97% cpu 0.216 total
3.
% clang -O3 expr.c -o. expr.out
% time ./expr.out
2262
./expr.out 0.16s user 0.00s system 98% cpu 0.169 total
Here is the quick sort benchmark:
from ltypes import i32, f32
from numpy import empty
import random
def _swap(arr: f32[500000], i: i32, j: i32):
tmp: f32
tmp = arr[i]
arr[i] = arr[j]
arr[j] = tmp
def Partition(arr: f32[500000], low: i32, high: i32) -> i32:
i: i32
i = low - 1
pivot: f32
pivot = arr[high]
j: i32
for j in range(low, high):
if (arr[j] <= pivot):
i += 1
_swap(arr, i, j)
_swap(arr, high, i+1)
return i+1
def QuickSort(arr: f32[500000], low: i32, high: i32):
if (low < high):
pivot: i32
pivot = Partition(arr, low, high)
QuickSort(arr, low, pivot-1)
QuickSort(arr, pivot+1, high)
def main():
arr: f32[500000]
arr = empty(500000)
i: i32
for i in range(500000):
arr[i] = random.random()
QuickSort(arr, 0, 500000-1)
print("Array sorted using Quick Sort: ")
for i in range(500000):
print(arr[i])
main()
Equivalent C code:
#include<stdio.h>
#include <stdlib.h>
#include <time.h>
void swap (float* a, float* b)
{
float tmp = *a;
*a = *b;
*b = tmp;
}
int Partition(float arr[], int low, int high){
int i = low - 1;
float pivot = arr[high];
for(int j = low; j < high; j++) {
if(arr[j] <= pivot) {
i++;
swap(&arr[i], &arr[j]);
}
}
swap(&arr[high], &arr[i+1]);
return i+1;
}
void QuickSort(float arr[], int low, int high){
if(low < high) {
int pivot = Partition(arr, low, high);
QuickSort(arr, low, pivot-1);
QuickSort(arr, pivot+1, high);
}
}
int main() {
srand(time(0));
int size = 500000;
float arr[size];
for (int i = 0; i < size; i++)
arr[i] = (rand() / (double) RAND_MAX);
QuickSort(arr, 0, size-1);
printf("Array sorted using Quick Sort: ");
for (int i = 0; i < size; i++)
printf("%f ", arr[i]);
printf("\n");
return 0;
}
CPython:
$ time python examples/expr2.py > tmp
python examples/expr2.py > tmp 10.00s user 0.60s system 108% cpu 9.729 total
LPython:
$ lpython examples/expr2.py; time ./a.out > tmp
./a.out > tmp 0.86s user 2.10s system 99% cpu 2.961 total
LPython with fast
:
$ lpython --fast examples/expr2.py
Internal Compiler Error: Unhandled exception
Traceback (most recent call last):
Binary file "/home/thirumalai/Open_Source/lpython/src/bin/lpython", in _start()
File "./csu/../csu/libc-start.c", line 392, in __libc_start_main_impl()
File "./csu/../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main()
File "/home/thirumalai/Open_Source/lpython/src/bin/lpython.cpp", line 879, in ??
err = compile_python_to_object_file(arg_file, tmp_o, runtime_library_dir, compiler_options, time_report);
File "/home/thirumalai/Open_Source/lpython/src/bin/lpython.cpp", line 373, in ??
res = fe.get_llvm3(*asr, diagnostics);
File "/home/thirumalai/Open_Source/lpython/src/lpython/python_evaluator.cpp", line 57, in LFortran::PythonCompiler::get_llvm3(LFortran::ASR::TranslationUnit_t&, LFortran::diag::Diagnostics&)
run_fn);
File "/home/thirumalai/Open_Source/lpython/src/libasr/codegen/asr_to_llvm.cpp", line 4762, in LFortran::asr_to_llvm(LFortran::ASR::TranslationUnit_t&, LFortran::diag::Diagnostics&, llvm::LLVMContext&, Allocator&, LFortran::Platform, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
pass_inline_function_calls(al, asr, rl_path);
File "/home/thirumalai/Open_Source/lpython/src/libasr/pass/inline_function_calls.cpp", line 394, in LFortran::pass_inline_function_calls(Allocator&, LFortran::ASR::TranslationUnit_t&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
LFORTRAN_ASSERT(asr_verify(unit));
File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 533, in LFortran::asr_verify(LFortran::ASR::TranslationUnit_t const&, bool)
v.visit_TranslationUnit(unit);
File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 111, in LFortran::ASR::VerifyVisitor::visit_TranslationUnit(LFortran::ASR::TranslationUnit_t const&)
this->visit_symbol(*a.second);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3365, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_symbol(LFortran::ASR::symbol_t const&)
void visit_symbol(const symbol_t &b) { visit_symbol_t(b, self()); }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3166, in ??
case symbolType::Function: { v.visit_Function((const Function_t &)x); return; }
File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 280, in LFortran::ASR::VerifyVisitor::visit_Function(LFortran::ASR::Function_t const&)
visit_stmt(*x.m_body[i]);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3379, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_stmt(LFortran::ASR::stmt_t const&)
void visit_stmt(const stmt_t &b) { visit_stmt_t(b, self()); }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3185, in ??
case stmtType::Assignment: { v.visit_Assignment((const Assignment_t &)x); return; }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3615, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_Assignment(LFortran::ASR::Assignment_t const&)
self().visit_expr(*x.m_value);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3268, in ??
case exprType::ArrayRef: { v.visit_ArrayRef((const ArrayRef_t &)x); return; }
File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 365, in LFortran::ASR::VerifyVisitor::visit_ArrayRef(LFortran::ASR::ArrayRef_t const&)
visit_array_index(x.m_args[i]);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 4342, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_array_index(LFortran::ASR::array_index_t const&)
self().visit_expr(*x.m_right);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3232, in ??
case exprType::BinOp: { v.visit_BinOp((const BinOp_t &)x); return; }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3898, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_BinOp(LFortran::ASR::BinOp_t const&)
self().visit_expr(*x.m_left);
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3267, in ??
case exprType::Var: { v.visit_Var((const Var_t &)x); return; }
File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 357, in LFortran::ASR::VerifyVisitor::visit_Var(LFortran::ASR::Var_t const&)
require(symtab_in_scope(current_symtab, x.m_v),
LFortranException: ASR verify failed: Var::m_v `high_Partition` cannot point outside of its symbol table
Clang:
$ clang -O3 -march=native quickSort.c; time ./a.out > tmp
./a.out > tmp 0.31s user 0.04s system 99% cpu 0.348 total
Perfect! Can you comment out the pass_inline_function_calls(al, asr, rl_path);
line in asr_to_llvm
? That should make it work. The inline is probably trying to inline a recursive procedure, which won't work.
Yup, thanks for that. It resolves the above issue, but I get core dumped.
$ lpython --fast examples/expr2.py; time ./a.out > tmp
[2] 330898 segmentation fault (core dumped) ./a.out > tmp
./a.out > tmp 0.78s user 1.91s system 93% cpu 2.894 total
$ lpython examples/expr2.py; time ./a.out > tmp
./a.out > tmp 0.87s user 1.85s system 99% cpu 2.724 total
That's another bug that we have to investigate and fix. You can comment all the optimizations for --fast
and see if it works or not. If it still fails, it might be a bug in LLVM (possible, but unlikely). If it works, then you can "bisect" to figure out what ASR optimization causes that.
Core dumped is caused by pass_loop_unroll(al, asr, rl_path)
in asr_to_llvm
Perfect. What is the performance of the --fast
version compared to Clang?
Benchmark:
CPython 3.10.2 speed on Apple M1 Max:
LPython speed:
So about 55x speedup.