lcompilers / lpython

Python compiler
https://lpython.org/
Other
1.5k stars 158 forks source link

Benchmark for simple programs #520

Open certik opened 2 years ago

certik commented 2 years ago

Benchmark:

from ltypes import i32

def is_prime(n: i32) -> bool:
    factors: i32
    factors = 0
    i: i32
    for i in range(2, n):
        if n % i == 0:
            factors += 1
    return factors == 0

def main():
    limit: i32
    limit = 20000
    total: i32
    total = 0
    i: i32
    for i in range(2, limit):
        if is_prime(i):
            total += 1
    print(total)

CPython 3.10.2 speed on Apple M1 Max:

$ time PYTHONPATH=~/repos/lpython/src/runtime/ltypes python a.py
2262
PYTHONPATH=~/repos/lpython/src/runtime/ltypes python a.py  5.82s user 0.01s system 99% cpu 5.841 total

LPython speed:

$ lpython --fast a.py
$ time ./a.out
2262
./a.out  0.10s user 0.00s system 95% cpu 0.104 total

So about 55x speedup.

certik commented 2 years ago

But we can do better --- the modulo operation n % i is not optimal, it is currently using our Python implementation. The LLVM backend should intercept it and provide an optimize implementation. We might want to possibly move it into ASR itself.

certik commented 2 years ago

Here is a C implementation:

#include <stdint.h>
#include <stdio.h>

int32_t is_prime(int32_t n) {
    int32_t factors = 0;
    for (int32_t i=2; i < n; i++) {
        if (n % i == 0) factors += 1;
    }
    return factors == 0;
}

int main() {
    int32_t limit = 20000;
    int32_t total = 0;
    for (int32_t i=2; i < limit; i++) {
        if (is_prime(i)) total += 1;
    }
    printf("%d\n", total);
}

On my computer:

$ clang -O3 a.c
$ time ./a.out
2262
./a.out  0.13s user 0.00s system 98% cpu 0.137 total

It's slower than LPython, so perhaps something is not quite right yet.

Thirumalai-Shaktivel commented 2 years ago

Benchmark using Intel i5: Linux - Ubuntu 22

CPython 3.10.4:

$ time python examples/expr2.py
2262
python examples/expr2.py  21.42s user 0.02s system 99% cpu 21.457 total

LPython with --fast (3e0e0eca49001139fa68f58d9d444412a208d464)

$ lpython --fast examples/expr2.py 
$ time ./a.out
2262
./a.out  0.57s user 0.00s system 99% cpu 0.573 total

About 38x speedup than python


LPython without --fast

$ lpython examples/expr2.py 
$ time ./a.out
2262
./a.out  3.61s user 0.00s system 99% cpu 3.606 total

About 6x speedup than python


Clang:

$ clang -O3 -march=native a.c
$ time ./a.out
2262
./a.out  0.92s user 0.00s system 99% cpu 0.929 total
certik commented 2 years ago

Ok, so LPython is faster than Clang for you as well. I think the C code is probably not optimal, or somehow our hand written modulo is faster than the one Clang generates. We should look into the llvm to understand better.

But overall not a bad start.

Smit-create commented 2 years ago

Here are the results on my machine (Apple MAC M1): 1.

% time PYTHONPATH=~/repos/lpython/src/runtime/ltypes python expr.py
2262
PYTHONPATH=~/repos/lpython/src/runtime/ltypes python expr.py  11.28s user 0.04s system 99% cpu 11.332 total

2.

% lpython --fast expr.py 
% time ./a.out
2262
./a.out  0.20s user 0.01s system 97% cpu 0.216 total

3.

% clang -O3 expr.c -o. expr.out
% time ./expr.out
2262
./expr.out  0.16s user 0.00s system 98% cpu 0.169 total
Thirumalai-Shaktivel commented 2 years ago

Here is the quick sort benchmark:

from ltypes import i32, f32
from numpy import empty
import random

def _swap(arr: f32[500000], i: i32, j: i32):
    tmp: f32
    tmp = arr[i]
    arr[i] = arr[j]
    arr[j] = tmp

def Partition(arr: f32[500000], low: i32, high: i32) -> i32:
    i: i32
    i = low - 1
    pivot: f32
    pivot = arr[high]
    j: i32
    for j in range(low, high):
        if (arr[j] <= pivot):
            i += 1
            _swap(arr, i, j)
    _swap(arr, high, i+1)
    return i+1

def QuickSort(arr: f32[500000], low: i32, high: i32):
    if (low < high):
        pivot: i32
        pivot = Partition(arr, low, high)
        QuickSort(arr, low, pivot-1)
        QuickSort(arr, pivot+1, high)

def main():
    arr: f32[500000]
    arr = empty(500000)
    i: i32
    for i in range(500000):
        arr[i] = random.random()
    QuickSort(arr, 0, 500000-1)
    print("Array sorted using Quick Sort: ")
    for i in range(500000):
        print(arr[i])

main()

Equivalent C code:

#include<stdio.h>
#include <stdlib.h>
#include <time.h>

void swap (float* a, float* b)
{
    float tmp = *a;
    *a = *b;
    *b = tmp;
}

int Partition(float arr[], int low, int high){
    int i = low - 1;
    float pivot = arr[high];
    for(int j = low; j < high; j++) {
        if(arr[j] <= pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[high], &arr[i+1]);
    return i+1;
}

void QuickSort(float arr[], int low, int high){
    if(low < high) {
        int pivot = Partition(arr, low, high);
        QuickSort(arr, low, pivot-1);
        QuickSort(arr, pivot+1, high);
    }
}

int main() {
    srand(time(0));
    int size = 500000;
    float arr[size];
    for (int i = 0; i < size; i++)
        arr[i] = (rand() / (double) RAND_MAX);
    QuickSort(arr, 0, size-1);
    printf("Array sorted using Quick Sort: ");
    for (int i = 0; i < size; i++)
        printf("%f ", arr[i]);
    printf("\n");
    return 0;
}

CPython:

$ time python examples/expr2.py > tmp
python examples/expr2.py > tmp  10.00s user 0.60s system 108% cpu 9.729 total

LPython:

$ lpython examples/expr2.py; time ./a.out > tmp
./a.out > tmp  0.86s user 2.10s system 99% cpu 2.961 total

LPython with fast:

$ lpython --fast examples/expr2.py
Internal Compiler Error: Unhandled exception
Traceback (most recent call last):
  Binary file "/home/thirumalai/Open_Source/lpython/src/bin/lpython", in _start()
  File "./csu/../csu/libc-start.c", line 392, in __libc_start_main_impl()
  File "./csu/../sysdeps/nptl/libc_start_call_main.h", line 58, in __libc_start_call_main()
  File "/home/thirumalai/Open_Source/lpython/src/bin/lpython.cpp", line 879, in ??
    err = compile_python_to_object_file(arg_file, tmp_o, runtime_library_dir, compiler_options, time_report);
  File "/home/thirumalai/Open_Source/lpython/src/bin/lpython.cpp", line 373, in ??
    res = fe.get_llvm3(*asr, diagnostics);
  File "/home/thirumalai/Open_Source/lpython/src/lpython/python_evaluator.cpp", line 57, in LFortran::PythonCompiler::get_llvm3(LFortran::ASR::TranslationUnit_t&, LFortran::diag::Diagnostics&)
    run_fn);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/codegen/asr_to_llvm.cpp", line 4762, in LFortran::asr_to_llvm(LFortran::ASR::TranslationUnit_t&, LFortran::diag::Diagnostics&, llvm::LLVMContext&, Allocator&, LFortran::Platform, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    pass_inline_function_calls(al, asr, rl_path);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/pass/inline_function_calls.cpp", line 394, in LFortran::pass_inline_function_calls(Allocator&, LFortran::ASR::TranslationUnit_t&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
    LFORTRAN_ASSERT(asr_verify(unit));
  File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 533, in LFortran::asr_verify(LFortran::ASR::TranslationUnit_t const&, bool)
    v.visit_TranslationUnit(unit);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 111, in LFortran::ASR::VerifyVisitor::visit_TranslationUnit(LFortran::ASR::TranslationUnit_t const&)
    this->visit_symbol(*a.second);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3365, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_symbol(LFortran::ASR::symbol_t const&)
    void visit_symbol(const symbol_t &b) { visit_symbol_t(b, self()); }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3166, in ??
    case symbolType::Function: { v.visit_Function((const Function_t &)x); return; }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 280, in LFortran::ASR::VerifyVisitor::visit_Function(LFortran::ASR::Function_t const&)
    visit_stmt(*x.m_body[i]);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3379, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_stmt(LFortran::ASR::stmt_t const&)
    void visit_stmt(const stmt_t &b) { visit_stmt_t(b, self()); }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3185, in ??
    case stmtType::Assignment: { v.visit_Assignment((const Assignment_t &)x); return; }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3615, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_Assignment(LFortran::ASR::Assignment_t const&)
    self().visit_expr(*x.m_value);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
    void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3268, in ??
    case exprType::ArrayRef: { v.visit_ArrayRef((const ArrayRef_t &)x); return; }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 365, in LFortran::ASR::VerifyVisitor::visit_ArrayRef(LFortran::ASR::ArrayRef_t const&)
    visit_array_index(x.m_args[i]);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 4342, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_array_index(LFortran::ASR::array_index_t const&)
    self().visit_expr(*x.m_right);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
    void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3232, in ??
    case exprType::BinOp: { v.visit_BinOp((const BinOp_t &)x); return; }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3898, in LFortran::ASR::BaseWalkVisitor<LFortran::ASR::VerifyVisitor>::visit_BinOp(LFortran::ASR::BinOp_t const&)
    self().visit_expr(*x.m_left);
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3421, in LFortran::ASR::BaseVisitor<LFortran::ASR::VerifyVisitor>::visit_expr(LFortran::ASR::expr_t const&)
    void visit_expr(const expr_t &b) { visit_expr_t(b, self()); }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/../libasr/asr.h", line 3267, in ??
    case exprType::Var: { v.visit_Var((const Var_t &)x); return; }
  File "/home/thirumalai/Open_Source/lpython/src/libasr/asr_verify.cpp", line 357, in LFortran::ASR::VerifyVisitor::visit_Var(LFortran::ASR::Var_t const&)
    require(symtab_in_scope(current_symtab, x.m_v),
LFortranException: ASR verify failed: Var::m_v `high_Partition` cannot point outside of its symbol table

Clang:

$ clang -O3 -march=native quickSort.c; time ./a.out > tmp
./a.out > tmp  0.31s user 0.04s system 99% cpu 0.348 total
certik commented 2 years ago

Perfect! Can you comment out the pass_inline_function_calls(al, asr, rl_path); line in asr_to_llvm? That should make it work. The inline is probably trying to inline a recursive procedure, which won't work.

Thirumalai-Shaktivel commented 2 years ago

Yup, thanks for that. It resolves the above issue, but I get core dumped.

$ lpython --fast examples/expr2.py; time ./a.out > tmp
[2]    330898 segmentation fault (core dumped)  ./a.out > tmp
./a.out > tmp  0.78s user 1.91s system 93% cpu 2.894 total
$ lpython examples/expr2.py; time ./a.out > tmp 
./a.out > tmp  0.87s user 1.85s system 99% cpu 2.724 total
certik commented 2 years ago

That's another bug that we have to investigate and fix. You can comment all the optimizations for --fast and see if it works or not. If it still fails, it might be a bug in LLVM (possible, but unlikely). If it works, then you can "bisect" to figure out what ASR optimization causes that.

Thirumalai-Shaktivel commented 2 years ago

Core dumped is caused by pass_loop_unroll(al, asr, rl_path) in asr_to_llvm

certik commented 2 years ago

Perfect. What is the performance of the --fast version compared to Clang?

czgdp1807 commented 2 years ago

The LLVM backend should intercept it and provide an optimize implementation

LLVM already provides its remainder instruction for integers. See URem and SRem.