Primitive operator functions are not inlined

ChengCat commented 5 years ago

Compile the following program with dalec a.dt -s ir -O4 --no-dale-stdlib (Dale is compiled with LLVM 6.0.1),

(import cstdio)

(def main (fn extern-c int (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  (printf "%d\n" sum)
  0))

I get:

[...]

; Function Attrs: alwaysinline norecurse nounwind readonly
define weak_odr i32 @"_Z1$2bii"(i32 %a, i32 %b) local_unnamed_addr #0 {
entry:
  %0 = add nsw i32 %b, %a
  ret i32 %0
}

[...]

; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #2 {
entry:
  %0 = tail call i8 @"_Z1$3cii"(i32 0, i32 500000000)
  %1 = and i8 %0, 1
  %2 = icmp eq i8 %1, 0
  br i1 %2, label %_dale_internal_label_breaklabel_2, label %_dale_internal_label_continuelabel_1.preheader

_dale_internal_label_continuelabel_1.preheader:   ; preds = %entry
  br label %_dale_internal_label_continuelabel_1

_dale_internal_label_continuelabel_1:             ; preds = %_dale_internal_label_continuelabel_1.preheader, %_dale_internal_label_continuelabel_1
  %.03 = phi i32 [ %3, %_dale_internal_label_continuelabel_1 ], [ 0, %_dale_internal_label_continuelabel_1.preheader ]
  %.012 = phi i32 [ %4, %_dale_internal_label_continuelabel_1 ], [ 0, %_dale_internal_label_continuelabel_1.preheader ]
  %3 = tail call i32 @"_Z1$2bii"(i32 %.03, i32 %.012)
  %4 = tail call i32 @"_Z1$2bii"(i32 %.012, i32 1)
  %5 = tail call i8 @"_Z1$3cii"(i32 %4, i32 500000000)
  %6 = and i8 %5, 1
  %7 = icmp eq i8 %6, 0
  br i1 %7, label %_dale_internal_label_breaklabel_2, label %_dale_internal_label_continuelabel_1

_dale_internal_label_breaklabel_2:                ; preds = %_dale_internal_label_continuelabel_1, %entry
  %.0.lcssa = phi i32 [ 0, %entry ], [ %3, %_dale_internal_label_continuelabel_1 ]
  %8 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_dvidrl0, i64 0, i64 0), i32 %.0.lcssa)
  ret i32 0
}

[...]

Note those _Z1$2bii calls, the primitive operator functions are not inlined, despite being marked as alwaysinline. This leads to very bad performance, and on my machine it's 10x slower than an equivalent C program compiled with gcc.

After some investigation, I think this is caused by inefficient use of LLVM. As indicated by LLVM command line tools, LLVM can easily optimize the whole thing into printing a constant. With the following commands,

> llvm-as a.dt.ll
> opt -O1 a.dt.bc -o O1.bc
> llvm-dis O1.bc

A file O1.ll is produced:

[...]

; Function Attrs: nounwind
define i32 @main() local_unnamed_addr #2 {
_dale_internal_label_breaklabel_2:
  %0 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_dvidrl0, i64 0, i64 0), i32 1711656320)
  ret i32 0
}

[...]

I also think --no-dale-stdlib should always be enabled. Those Dale run-time functions are very small, and without them being properly inlined, performance of both compiled programs and compilation time (to run the macros) is largely degraded.

tomhrr commented 5 years ago

Thanks for this: I had noticed some of these problems when working on some recent updates to support newer versions of LLVM, but forgot to go back and fix them afterwards. The basic arithmetical/relational functions are now always inlined, and for the example program (a.dt), O1 or higher will cause it to be optimised into the printing of a constant.

tomhrr commented 5 years ago

There was an additional problem with those basic functions being retained in the output even when they weren't needed, which has now been fixed.

ChengCat commented 5 years ago

Hi, I have just tried to help testing the fixes. Dale now works beautifully with that example. Thanks!

I noticed two minor issues.

; in a.dt
(import cstdio)
(import macros)
(import b)

(def m (macro extern (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  (std.macros.mnfv mc sum)))
(def f (fn extern int (void)
  (def sum (var auto \ 0))
  (for (i \ 0) (< i 500000000) (incv i)
    (setv sum (+ sum i)))
  sum))

(def main (fn extern-c int (void)
  (printf "%d\n" (m))
  (printf "%d\n" (f))
  (printf "%d\n" (b.m))
  (printf "%d\n" (b.f))))

; in b.dt
(module b)
(import stdlib)
(import macros)
(namespace b
  (def m (macro extern (void)
    (def sum (var auto \ 0))
    (for (i \ 0) (< i 500000000) (incv i)
      (setv sum (+ sum i)))
    (std.macros.mnfv mc sum)))
  (def f (fn extern int (void)
    (def sum (var auto \ 0))
    (for (i \ 0) (< i 500000000) (incv i)
      (setv sum (+ sum i)))
    sum)))

The above two files are compiled as:

dalec -c b.dt -O4
dalec a.dt -O4 -o a

It is slow to use a macro from the same module, leading to slow compile time. Currently, it is already efficient to use macros from any other module, so I think this is minor issue.
The basic operator functions are retained in libb.bc.

tomhrr commented 5 years ago

Thanks for the extra comments. 2 is now fixed, but I don't think there's any way to fix 1, since the slowness appears to be due to compilation of the macro.

ChengCat commented 5 years ago

I thought the slowness of 1 was mainly caused by that, JIT compilation is another code path, and inline is not properly handled in that code path. But I am not sure.

I am closing this issue for now, since it is not important anyway.

ChengCat commented 5 years ago

I forgot to say.. Thank you!

tomhrr / dale

Primitive operator functions are not inlined #198