janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.45k stars 223 forks

Variadic vm ops? #1212

Closed. primo-ppcg closed this issue 1 year ago.

primo-ppcg commented 1 year ago

One thing I've stumbled over a few times is that variadic function calls appear to be much slower than fixed arity. However, when I examined it a bit closer, I found that's generally not the case. For example:

(def f1 |(string/slice :abc $))
(def f2 |(string/slice :abc ;$&))

(printf "%s %q" :f1 ((disasm f1) :bytecode))
(printf "%s %q" :f2 ((disasm f2) :bytecode))

(timeit-loop [:repeat 10_000_000] :f1 (f1 1))
(timeit-loop [:repeat 10_000_000] :f2 (f2 1))

# ->

f1 @[(ldc 1 0) (push2 1 0) (ldc 1 1) (tcall 1)]
f2 @[(ldc 1 0) (push 1) (pusha 0) (ldc 1 1) (tcall 1)]
f1 0.902s, 0.09022µs/body
f2 1.292s, 0.1292µs/body

Slightly slower, but not dramatically so. The only difference here is the extra pusha, which is necessary, and the implementation of pusha appears to be about as trim as possible.

However, there is a significant difference when the function is a vm op:

(def f1 |(< 0 $))
(def f2 |(< 0 ;$&))

(printf "%s %q" :f1 ((disasm f1) :bytecode))
(printf "%s %q" :f2 ((disasm f2) :bytecode))

(timeit-loop [:repeat 10_000_000] :f1 (f1 1))
(timeit-loop [:repeat 10_000_000] :f2 (f2 1))

# ->

f1 @[(ldi 2 0) (lt 1 2 0) (ret 1)]
f2 @[(ldi 1 0) (push 1) (pusha 0) (ldc 1 0) (tcall 1)]
f1 0.333s, 0.03326µs/body
f2 1.718s, 0.1718µs/body

When called with a splice, operators which are implemented directly in the vm are evaluated via call/tcall rather than being evaluated directly, which in this example is over 5 times slower. My question is: must it necessarily be so?

It seems that if the compiler can determine that the function to be called is a vm op in the fixed arity case, it should also be able to do so in the variadic case, and then use a call which isn't quite so expensive.

It might also be worth taking a look at call/tcall, because they do seem to be significant performance bottlenecks.

bakpakin commented 1 year ago

I'm not sure how implementing variadic VM ops would help - the issue seems to be the function call overhead, which is unfortunately part of how the VM works. The easiest solution would be function inlining, which isn't done right now for debugging purposes, although the nature of Janet's default early binding makes this doable in principle rather simply (most of the time, functions are not going to be redefined dynamically in a way that would invalidate an inlining, and when they are, we can determine that statically).
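As a hedged illustration of that early-binding point (the names `twice` and `g` are invented for the example): a closure compiled against a def'd function keeps referring to the binding it was compiled with, even if the name is later redefined - which is what would make inlining safe most of the time.

```janet
# Hypothetical demo of Janet's default early binding: `g` is
# compiled while the first `twice` is in scope and loads it as a
# constant, so the later redefinition does not affect `g`.
(defn twice [x] (* 2 x))
(def g |(twice $))
(defn twice [x] (* 3 x))
(print (g 5)) # prints 10, not 15
```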

Many interpreters make function calls faster by avoiding pushing data to the stack - for example, by pointing to a contiguous span of slots and using them as the arguments. If the function being called is a C function, no new stack frame needs to be created. Janet doesn't do this because it makes the compiler a little more complicated and isn't easily adaptable to splicing. But even that is much slower than just inlining the function.
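The raw call overhead can be observed even without splicing, by comparing an op that compiles in-line against the same operation routed through an ordinary function call (a sketch; `add1` is a made-up helper and timings are machine-dependent):

```janet
# `direct` compiles the addition to an in-line vm op; `called`
# goes through the generic call/tcall machinery instead.
(defn add1 [x] (+ x 1))
(def direct |(+ $ 1))
(def called |(add1 $))

(timeit-loop [:repeat 10_000_000] :direct (direct 1))
(timeit-loop [:repeat 10_000_000] :called (called 1))
```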

primo-ppcg commented 1 year ago

The idea is that there wouldn't actually need to be a function call. For example:

(def f1 |(<= 400 $ 600))
(def f2 |(<= 400 ;$& 600))
(def f3
  (asm
    ~{:bytecode @[(ldi 2 400)
                  (ldn 4)
                  (next 4 0 4)
                  (jmpni 4 6)
                  (in 3 0 4)
                  (lte 1 2 3)
                  (jmpno 1 5)
                  (movn 2 3)
                  (jmp -6)
                  (ldi 3 600)
                  (lte 1 2 3)
                  (ret 1)]
      :max-arity 65535
      :vararg true}))

(pp ((disasm f1) :bytecode))
(pp ((disasm f2) :bytecode))
(pp ((disasm f3) :bytecode))

(for i 0 1000
  (assert (= (f1 i) (f2 i) (f3 i))))

(timeit-loop [:repeat 10_000 i :range [0 1000]] :f1 (f1 i))
(timeit-loop [:repeat 10_000 i :range [0 1000]] :f2 (f2 i))
(timeit-loop [:repeat 10_000 i :range [0 1000]] :f3 (f3 i))

# ->

f1 @[(ldi 2 400) (lte 1 2 0) (jmpno 1 3) (ldi 2 600) (lte 1 0 2) (ret 1)]
f2 @[(ldi 1 400) (push 1) (pusha 0) (ldi 1 600) (push 1) (ldc 1 0) (tcall 1)]
f3 @[(ldi 2 400) (ldn 4) (next 4 0 4) (jmpni 4 6) (in 3 0 4) (lte 1 2 3) (jmpno 1 5) (movn 2 3) (jmp -6) (ldi 3 600) (lte 1 2 3) (ret 1)]
f1 0.364s, 0.03637µs/body
f2 2.236s, 0.2236µs/body
f3 1.025s, 0.1025µs/body

which is already about twice as fast. Because the argument is known to be an array/tuple, the next/in pair could certainly be optimized as well (an ldan/LOAD_ARRAY_NEXT instruction, or similar). I also think the inner loop could be encapsulated as a single vm op, which I suspect would be nearly as fast as the fixed arity case.
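For reference, the loop in the hand-written bytecode above can be expressed as ordinary Janet (a hedged sketch; `lte-chain` is a name invented for the example, and the compiled output won't match f3 instruction for instruction):

```janet
# Plain-Janet equivalent of the f3 loop: chain <= through 400,
# each argument in turn, then 600, short-circuiting on failure.
(defn lte-chain [& xs]
  (var prev 400)
  (var ok true)
  (each x xs
    (when ok
      (set ok (<= prev x))
      (set prev x)))
  (and ok (<= prev 600)))

(for i 0 1000
  (assert (= (<= 400 i 600) (lte-chain i))))
```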