unisonweb / unison

A friendly programming language from the future
https://unison-lang.org

Perform some optimizations during interpreter code generation #5432

Closed dolio closed 3 weeks ago

dolio commented 3 weeks ago

This PR implements two optimizations for the interpreter.

  1. When generating MCode, we look up arities for functions and emit more efficient Call instructions. This makes things like self-recursive loops more efficient.
    • It also fixes an opposite-logic bug in the stack size check on that path. Chris noticed it a while back, but it hadn't mattered because that path was never used.
  2. Just before generating MCode, we run an inlining pass on the A-normal code. It looks for very simple wrapper-like code, conceptually things like x y -> f x y y or x -> 3. When such definitions are fully or over-saturated, we try to inline them directly into the call site. Eventually, I think most builtins will be of this form, so saturated calls to builtins will be inlined to directly use the interpreter instructions.
    • Like I said, this pass happens just before generating MCode, so it only affects what is actually run on the interpreter. We can't really do it earlier, because the inlined code isn't proper to send around to other machines, and inlining would change the runtime hashes.
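For intuition, here's a hypothetical sketch of the saturation check and wrapper collapsing. All of the names and the toy IR below are invented for illustration; the real pass works on Unison's A-normal form inside the runtime, not on anything like this.

```python
# Hypothetical sketch of the inlining pass (invented names; not runtime code).
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Call:
    fn: str       # name of a top-level function
    args: tuple   # atomic arguments (Var or Lit), as in A-normal form

@dataclass(frozen=True)
class Defn:
    params: tuple  # parameter names
    body: object   # a Lit (like `x -> 3`) or a single Call (a wrapper)

def inline(expr, defs, depth=0):
    """Collapse fully- or over-saturated calls to trivial wrappers."""
    if not isinstance(expr, Call) or depth > 30:   # cap runaway inlining
        return expr
    d = defs.get(expr.fn)
    if d is None or len(expr.args) < len(d.params):
        return expr  # unknown target or under-saturated: leave the call alone
    if isinstance(d.body, Lit) and len(expr.args) == len(d.params):
        return d.body  # constant-returning function, fully applied
    if not isinstance(d.body, Call):
        return expr
    env = dict(zip(d.params, expr.args))
    extra = expr.args[len(d.params):]  # over-saturation: pass extras along
    args = tuple(env.get(a.name, a) if isinstance(a, Var) else a
                 for a in d.body.args) + extra
    return inline(Call(d.body.fn, args), defs, depth + 1)

# Wrappers in the style of `x y -> f x y y`: myadd2 forwards to myadd,
# and myadd3 partially applies myadd2.
DEFS = {
    "myadd2": Defn(("m", "n"), Call("myadd", (Var("m"), Var("n")))),
    "myadd3": Defn(("m",), Call("myadd2", (Var("m"),))),
}

# An over-saturated call to myadd3 collapses through both wrappers
# into a direct call to myadd.
print(inline(Call("myadd3", (Lit(1), Var("acc"))), DEFS))
```

Under-saturated calls are deliberately left untouched, since the wrapper's body doesn't have all the arguments it needs yet.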

Also included are some tweaks to the pretty printing for debugging MCode generation. I think I'm going to try one more tweak to that before merging this.

dolio commented 3 weeks ago

Results of @pchiusano/misc-benchmarks

| Benchmark | Without optimizations | With optimizations |
| --- | --- | --- |
| fib1 | 1.620115ms | 1.4078ms |
| fib2 | 5.263945ms | 4.922104ms |
| fib3 | 6.05794ms | 5.670696ms |
| Decode Nat | 1.107µs | 932ns |
| Generate 100 random numbers | 750.962µs | 647.73µs |
| List.foldLeft | 5.181553ms | 4.492801ms |
| Count to 1 million | 586.405189ms | 514.030489ms |
| Json parsing (per document) | 657.471µs | 611.919µs |
| Count to N (per element) | 769ns | 651ns |
| Count to 1000 | 768.651µs | 668.335µs |
| Mutate a Ref 1000 times | 1.3586ms | 1.114969ms |
| CAS an IO.ref 1000 times | 1.799077ms | 1.491692ms |
| List.range (per element) | 1.047µs | 887ns |
| List.range 0 1000 | 1.110156ms | 956.614µs |
| Set.fromList (range 0 1000) | 5.253066ms | 4.234707ms |
| Map.fromList (range 0 1000) | 3.9308ms | 3.200286ms |
| NatMap.fromList (range 0 1000) | 18.409758ms | 16.115533ms |
| Map.lookup (1k element map) | 11.931µs | 7.848µs |
| Map.insert (1k element map) | 23.333µs | 20.428µs |
| List.at (1k element list) | 1.25µs | 822ns |
| Text.split / | 72.824µs | 69.549µs |

dolio commented 3 weeks ago

The inlining part doesn't seem to make much difference right now (at least on those benchmarks), but it should start being more significant with some of the stuff Chris is working on.

dolio commented 3 weeks ago

Here's a contrived example where it makes a difference:

indirect.myadd : Nat -> Nat -> Nat
indirect.myadd m n = m + n

indirect.myadd2 m n = myadd m n

indirect.myadd3 : Nat -> Nat -> Nat
indirect.myadd3 m = myadd2 m

indirect.loop acc = cases
  0 -> acc
  n -> indirect.loop (myadd3 1 acc) (n-1)

indirect.main = do timed do indirect.loop 0 10000000
Without inlining:
took: 6.36881337s

With inlining:
took: 4.942787153s

You can add more indirections and, with inlining, it makes no difference, unless you added more than 30 of them, which would trigger my "we're in an infinite inlining loop" check. I could add more complications to handle cases like that, but it doesn't seem worth it right now.
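The cutoff behavior can be sketched on its own. This is a hypothetical illustration with invented names, not the actual runtime code: a wrapper chain of any length under the cap collapses to the same final target, while a chain past the cap, or a self-referential wrapper, stops inlining partway.

```python
# Hypothetical sketch of the inlining-depth cutoff (invented names).
# `wrappers` maps a wrapper function's name to the function it forwards to;
# anything not in the map is a "real" definition.
def resolve(fn, wrappers, limit=30):
    """Follow wrapper indirections until a non-wrapper is reached,
    giving up after `limit` steps in case we're in an inlining loop."""
    depth = 0
    while fn in wrappers:
        if depth >= limit:
            return fn  # probably an infinite inlining loop: stop here
        fn = wrappers[fn]
        depth += 1
    return fn

# A chain of 20 indirections collapses all the way to the real function,
# and adding more layers (up to the cap) gives the same result.
chain = {f"wrap{i}": f"wrap{i + 1}" for i in range(19)}
chain["wrap19"] = "myadd"
print(resolve("wrap0", chain))   # myadd

# A self-referential wrapper would inline forever; the cap stops it.
print(resolve("loop", {"loop": "loop"}))   # loop
```

A fancier pass could detect cycles exactly (e.g. by tracking visited names), but as noted above, a simple depth cap is enough in practice.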