Inline code for function calls in interpreter

ChrisPenner commented 2 months ago

Overview

Previously we stored only references to the code we wanted to jump to in things like function calls/applications even though the code each reference referred to was known at the time we translate to MCode.

This change statically resolves each reference to the code it refers to at code generation time, avoiding lookups at runtime.

Implementation notes

Parameterizes most of the MCode types over their comb type such that we can build the computations using CombIx, then "tie the knot" and resolve to Combs using a fixed point.
Add code to actually do the resolution in the appropriate spots (code-gen time, deserialization time)
~Also downgrades parser-typechecker to -O1 by Dan's suggestion, O2 is pretty intense (and slow) and can probably be sequestered off to just the runtime package now that it's split off to improve build times.~ Reverted by Paul's request

Interesting/controversial decisions

Test coverage

Existing tests should cover it.

Benchmarks:

@pchiusano/misc-benchmarks/chris:.suite

this-branch:
Decode Nat
748ns

Generate 100 random numbers
470.88µs

Count to 1 million
449.3574ms

Json parsing (per document)
377.202µs

Count to N (per element)
533ns

Count to 1000
547.879µs

Mutate a Ref 1000 times
1.011772ms

CAS an IO.ref 1000 times
1.259371ms

List.range (per element)
651ns

List.range 0 1000
666.46µs

Set.fromList (range 0 1000)
3.05485ms

Map.fromList (range 0 1000)
2.314902ms

Map.lookup (1k element map)
5.877µs

Map.insert (1k element map)
15.034µs

List.at (1k element list)
580ns

Text.split /
40.449µs

----------

trunk:
Decode Nat
947ns

Generate 100 random numbers
623.831µs

Count to 1 million
574.758ms

Json parsing (per document)
461.911µs

Count to N (per element)
690ns

Count to 1000
691.485µs

Mutate a Ref 1000 times
1.288178ms

CAS an IO.ref 1000 times
1.589608ms

List.range (per element)
875ns

List.range 0 1000
879.3µs

Set.fromList (range 0 1000)
3.849484ms

Map.fromList (range 0 1000)
2.987285ms

Map.lookup (1k element map)
7.849µs

Map.insert (1k element map)
19.599µs

List.at (1k element list)
768ns

Text.split /
52.997µs

@mitchellwrosen/mbta/@mitchellwrosen/branch-for-chris:.runTheBenchmark

this-branch:
read bytes: 0.201ms
utf8 decode byes: 0.165ms
json decode text: 387.17ms
jsonapi parse json: 126.466ms
mbta parse jsonapi: 139.128ms

trunk:
read bytes: 0.188ms
utf8 decode byes: 0.166ms
json decode text: 490.019ms
jsonapi parse json: 170.002ms
mbta parse jsonapi: 182.414ms

Loose ends

Nope.

pchiusano commented 2 months ago

Looking pretty good so far! Though less dramatic than I hoped. I'll be curious to see it with the same treatment applied to foreign function calls (not sure if you're planning separate PR for that), I'm guessing that will make the JSON decoding and other benchmarks that use a lot of foreign calls go faster.

Note that trunk is going to be worse the more code you have loaded, since the IntMap will have greater depth. A best case scenario for trunk is running benchmarks after freshly starting up UCM (or you can use compile command with ucm run.compiled).

But this isn't terribly realistic since (for instance) Unison Cloud nodes are kept running and are loading new code all the time. And most users aren't bouncing UCM willy nilly.

dolio commented 2 months ago

Reading through some of this, something occurred to me. Is there even a reason for us to number the term references at this point?

I guess it's a little more economical to make numbers for the references and store the latter once for e.g. compiled data. But for actual code functionality, we just make up numbers for the references, then get rid of the numbers by making things circular, right?

ChrisPenner commented 2 months ago

And yes, I'll do the Foreign Funcs as a separate PR so we can track each performance change individually.

ChrisPenner commented 2 months ago

@dolio

Is there even a reason for us to number the term references at this point? I guess it's a little more economical to make numbers for the references and store the latter once for e.g. compiled data. But for actual code functionality, we just make up numbers for the references, then get rid of the numbers by making things circular, right?

Yeah I think you're right, though tbh it's probably more work to remove them than it's worth at this point, I don't think they're doing any harm 🤷🏼‍♂️

dolio commented 2 months ago

Yeah, I wouldn't try to remove the numbering I guess.

Actually, the CI failures look suspicious. The mac one reports a stack overflow on the codeops test, so something there might be doing something bad with a circular representation.

ChrisPenner commented 2 months ago

There's an infinite loop when testing equality on recursive functions; I'm guessing we're naively crawling through and checking equality on all contained sections; probably just need to lower back to combix's when checking equality. Don't worry, we won't merge until CI is passing 😄

Failure case:

testcase = do
  f n = if n == 0 then 0 else f (Nat.drop 1 n)
  f == f

dolio commented 2 months ago

Oh, right.

Actually, you should just not derive Eq and Ord for RComb. It should test equality via the stored CombIx, no?

For that matter, I guess the Show instance for RComb could be more informative using the CombIx, too, right?

ChrisPenner commented 2 months ago

Haha, yeah I had the exact same ideas 😄

ChrisPenner commented 2 months ago

We've passed the nimbus test-suite now as well ✅

unisonweb / unison