unisonweb / unison

A friendly programming language from the future
https://unison-lang.org

Cache Pure Top Level Definitions on startup #5379

Closed ChrisPenner closed 1 month ago

ChrisPenner commented 1 month ago

Overview

Before this change, all pure top-level definitions were re-evaluated at every call site, which is unnecessary work.

Now we can detect these using type info from the codebase, and inline the evaluated result into usage sites.
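As a rough Python analogue of the idea (the real implementation lives in the Haskell runtime and detects cacheability from type information, not a hand-written flag; all names here are hypothetical):

```python
import time

# Registry of hypothetical top-level definitions: name -> (thunk, is_pure).
definitions = {
    "map1K": (lambda: {i: i for i in range(1000)}, True),
    "now": (time.time, False),  # effectful: must be re-run on each call
}

cache = {}

def precompute_pure_definitions():
    """Evaluate every pure top-level definition once, at startup."""
    for name, (thunk, is_pure) in definitions.items():
        if is_pure:
            cache[name] = thunk()

def resolve(name):
    """Call sites get the cached value when one exists."""
    if name in cache:
        return cache[name]          # pre-evaluated, no recomputation
    thunk, _ = definitions[name]
    return thunk()                  # effectful defs evaluate every call

precompute_pure_definitions()
```

The point of the sketch: the pure value is built exactly once at startup, and every later `resolve` of it is just a lookup.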

We also discussed whether to bake the evaluated results into the compiled artifact. The consensus was yes, we should serialize the results, and I'd still like to do that, but after a day of fiddling it proved tougher than I expected. The issues were:

  1. Closures can contain Foreigns, which are arbitrary Haskell values. Most of these won't appear in a pure value, but some, like stdin/stdout Handles, do, as do patterns and things like text/bytes literals. These are all serializable in their own way, but it takes some care to ensure we do it correctly.
  2. Closures contain argument references which refer to closures via Seg 'BX, but this type is fixed in the Stack implementation to include inlined Sections, which we can't serialize. We can most likely either parameterize this type or replace usage sites with a more generic version, but it affects a big chunk of the code and is going to be a bit of a pain to figure out and propagate.

So serializing the pre-evaluated pure results is possible with more work, but it'd be nice to release the non-serializing version for now. The currently implemented version pre-evaluates cacheable values on executable startup.

When running something from a cold runtime (e.g. via ucm run or a fresh ucm session) there'll be a short delay while ucm evaluates everything, but it'll be snappy after that.

Implementation notes

It's a bit annoying, because in order to evaluate the pure definitions I need all of the other combs to already be in the right format for the runtime, so here's how it goes:

  1. Inline the new combs.
  2. Iterate over the new cacheable definitions and evaluate them, replacing each comb in the map with its evaluated closure.
  3. Re-inline the new combs so they now inline the evaluated results.
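A loose Python sketch of those three passes (real combs are compiled combinators in the Haskell runtime; here they're just functions closed over an environment, and all names are made up for illustration):

```python
def load_combs(new_combs, cacheable, env):
    # 1. Inline the new combs into the environment so everything is
    #    in the shape the "runtime" expects.
    env.update(new_combs)

    # 2. Evaluate each cacheable definition and replace its comb in
    #    the map with a constant returning the evaluated result.
    for name in cacheable:
        value = env[name](env)                 # evaluate against env
        env[name] = lambda _env, v=value: v    # constant comb

    # 3. "Re-inline": later lookups through env now see the evaluated
    #    results instead of re-running the original combs.
    return env

calls = []
combs = {
    "expensive": lambda env: calls.append("eval") or 42,
    "user": lambda env: env["expensive"](env) + 1,
}
env = load_combs(combs, cacheable=["expensive"], env={})
```

After loading, `env["user"]` can be called any number of times while `"expensive"` is only ever evaluated once, during step 2.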

Benchmarks

Made a test which references a 1k element map definition:

map1K = range 0 1000 |> map (i -> (i, i)) |> Map.fromList

pureTest : '{IO, Exception} ()
pureTest =
  do
    use List map replicate
    use Nat + - == range
    repeat n action =
      if n == 0 then ()
      else
        _ = action()
        repeat (n - 1) action

    printTime
      "Map.lookup (1k element map)" 1 let
        n -> repeat n do Map.get 12 map1K
======= Trunk
Map.lookup (1k element map)
3.361018ms

======= This Branch
Map.lookup (1k element map)
5.3µs
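Back-of-the-envelope arithmetic on the two timings above:

```python
# Speedup implied by the micro-benchmark numbers.
trunk_seconds = 3.361018e-3   # 3.361018 ms per lookup on trunk
branch_seconds = 5.3e-6       # 5.3 µs per lookup on this branch
speedup = trunk_seconds / branch_seconds
print(f"~{speedup:.0f}x")     # roughly 634x
```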

Loose ends

See overview about serializing closures.

dolio commented 1 month ago

Okay, so, maybe it's as simple as this...

Separate the Lam constructor of GComb into its own type. This is the type that should occur in PAp.

Make GComb a sum referring to this new type. The new type should be unpacked into GComb, and pattern synonyms could make using it look essentially like the current GComb.

Occurrences of the new, separate type don't need to be lazy, because they aren't from knot tying, they're the concrete entry info for an actual combinator.

ceedubs commented 1 month ago

@ChrisPenner I'm curious about when this kicks in and what "on startup" means. Could you help me understand? Some of my questions:

Here's one case that I'm wondering about:

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?

ceedubs commented 1 month ago

I should have mentioned in my last comment that I'm excited about this change, @ChrisPenner! I've seen repeated evaluation bite newcomers, and I've been curious to what extent it might be hitting us with certain Pattern values, decoders, etc. Thank you!

ChrisPenner commented 1 month ago

@ceedubs

For run, run.compiled, and run.native, does it only evaluate/cache the dependencies of the function being run?

It will run and cache each pure top-level definition as part of adding it to the runtime's code cache. We only load the dependencies of whatever we're trying to run, so it'll evaluate the dependencies of a main the first time it's added to the runtime's code cache (e.g. once per ucm session, or once at the start of a run.compiled when we load the code from a uc file).

You've specified that this applies to ucm run. Does it apply to run.compiled and run.native as well?

See above; for ucm run and run.compiled, the pre-evaluation triggers for each definition the first time it's added to the code cache.

run.native doesn't benefit from this change at all unfortunately.

It sounds like this also applies to interactive ucm sessions. Does it cache the result of every pure top-level term in the codebase at startup? Or when you switch to a project does it compute only the ones with names in the current branch of the project? Or something else entirely?

It only pre-evaluates the dependencies of things you try to run, and will cache evaluation results for the lifespan of that runtime, which for the most part corresponds to the lifetime of the UCM process. E.g. if you run main, it'll evaluate all cacheable definitions which main depends on (transitively). If you run main again, all the code is already loaded, so it won't do it again. If you change main, it'll load any new hashes and evaluate anything cacheable in ONLY the new hashes; old ones are still cached.
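A hypothetical sketch of that behaviour: the code cache is keyed by definition hash, so editing main only evaluates the hashes that are actually new (all names and hashes below are made up):

```python
evaluated = []  # records which definitions were (re-)evaluated

def load_into_cache(hashes, cache):
    for h, thunk in hashes.items():
        if h not in cache:        # already-cached hashes are untouched
            cache[h] = thunk()
            evaluated.append(h)
    return cache

cache = {}
# First run of main: every dependency hash is new.
load_into_cache({"#map": lambda: "big-map", "#dec": lambda: "decoder"}, cache)
# main is edited: one dependency changed hash, one is unchanged.
load_into_cache({"#map2": lambda: "big-map-v2", "#dec": lambda: "decoder"}, cache)
```

Only the changed definition (`#map2`) is evaluated on the second load; the unchanged `#dec` is served from the cache.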

Is this relevant for both the interpreter and the JIT runtime?

Only the interpreter for now.

Historically it hasn't been too bad to have a (pure) test in a project that checks 100k samples and takes 10 seconds to evaluate, because the result gets cached. But are we going to pay that 10 second penalty just to cache [Ok "passed"] in more places now (like when I run my main method or when I switch into that project)?

This is only an issue if you reference the test in your main.

We may want to look into optimizations for docs and termLinks where we don't necessarily depend on the actual result of a referenced definition 🤔

ChrisPenner commented 1 month ago

One downside of this change is if you have code like this:

main = do
  if something 
    then exit 0
    else Map.lookup k hugeExpensiveMap

Previously hugeExpensiveMap would only be evaluated when it was actually hit by the interpreter, so this program would exit immediately if something is true. Now we'll pre-evaluate hugeExpensiveMap once before even starting to run the program, which could lead to an increased startup time in some cases.

So this change will be good for longer-term processes like cloud and webservers, but potentially worse for short-lived processes like cli apps.
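The trade-off can be sketched in Python (hypothetical names; `huge_expensive_map` stands in for any expensive pure definition): lazy evaluation only builds the value when the branch actually needs it, while eager pre-evaluation pays the cost before the program logic runs at all.

```python
builds = []

def huge_expensive_map():
    builds.append("built")                 # stand-in for expensive work
    return {i: i * i for i in range(5)}

def run_lazy(something):
    """Pre-change behaviour: the map is built only if the branch hits it."""
    if something:
        return "exit 0"                    # map never built
    return huge_expensive_map()[3]

def run_eager(something):
    """Post-change behaviour: the map is built up front, regardless."""
    cached = huge_expensive_map()          # paid before the branch runs
    if something:
        return "exit 0"
    return cached[3]
```

A short-lived CLI that takes the early-exit path pays the build cost only in the eager version, which is exactly the startup-time concern described above.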

If the problem proves noticeable enough, we can probably fix this delay by continuing the Closure serialization work I mention in the overview, which I think we still want to do; that work was just starting to hit diminishing returns for how tricky it was, when we have bigger possible gains elsewhere 😄.

pchiusano commented 1 month ago

I noticed this broke the serialized-form round-trip golden tests. That might be fine, since I'd expect this could change the serialized form for some definitions, but it's worth checking.

@SystemFw do we have Unison Cloud client tests to make sure that no hashes change for schema types and functions? The way these could work is to just crypto.hash various values and save them as hex string literals; then the test does a comparison and fails if the hashes differ.

SystemFw commented 1 month ago

We now have a test in the Cloud client to test for the relevant hash changes. It passes on this branch.

aryairani commented 1 month ago

serializing the pre-evaluated pure results is possible with more work

fwiw, contrary to the majority opinion in the meeting, I don't think this would be a good idea. afaik other languages don't do it, apart maybe from some trivial constant folding — if someone really wants to precompute a pure result, they can always use add.run

pchiusano commented 1 month ago

@SystemFw cool, I'm good with merging this then, whenever @dolio and/or @ChrisPenner are ready.