Unboxed Primitive types

ChrisPenner commented 3 weeks ago

Overview

Up until now, the unboxed stack was used exclusively for control-flow within the interpreter itself; e.g. pattern-matching on data-type tags etc. Primitive types like Int and Nat would be unboxed from the boxed stack before every primitive op, then reboxed afterwards, which introduced ~3 extra instructions per primitive operation.

This change removes all that ceremony and now unboxed values are passed around and operated on directly, no need to repack them in between ops, cutting away those extra ~3 instructions per op.

We'll expect a speed up on primitive operations on unboxed types like addition and subtraction. Everything else should remain about the same.

Implementation notes

Every stack slow now conceptually stores a data Val = Val !Int !Closure
All the extra box/unboxing in between ops is removed.
Adds a bunch of stack debugging helpers that have proven useful. They're behind CPP flags so they can't possibly slow down the optimized build.

Test coverage

Benchmarks:

trunk -> this branch

fib1
710.259µs -> 545.141µs

fib2
2.65261ms -> 2.577623ms

fib3
3.006103ms -> 2.89451ms

Decode Nat
505ns -> 428ns

Generate 100 random numbers
301.766µs -> 251.432µs

List.foldLeft
2.523171ms -> 2.232896ms

Count to 1 million
277.6466ms -> 204.6666ms

Json parsing (per document)
283.115µs -> 274.763µs

Count to N (per element)
337ns -> 278ns

Count to 1000
337.363µs -> 286.412µs

Mutate a Ref 1000 times
572.589µs -> 487.293µs

CAS an IO.ref 1000 times
778.63µs -> 675.357µs

List.range (per element)
473ns -> 370ns

List.range 0 1000
485.818µs -> 405.503µs

Set.fromList (range 0 1000)
2.189539ms -> 2.145909ms

Map.fromList (range 0 1000)
1.594853ms -> 1.433498ms

NatMap.fromList (range 0 1000)
7.605928ms -> 6.144537ms

Map.lookup (1k element map)
4.188µs -> 3.49µs

Map.insert (1k element map)
10.254µs -> 8.331µs

List.at (1k element list)
444ns -> 372ns

Text.split /
31.704µs -> 34.373µs

Loose ends

The ANF operations still have BX/UN tags denoting the stack of each arg, but these are redundant and entirely unused, I'll take the time to remove them after this PR, but that's another large sweeping code change and it's good to merge into trunk in between large changes.

Note that, for now, we do still store the unboxed value's type on the boxed stack, this is for two reasons:

We use it when decompiling to know whether the bytes on the unboxed stack are a char, int, nat, double, etc.
Currently we sometimes need to determine whether a given stack slot is boxed or unboxed, and we can do that by looking at the boxed side of the stack.

We may in the future be able to avoid this with some re-arrangements, which would allow us to avoid one extra array write.

ChrisPenner commented 2 weeks ago

Hrmm, not sure how Dan's commits somehow got re-included here? I'll figure that out.

pchiusano commented 2 weeks ago

Nice!

Count to 1000 337.363µs -> 286.412µs

What do you think is going on here? I would expect we can get this down to 10-20µs based on the previous toy interpreter which was managing 75 million ops / s, which would be about 13µs to count to 1000.

Basically, what's still left todo that we're expecting big gains from, or is this a case of the interpreter being phrased that something isn't being inlined or there's allocation or function call overhead in a place we don't intend?

dolio commented 2 weeks ago

Nat.drop is still not a single instruction that gets inlined. That's one of the TBD comments.

Using == 0 also still involves a non-trivial wrapper.

ChrisPenner commented 2 weeks ago

Cody says we're passing the nimbus test-suite, he's going to run it in staging for a bit.

dolio commented 2 weeks ago

Is this otherwise ready to go?

ChrisPenner commented 2 weeks ago

@dolio it was until I introduced an infinite loop with that most recent change 🙃

dolio commented 2 weeks ago

Oh, is that why CI has been taking so long? :)

ChrisPenner commented 2 weeks ago

Yup, I had a cyclic pattern definition, just fixing that now :)

ChrisPenner commented 2 weeks ago

Oh, I should probably also give the universalCompare logic one last check to make sure it's consistent with the old version

ceedubs commented 2 weeks ago

When we deployed this to the Unison Cloud staging environment everything worked except we did experience hangs with one specific program. We should hold off on merging this until we figure that out.

ChrisPenner commented 1 day ago

Looks like it's up and running on staging now, which was the last blocker.

Merging!

unisonweb / unison