Resolve issues with `Nat`/`Int` in Scheme

Update on this, we did a more detailed design review and came up with a better solution, Idea 3 below. Benchmarks confirm this is just as fast as using the native Nat and Int functions, so we're going with that. TL;DR: the JIT Nat and Int types will behave just like in the interpreter and be fast.

Original discussion recorded here for posterity.

In the Haskell interpreter, Nat and Int are implemented as 64 bit machine integers. However, Scheme (at least Chez and Racket) fixnums are tagged, and have fewer than 64 bits even in a 64 bit implementation. There are also no signed vs. unsigned fixnums, so at the moment, if we used them as the backing for Nat and Int we would have 61 bit integers and 60 bit naturals. We're guessing that this is related to using word aligned memory, which leaves 3 bits always 0 at the end of a 64 bit pointer.

Scheme/Racket has arithmetic operations that don't force you to use fixnums, though. For instance, addition will automatically convert to an arbitrary precision number if the result is too large to be stored in a fixnum. The downside is that operations on the arbitrary precision values are considerably slower (comparable to GHC's Integer). So, it is possible to cover the full 64 bit space this way, but values that actually need 64 bits to represent are slow.

After some testing, it seems like this is still the best way forward, though. Surprisingly (to me), it doesn't seem like using the general operations actually has a significant cost. A Racket loop that adds up many numbers doesn't seem to run slower using + than it does using unsafe-fx+ or similar. Presumably there's some optimization that allows the extra cost to only be incurred if an overflow happens, and otherwise the performance on small values is identical.

With that in mind, we decided on the following strategy for resolving this issue:

Use general arithmetic operations in the scheme implementation. This means that Int and Nat are technically arbitrary precision. But, you will only get full performance on ~61 bit values. Nevertheless, one needn't worry that code assuming 64 bits are available will fail
Only guarantee that Nat and Int can represent at least 64 bits of information. This means that the Haskell interpreter doesn't have to be rewritten to use arbitrary precision operations, which would be a considerable amount of work. When the interpreter is eventually retired, we can revisit the decision about what precision is required to be available.
Eventually introduce smaller integer/natural sizes, like Nat32 and Int16. These fit in a fixnum, and can be used for people who don't want to accidentally flip into a much slower mode of operation. It doesn't seem like Nat/Int64 will be possible to implement efficiently in Scheme, but hopefully the smaller sizes will be good enough for the relevant use cases (like bitwise operations and such).

Here are some things we considered, but they seem less desirable than the above plan:

Specify that Nat and Int only guarantee 60/61 bits, so that fixnum-only operations can be used
- This would require changes to assorted unison code to make sure the logic still makes sense, which is rather more of a problem than performance degradation at extreme values. Also, it doesn't seem that using dedicated fixnum operations actually gains any performance for programs that stay within the fixnum range.
Specify that Nat and Int are exactly 60/61 bits.
Specify that Nat and Int are arbitrary precision immediately
- Both these would require significant modifications to the Haskell interpreter
Implement identical (to Haskell) 64 bit behavior in Scheme in the mean time
- Doing things like (bitwise-and (...) #xffffffffffffffff) forces the code into the bignum regime, because the mask requires a bignum to represent. This means you never get the good performance for 61 bit and below values.
- By contrast, masking for 32-bit arithmetic seems to have a very marginal cost.
Somehow use FFI to have actual 64 bit operations from C
- I would be surprised if FFI overhead did not make this un-viable, although I haven't tested it. I'm not even certain it's really possible.

CC @pchiusano @jaredly

Update on this, we did a more detailed design review and came up with a better solution, Idea 3 below. Benchmarks confirm this is just as fast as using the native Nat and Int functions.

Including full discussion here for posterity.

Context

We'd like to decide how to represent Nat and Int in the Unison runtime, what the spec should be for these types going forward, and if we're changing anything about the spec, how do we roll this out with minimal impact to users?

This issue came up because the current JIT implementation doesn't have a good way of representing unboxed integers and natural numbers at full 64-bit width. Instead, both pointers and unboxed ints / nats occupy the same slots at runtime. Since pointers are byte-aligned, on a 64-bit architecture this leaves three bits which are always 0 which can store additional data to distinguish pointers from unboxed numbers. (See tagged pointer) Racket's integer type uses this to have fast unboxed arithmetic, with values getting promoted to arbitrary precision when they exceed the 61 bits.

This implementation choice is not uncommon; many runtimes use pointer tagging as it's simple to implement and has good performance. (The V8 JS runtime does, and even the Haskell spec for Int actually only requires "at least 30 bits of precision", specifically to allow for this sort of thing. GHC ended up going in a different direction.)

Notably, it works out nicely for higher-order code, even without fancy optimizations, which makes it a nice choice for functional languages where higher-order code is common. For instance, if you pass a f : Nat -> Nat to List.map, there's no boxing and unboxing happening on the calls to f while looping over the list. In contrast, in Haskell or Scala, you'd be reliant on f getting inlined and a worker-wrapper optimization to get this into a form where the arithmetic could happen unboxed. Unboxed arithmetic is faster and doesn't do needlessly memory allocation.

Desiderata

All else being equal, these are some good properties, in no particular order. There are tradeoffs among several of these desiderata:

Performance. The whole point of the JIT is to get good performance.
Ease of upgrade. Ideally the JIT is a drop-in replacement for the current runtime and just magically makes everything faster. If there's work that people need to do to upgrade their code to work with the JIT, we should strive to minimize that work.
As a special case of this, as much as possible, we should try to avoid silent changes to semantics that cause previously working code to behave differently. If we are okay changing semantics, we should do so in ways that aren't likely to affect lots of code people have written.
Minimizing fragmentation. All else being equal, it would be nice to have a single Int / Nat type as we do now which performs well, rather than multiple types floating around that people use inconsistently.
Timeliness. We'd like to ship the JIT sooner rather than later and we'd like to deploy it to Unison Cloud sooner rather than later.
A clear plan for rolling out any changes, including how we'll communicate to users.
What else?

Again, these desiderata are "all else being equal". There may not be an approach that fully satisfies all of the above.

Idea 1: Boxed `Nat` and `Int`, separate `Natural` and `Integer` types

The idea here: Nat and Int continue to be full 64-bit width and have the exact same semantics as today. The JIT will have to use boxed arithmetic to provide these types. Dan has done a microbenchmark with and reported this will be about 4-5x slower than using the unboxed primitives (this was with a version with fxvector with two fixnums in it - a version using bignums for everything + masking / mod was about 8x slower).

Idea 1a: alongside this change, introduce Natural and Integer builtin types, backed in the JIT by the fancy tagged pointers that support fast unboxed arithmetic. On the JIT, these will be more performant and don't require worrying about overflow. In the Haskell runtime, these can be backed by the usual boxed arbitrary precision numbers.

Rollout plan:

If we ignore 1a, there's nothing to do. Semantics are the same. We take a 4-5x performance hit and have boxed arithmetic happening everywhere.
If we're doing 1a, then there's the question of how to roll it out. Obviously, we could just introduce the new types in base and hope for the best. People will be inclined to upgrade their code to work with Natural or Integer since performance will be better on the JIT. But unless it were done in a coordinated way (in "dependency order", which feels very stop-the-world), it seems like it could be a huge mess.

Desiderata summary:

Not great for performance or fragmentation, quite good for everything else.

Idea 2: change spec of `Nat` and `Int`

We change the spec of Nat and Int as follows. They are "at least 64 bits". If an operation produces a number which does not fit in 64 bits, the result is unspecified (implementation dependent). Correct code that works with all implementations will use an explicit mod or masking as needed.

Idea 2a: We later change the spec of Nat and Int to be arbitrary precision, once the JIT is a full replacement for the Haskell runtime. We can introduce other width number types at that point too (like Nat32, Nat60, etc) which are guaranteed to be unboxed and have implicit mod / masking, though there's less need for them. It's likely that they're useful for some bit twiddling applications.

Desiderata summary:

This is best for performance and doesn't lead to any fragmentation (people just continue to use Nat and Int as before), but is worse for everything else.

Rollout plan:

There are variants of this idea. As a straw man, if we think the implicit 64-bit behavior only shows up when people use multiplication and Nat.shiftLeft, we could have these and only these operations continue to return 64-bit truncated results (by going through boxed implementations). Possibly kinda weird.

Idea 3: 64 bit `Int` and `Nat` via a separate check

This is an interesting and possibly performant way to get full 64-bit Int and Nat on the JIT, without a huge hit to performance, hence much less need for everyone to switch to using Natural and Integer. Here's the idea (pseudocode). I'll use multiplication as an example, but it works for other operations:

maxNat61 = 2^61 - 1
n1 * n2 =
  r = Natural.multiply n1 n2
  if r <= maxNat61 then r
  else Natural.mod r (2^64)

So in the fast path, it does the multiplication unboxed and is less than maxNat61, and returns immediately. No boxing or unboxing.

In the event that it doesn't fit in 61 bits, it calls the arbitrary precison mod operation on the result. This results in boxing, but won't affect "most" code.

Performance of this seems good from initial microbenchmark, about the same as unboxed arithmetic. We suspect that CPU branch prediction is working for us here. The fast path is almost always taken so it's very cheap on a modern CPU. That said, Dan is doing some further testing for actual generated Unison code, which isn't quite as clean as what he wrote by hand and may stymie the optimizer in some way that tanks the performance of this approach.

Assuming this works out though, this seems like our winner. We get the same semantics as the current Nat and Int types, fast performance, and hit all of our desiderata. We can still later introduce Natural and Integer at our leisure, but since performance is good with Nat and Int, people aren't compelled to migrate their code. Instead, Natural and Integer will be somewhat special purpose, for situations where you really do need arbitrary precision.

unisonweb / unison