The general idea: (for the rest of this, we'll assume a 64-bit target; it works on 32-bit too, but the numbers are different)
Currently a Str takes up 16B on the stack; it's a (pointer : usize, length : usize) tuple. The pointer points to an array of UTF-8 bytes on the heap.
The small string optimization special-cases "small strings" (15 or fewer bytes of UTF-8) by storing their contents inline in the 16B you already have, without using the heap at all.
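As a sketch (assuming a 64-bit little-endian target; the struct and field names here are illustrative, not Roc's actual compiler internals), both representations fit the same 16 bytes:

```rust
// Two possible interpretations of the same 16 bytes (64-bit target).
// Names are illustrative, not Roc's actual compiler internals.

#[repr(C)]
struct BigStr {
    pointer: usize, // points to heap-allocated UTF-8 bytes
    length: usize,  // number of bytes
}

#[repr(C)]
struct SmallStr {
    meta: u8,        // flag + length byte, described under Implementation Notes
    bytes: [u8; 15], // up to 15 bytes of UTF-8 stored inline
}

fn main() {
    // The small representation costs no extra stack space.
    assert_eq!(std::mem::size_of::<BigStr>(), 16);
    assert_eq!(std::mem::size_of::<SmallStr>(), 16);
}
```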
Tradeoffs
This optimization has several tradeoffs:
Benefit: Small strings don't allocate on the heap and never need to be deallocated. (Often Roc strings will be bump-allocated in practice anyway, so this is not always as big a benefit as it might be in, say, Swift.)
Benefit: Small strings have great cache locality because you don't have to chase a (potentially distant) pointer to get their contents, which might require going all the way to main memory. This, to me, is the biggest selling point of this optimization in Roc.
Benefit: Small strings take up at least 8B less total memory per string, because they don't have to store the pointer, and also don't have to pad the heap-allocated bytes for alignment. That's not much on a per-string basis, but can add up if a lot of strings are in memory - e.g. if each string happens to be 9-15 bytes in length, they will all take up twice as much total memory (16B struct plus 16B alignment-padded bytes on the heap, compared to just the 16B struct).
Cost: Doing this optimization means many operations on strings require extra instructions. This adds a few extra cycles and increases binary size. (Of note, these extra instructions are far less costly than memory lookups unless the memory in question happens to already be in the L1 cache. Also note that these can be done using a cmov instruction, like folly/fbstring does, which not only doesn't have mispredictions, it also runs faster than correctly predicted branch jumps on modern CPUs.)
Cost: The compiler code around strings gets more complex.
Cost: Some tag unions may need bigger runtime representations due to Str not having a pointer for use as a tagged pointer in which to store the discriminant - see https://github.com/rtfeldman/roc/issues/322 for details. (If there is exactly one Str in the union, it can be used as the default branch as long as no other discriminants get assigned the bit pattern 000 - which the Str could have if it's a large Str and thus has a pointer in it - and also no other discriminants are assigned bit patterns ending in 1, since that would indicate a small Str. This works as long as Roc strings only ever store pointers directly returned by malloc, and are not used as string slices that can potentially point to arbitrary individual bytes within other strings.)
In Practice
Opponents of this optimization seem to say it introduces complexity and generated code bloat, and isn't worth it. Proponents (e.g. here, here, and here) suggest it is indeed worth it in practice.
This optimization is potentially a huge deal for applications that deal with lots of small strings. One example of such an application would be Roc's editor. Think of how many strings for tag names (e.g. True, False, Ok) and record field names would be small strings! Editor plugins will be written in Roc, so this optimization would directly impact cache locality and total memory usage of the editor.
Overall, it seems like most applications won't notice, and for some applications it'll be a very noticeable win. It's hard to imagine an application where it would be noticeably negative in terms of runtime performance; the most likely negative impact would be on final binary size, which Roc already intentionally sacrifices in order to improve runtime performance (see, for example, aggressive function inlining and monomorphization).
Implementation Notes
To work best with https://github.com/rtfeldman/roc/issues/322 we'd want to store the "is this string small?" flag in the lsbit of the pointer. On little-endian systems, this would be in the first byte, and then the following 15 bytes would be the small string's contents. (On big-endian systems, we'd want to reverse the tuple such that it would be (length, pointer) instead of (pointer, length) and then the first 15 bytes would be the string's data instead, followed by the metadata byte at the end.)
If the "this is a small string" flag is set to 1, then the entire byte surrounding that flag becomes the (bit-shifted) length of the small string. 7 bits can represent lengths of up to 127, but we already know the lengths only go up to 15 (or 31 in a hypothetical future 128-bit system), so 7 bits is plenty of room to store the length.
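A minimal sketch of this encoding on a little-endian target (the helper names are hypothetical, not Roc's actual runtime code): the metadata lives in byte 0, the contents in bytes 1..16.

```rust
// Sketch of small-string encode/decode, assuming the little-endian layout
// described above. Helper names are illustrative, not Roc's actual runtime.

const MAX_SMALL_LEN: usize = 15;

// Pack a short &str into the 16-byte representation, if it fits.
fn encode_small(s: &str) -> Option<[u8; 16]> {
    if s.len() > MAX_SMALL_LEN {
        return None; // must fall back to the heap representation
    }
    let mut buf = [0u8; 16];
    // Low bit set = "this is a small string"; upper 7 bits = length.
    buf[0] = ((s.len() as u8) << 1) | 1;
    buf[1..1 + s.len()].copy_from_slice(s.as_bytes());
    Some(buf)
}

fn is_small(buf: &[u8; 16]) -> bool {
    // A malloc'd pointer is at least 2-byte aligned, so a big string's
    // low pointer bit is always 0 and can never look small.
    buf[0] & 1 == 1
}

fn small_len(buf: &[u8; 16]) -> usize {
    (buf[0] >> 1) as usize
}

fn main() {
    let buf = encode_small("True").unwrap();
    assert!(is_small(&buf));
    assert_eq!(small_len(&buf), 4);
    assert_eq!(&buf[1..5], b"True");
    assert!(encode_small("a string longer than fifteen bytes").is_none());
}
```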
Since the max length of a string is the highest isize rather than usize (because pointers need to do signed arithmetic), it's also possible to store the flag as the msbit of the length instead of using the pointer. This does not interact well with https://github.com/rtfeldman/roc/issues/322 though.
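That alternative can be sketched as follows (names are hypothetical; this is only the flag check, under the assumption that real lengths never exceed isize::MAX and so never set the top bit):

```rust
// Sketch of the alternative encoding: the "small" flag lives in the
// most-significant bit of the length field. This is sound because string
// lengths never exceed isize::MAX, so a real length's top bit is always 0.
// Illustrative only.

const SMALL_FLAG: usize = 1 << (usize::BITS - 1);

fn is_small_via_length(length_field: usize) -> bool {
    length_field & SMALL_FLAG != 0
}

fn main() {
    // A big string's length field is just the length; its top bit is 0.
    assert!(!is_small_via_length(42));
    // Setting the top bit marks the string as small.
    assert!(is_small_via_length(SMALL_FLAG | 7));
}
```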
I think this is what Facebook does in their C++ string library, because they do a neat trick where they store the length in the final byte of the string, and they store it as (15 - length) so that it's a 0 when you're using all 15 bytes in the small string. This means you conveniently have a 15-byte string followed by a 0, which serves as a nul-terminator, and can pass a pointer to the whole 16 bytes directly to a function that wants a C string.
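A quick sketch of that trick, adapted to the 16-byte layout discussed here (fbstring's actual buffer size differs; names are illustrative): the length byte sits at the end and stores the spare capacity, so it becomes 0 exactly when the string is full.

```rust
// Sketch of the fbstring-style trick: the metadata byte lives at the END
// and stores (15 - len), so a full 15-byte small string ends in a 0 byte
// and the whole 16 bytes double as a nul-terminated C string.
// Illustrative only; fbstring's real buffer size and layout differ.

fn encode_fb_small(s: &str) -> Option<[u8; 16]> {
    if s.len() > 15 {
        return None;
    }
    let mut buf = [0u8; 16]; // unused bytes stay 0
    buf[..s.len()].copy_from_slice(s.as_bytes());
    buf[15] = (15 - s.len()) as u8; // 0 exactly when all 15 bytes are used
    Some(buf)
}

fn main() {
    let full = encode_fb_small("exactly15bytes!").unwrap();
    assert_eq!(full[15], 0); // doubles as the C-string nul terminator
    let short = encode_fb_small("hi").unwrap();
    assert_eq!(short[15], 13); // spare capacity, not a terminator
}
```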
I don't think that design would pay off in Roc, given how infrequently Roc will be converting to C strings compared to a C++ library.
Swift apparently uses some of the spare bits to store additional flags for string modes other than "small string" and "large string." We could do this too, but the cost would be further code bloat, so we shouldn't do it without good reason.
This is an optimization which Swift does, as does the C++ stdlib. Of note, Rust does not do it, and neither does Go.