feat(stdlib): Faster memory allocator

ospencer commented 3 months ago

Thanks to @mortax for feedback on the current allocator.

The current allocator implements malloc and free in O(n) to the size of the free list. This new implementation maintains separate free lists for small allocations and large allocations, and implements malloc and free in O(1) for small allocations and O(n) malloc to the size of the large free list and O(1) free.

The cost of this extra performance is a small amount of extra memory use and additional bookkeeping. For example, a List cons cell currently uses 40 bytes of memory, but now uses 64 bytes, as all small allocations are 64 bytes. For implementation details, see the large comment in malloc.gr.

The majority of allocations are very small, so this PR greatly improves performance of the allocator. Small allocations include:

Numbers (with the exception of large BigInts/Rationals)
Tuples/Arrays up to 8 elements
Records up to 6 elements
Variants up to 5 elements
Closures up to 6 elements
Bytes/Strings up to length 32

Closes #545

mortax commented 3 months ago

@peblair any chance you can peruse this?

peblair commented 3 months ago

This PR is awesome! I believe the new malloc implementation makes sense to me, but I have one question: why is there no morecore call anywhere for the small block list? I think this may just be a gap in my understanding of what the proposed memory layout is. cc @mortax

ospencer commented 3 months ago

Small allocations just return a block from the small block list, and if none are available the code path is exactly the same as the one for large blocks, i.e. take the next large block, take a chunk off off of it, return that. If there are no blocks left in the large list then it will call morecore.

ospencer commented 3 months ago

Small blocks and large blocks have the same memory layout. The only difference is which free list they live in, which is mostly the job of free to manage. malloc is just able to take a shortcut for small blocks if some are available. It's O(1) for small blocks because of the invariant that the large block list only contains blocks that are larger than small blocks, so we can always take a chunk of the first large block to create a small block, without having to search for one that's big enough.

ospencer commented 3 months ago

I said this during our community meeting, but I'll put it here too: FWIW, we originally exported getFreePtr for testing and now we expose leakAll instead, so there's no delta for testing functions. We never actually interface with malloc through this module, but rather the Memory module instead, which does not expose this function. If we don't expose this function then it's not really possible to test the functionality.

@peblair I'll update the contributor docs.

grain-lang / grain

feat(stdlib): Faster memory allocator #2124