tinygo-org / tinygo

Go compiler for small places. Microcontrollers, WebAssembly (WASM/WASI), and command-line tools. Based on LLVM.
https://tinygo.org
Other
14.72k stars 858 forks source link

runtime: Rewrite hashmap data structure #4297

Open lpereira opened 2 weeks ago

lpereira commented 2 weeks ago

This structure now deviates quite a bit from the one used in the Big Go implementation, and is more suitable for enviroments with severe memory constraints.

A hashmap contains a linked list of buckets, which can store up to 8 elements; this number of buckets are pre-initialized depending on the size hint during hashmap instantiation. An additional bucket is allocated every time it's necessary; a bucket will have 100% utilization before needing a new one.

Only the topmost 8 bits of the hash of a key (tophash) is stored per key, and multiple keys with the same tophash can exist in the same bucket. A tophash of 0 indicates an empty slot in a bucket. A fast byte scan is used to determine which slots correspond to a particular tophash inside a bucket, which makes up for the fact that a linear scan through every bucket is necessary to find one that might contain a particular key.

Bucket slots aren't indexed by the hash of a key, so there's no need to carry a seed per hashmap instance to avoid issues such as the one described by Crosby and Wallach[1].

A bucket will have 100% utilization before we need another one, and we use a fast byte lookup to know which bucket corresponds to a particular tophash. There's no rehashing when the hashmap grows either, which is done to avoid generating too much garbage and reusing memory allocations as much as possible.

Some things that I still want to do include:

[1] https://www.usenix.org/legacy/events/sec03/tech/full_papers/crosby/crosby.pdf

CC @dgryski

lpereira commented 2 weeks ago

(I found a way to improve this quite a bit, and am working on it. It should reduce per-bucket memory usage at the same time it makes lookups faster.)

dgryski commented 1 week ago

Since this seems to be shifting the tradeoffs of the hashmap data structure, I wonder if we want to make it build-time configurable with the current implementation (similar to how gc's are also selectable).

aykevl commented 1 week ago

@lpereira could you maybe do some benchmarks to compare the two implementations? That will help to determine whether we need to keep both of them around. I'd like to avoid having to select between the two if possible, because it increases the maintenance burden and also extra options aren't necessarily a good thing for users (if we have multiple implementations, it's probably the compiler that should pick the best implementation for a given system).