MTY_HashMap spends CPU time computing a 32-bit hash of each key string, but then throws the hash away, which forces a string comparison against every key in a bucket. When there are many more keys than buckets, the cost is 1/4 of a prefetcher-predictable cache line plus 1 dependent memory fetch per key (a char* into RAM that the prefetcher can't predict). After this PR, only 1/16 of a predictable cache line is read per key, and there are ~0 dependent memory fetches per key.
Hash map data structures traditionally store a contiguous array of hashes in memory, so that with no extra effort on the programmer's part, prefetcher hardware, compiler vectorization, and SIMD instructions accelerate searches by a factor of 10X or more, compared to storing no hash and scattering the string comparisons across RAM.
This PR stores the 32-bit hashes in a contiguous array, which is expected to improve worst-case performance by about 10X.
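A minimal sketch of the technique (the struct layout and names are hypothetical, for illustration — not libmatoya's actual internals): the contiguous hash array is scanned first, and the dependent char* fetch plus string compare only happen on a 32-bit hash match.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical bucket layout illustrating the technique: hashes live in
// a contiguous array the prefetcher can stream through, while the
// scattered key strings are only dereferenced on a hash match.
struct bucket {
    uint32_t *hashes;  // contiguous: cheap, predictable scan
    char    **keys;    // scattered char*: expensive dependent fetch
    void    **values;
    size_t    len;
};

static void *bucket_lookup(const struct bucket *b, const char *key, uint32_t hash)
{
    for (size_t i = 0; i < b->len; i++) {
        // Compare the stored hash first; b->keys[i] is only fetched
        // (and strcmp only runs) when the 32-bit hashes collide.
        if (b->hashes[i] == hash && !strcmp(b->keys[i], key))
            return b->values[i];
    }
    return NULL;
}
```

With random 32-bit hashes, a false hash match (and therefore a wasted string fetch) happens with probability ~2^-32 per non-matching key, so nearly every scanned key is rejected from the contiguous array alone.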
In the future, the first 8 bits of each hash can instead be stored contiguously in an auxiliary array, so that for each 1/4 cache line read by the prefetcher, a run of 16 keys can be disqualified in 3 SIMD instructions, without branching, with ~93% probability (a random 8-bit tag matches with probability 1/256, so all 16 keys miss with probability (255/256)^16 ≈ 0.939).
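That future direction (not part of this PR) could look roughly like the following SSE2 sketch, assuming x86 and a hypothetical `match_mask` helper: one broadcast, one byte-wise compare, and one movemask cover 16 one-byte tags at once, and a zero mask disqualifies all 16 keys with no branch per key.

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

// Compare 16 contiguous 1-byte hash tags against a probe tag.
// Returns a bitmask with bit i set iff tags[i] == probe; a result of 0
// disqualifies all 16 keys. The load and the probe broadcast can be
// hoisted, leaving cmpeq + movemask + a mask test in the hot loop.
static uint32_t match_mask(const uint8_t tags[16], uint8_t probe)
{
    __m128i t  = _mm_loadu_si128((const __m128i *)tags); // 16 tags
    __m128i p  = _mm_set1_epi8((char)probe);             // broadcast probe
    __m128i eq = _mm_cmpeq_epi8(t, p);                   // byte-wise ==
    return (uint32_t)_mm_movemask_epi8(eq);              // 1 bit per slot
}
```

This is the same byte-tag trick used by "Swiss table" style hash maps; only the slots whose bits are set in the mask need the full 32-bit hash (or string) comparison.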