Closed ianthehenry closed 1 week ago
A simple fix for this is to just add some extra hash mixing at the end of the string hash. We already do this for tuples and other structures to improve the hashing quality - for strings we are still just using a very basic DJB hash.
Pushed 5d1bd8a9320787c11b8eba10c78ad020840c85e6 that uses our existing janet_hash_mix
routine to add the string length into the string hash as well a make sure gensyms are far apart.
It's my own fault but some of my tests seem to have been relying on internal details and they now fail with https://github.com/janet-lang/janet/commit/5d1bd8a9320787c11b8eba10c78ad020840c85e6.
As a specific example, as might be expected, the return value of things like pairs
can be different from before.
With the change I now get:
$ janet
Janet 1.37.0-dev-5d1bd8a9 linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:b 2) (:a 1)]
before the change this was:
$ janet
Janet 1.37.0-dev-bafa6bff linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:a 1) (:b 2)]
Time to rewrite some tests perhaps (^^;
Update: tests have been updated. Found and fixed some tests that were problematic for other reasons so perhaps there was a net gain :)
Running Bauble's test suite before and after the new hash function:
Benchmark 1: ~/bin/janet-bafa6bff jpm_tree/bin/judge
Time (mean ± σ): 654.0 ms ± 3.8 ms [User: 632.7 ms, System: 19.8 ms]
Range (min … max): 648.5 ms … 657.7 ms 10 runs
Benchmark 2: ~/bin/janet-5d1bd8a9 jpm_tree/bin/judge
Time (mean ± σ): 607.1 ms ± 2.2 ms [User: 586.7 ms, System: 18.9 ms]
Range (min … max): 604.1 ms … 610.9 ms 10 runs
Not bad!
Hurray for ⌊(2^32)⁄𝜙⌋.
While debugging #1519 I noticed that the symbol cache is very densely packed with gensyms, which made the hash collision that breaks
symbol/slice
far more likely than it "should" be. (The application that actually triggered that bug had to traverse 4072 full buckets during the lookup in question, even though I think the resizing tries to keep the cache only half full.)A cheap workaround for this particular issue would be to hash strings in reverse order, which seems worth it given how much of the symbol cache consists of gensyms during compilation, but I didn't want to do that in case someone proposes an altogether better hash function.