janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.56k stars 229 forks source link

default string hash produces a lot of consecutive ints for (gensym) symbols #1520

Closed ianthehenry closed 1 week ago

ianthehenry commented 1 week ago

While debugging #1519 I noticed that the symbol cache is very densely packed with gensyms, which made the hash collision that breaks symbol/slice far more likely than it "should" be. (The application that actually triggered that bug had to traverse 4072 full buckets during the lookup in question, even though I think the resizing tries to keep the cache only half full.)

repl:1:> (hash (gensym))
135496779
repl:2:> (hash (gensym))
135496780
repl:3:> (hash (gensym))
135496781
repl:4:> (hash (gensym))
135496782
repl:5:> (hash (gensym))
135496773
repl:6:> (hash (gensym))
135496774

A cheap workaround for this particular issue would be to hash strings in reverse order, which seems worth it given how much of the symbol cache consists of gensyms during compilation, but I didn't want to do that in case someone proposes an altogether better hash function.

bakpakin commented 1 week ago

A simple fix for this is to just add some extra hash mixing at the end of the string hash. We already do this for tuples and other structures to improve the hashing quality - for strings we are still just using a very basic DJB hash.

bakpakin commented 1 week ago

Pushed 5d1bd8a9320787c11b8eba10c78ad020840c85e6 that uses our existing janet_hash_mix routine to add the string length into the string hash as well a make sure gensyms are far apart.

sogaiu commented 1 week ago

It's my own fault but some of my tests seem to have been relying on internal details and they now fail with https://github.com/janet-lang/janet/commit/5d1bd8a9320787c11b8eba10c78ad020840c85e6.

As a specific example, as might be expected, the return value of things like pairs can be different from before.

With the change I now get:

$ janet
Janet 1.37.0-dev-5d1bd8a9 linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:b 2) (:a 1)]

before the change this was:

$ janet
Janet 1.37.0-dev-bafa6bff linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:a 1) (:b 2)]

Time to rewrite some tests perhaps (^^;


Update: tests have been updated. Found and fixed some tests that were problematic for other reasons so perhaps there was a net gain :)

ianthehenry commented 1 week ago

Running Bauble's test suite before and after the new hash function:

Benchmark 1: ~/bin/janet-bafa6bff jpm_tree/bin/judge
  Time (mean ± σ):     654.0 ms ±   3.8 ms    [User: 632.7 ms, System: 19.8 ms]
  Range (min … max):   648.5 ms … 657.7 ms    10 runs

Benchmark 2: ~/bin/janet-5d1bd8a9 jpm_tree/bin/judge
  Time (mean ± σ):     607.1 ms ±   2.2 ms    [User: 586.7 ms, System: 18.9 ms]
  Range (min … max):   604.1 ms … 610.9 ms    10 runs

Not bad!

sogaiu commented 1 week ago

Hurray for ⌊(2^32)⁄𝜙⌋.