vasanthaganeshk / unladen-swallow

Python2 Jit compiler
https://code.google.com/archive/p/unladen-swallow/
Other
0 stars 0 forks source link

Evaluate other string hash algorithms #82

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
During an internal discussion about hash techniques, it was suggested that 
Python might benefit from changing its string hashing algorithm; murmur 
(http://murmurhash.googlepages.com/) was specifically mentioned as a good 
candidate for evaluation.

Python strings cache their hash values, but an improvement in collision rates 
may well be visible, given how central dicts are to Python's implementation.

Original issue reported on code.google.com by collinw on 18 Sep 2009 at 7:59

GoogleCodeExporter commented 8 years ago

Thanks for turning me on to "murmur"!  This has generated no small amount of 
interest 
at work--it could be a boon to memcached.

But please note: Python's string hashing function has been tuned to Python's 
workload, particularly wrt dicts.  For example, strings where you "increment" 
the 
last character ("abc" to "abd", or "100" to "101") hash to the same value + 1.  
This 
would normally be considered a terrible hash algorithm as its avalanching is so 
rotten.  But it works well for Python dicts because it gives '"better than 
random" 
dict collision behavior for input strings very close together'.  I cite Tim 
Peters:
  http://www.mail-archive.com/python-3000@python.org/msg10194.html
See also the big comment near the top of "dictobject.c".

Not that you shouldn't experiment!  Just that you should tread lightly and 
benchmark 
the hell out of it.

Cheers,

/larry/

Original comment by exoticma...@gmail.com on 20 Sep 2009 at 11:37

GoogleCodeExporter commented 8 years ago
Right, we believe the current hash/hashtable is good, but it's always worth 
experimenting, especially as new hash techniques are developed. I've seen the 
""namea", "nameb", "namec", and "named" hash very closely, which means X, Y and 
Z" 
example in several places, but I'd be curious how many performance-critical 
dicts have 
such keys with those characteristics.

Original comment by collinw on 21 Sep 2009 at 6:57

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 6 Jan 2010 at 11:43

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 19 Feb 2010 at 9:43