new hash table - Githubissues

rurban commented 9 years ago

The old hash table uses:

linked lists for its collisions, with slow out-of-cache pointer chasing and data overhead.
unsorted flags at the end, while some flags are needed for compare.
has questionable security measures to slow down all cases. seed ok, randomize iter maybe, but randomize the collisions and slow hash funcs is stupid. the security should be fixed with proper implementations, not by pseudo-security theatre.

either use:

rurban/hash-sortbuckets: a sorted static list of collisions (up to 8, maybe start with 3, then 8) as in knuth sorted hash table,
khash: use open addressing as everyone else. faster, less space.
✓ PERTURB_KEYS_TOP move-to-front with a linked list is the only sane strategy for simple chained bucket lists.
✓ HV_FILL_RATE: try lower fill rates than 100%. 100% is pretty insane, esp. with our bad hash funcs. make it fast with builtin_ctz.
✓ use __builtin_ctz for faster division in DO_HSPLIT. Allow -DHV_FILL_RATE=90 definition. (Tested to be the best with FNV1A)
HE_ARRAY: According to http://goanna.cs.rmit.edu.au/~jz/fulltext/spire05.pdf the best for chained hashing is currently a cache-friendly array: cache-friendly continuous buffer of HE's w/ inlined HEK (char) + SV val, but no hash, no next ptr. Also for shared he's: PL_strtab.
✓ array_he: inline parts of the HE into the array. array_he vs ll_he. (linked list, see also the he-array branch). array_he (HvARRAY = AHE[]) can contain { hent_he, hent_hash }. this way the hash catches 99% of all comparisons already, and we don't have to chase the external hek ptr, when the hash check fails. every HE entry will then be 2 words (128), instead of one word (64), called AHE. The linked list still contains the old HE*, with { hent_next, hent_hek, hent_val }.
one-word-AHE: As possible improvement on that on 64bit use 48bits for the HE ptr, and 16bits of the hash to be compared first. See https://www.semanticscholar.org/paper/Fast-Dynamically-Sized-Concurrent-Hash-Table-Barnat-Rockai/ab7dd007587f411cf99bfe056639e055eff22e0c/pdf
use robin-hood as this is currently the best worse-case strategy (being super defensive, but not so stupid to use SipHash, which adds no benefit). with better threading support (shared hashes) eventually use leapfrog.
compact ordered hash. use an array of short indices into a compacted array of hash/key/val entries as in PyPy and now python: "compact ordered dict". This saves a lot of space and only add's one indirect lookup into cache-friendly array. See https://github.com/methane/cpython/pull/1 https://mail.python.org/pipermail/python-dev/2016-September/146327.html This also iterates over the hash in insertion order, which effectively hides any attempt to get the seed over the iterators. For attacks you need to get collision and robin-hood reordering timings.

maybe use the glibc htable (hsearch, hcreate), just add random seeds and flood detection (collision counts). (or khash, which is basically the same). coucal looks good also. https://github.com/xroche/coucal but best is probably preshing's http://preshing.com/20160314/leapfrog-probing/, which even can be made concurrent easily.

open addressing is cache friendly and much faster, but table growth is slow. potion uses khash, first double open, then quadratic open.

either way, first I need to abstract away the collision search in MACROS and reduce the needed comparisons from 4 to 1. (edit: after my critic, they improved that part to one ptr cmp) maybe we can normalize all keys to utf8 for faster checks. (probably not) utf8 is a HEK flag. See http://cpansearch.perl.org/src/RURBAN/illguts-0.49/index.html#hek The current implementation is just insanity and it only got worse over the years.

The last attempt was here: https://github.com/rurban/perl/commits/rurban/hash-sortbuckets See the featurex/gh24-*hash* branches. See also GH #102

plan

✓ -Dhash_func, add and test many more funcs
✓ PERTURB_KEYS_TOP
✓ HV_FILL_RATE
✓ d_builtin_ctz
✓ seperate hv_common_magic
✓ security: warn and protect against hash flood
✓ fix NODEFAULT_SHAREKEYS (#201)
✓ hash_loop: abstract loops and array_he
✓ array_he: seperate HvARRAY from collisions entries, add hent_hash
80% new_hash_table: merge HE fields for faster loop, usehekbitfield, d_little_endian.
one-word-AHE (hash+he on 64bit)
40% open_hash, compact_ordered: various strategies with the new abstractions

rurban commented 8 years ago

Merged now very simple, traditional improvements:

default HV_FILL_RATE 90%,
PERTURB_KEYS_TOP move-to-front

no hashtable rewrite yet.

rurban commented 8 years ago

Current problem: corrupt hek with tied hv. ./miniperl -Ilib -MConfig -e'print keys %Config'

The reason was that HVf_SVKEY(-2) was compared against HEK_LEN which is now an unsigned 31 bit value. Fixed with 746a63c91a306bc26713574ca67420bcd279758b

Fixed also an issue with a wrong HekTAINTED check with HeKSVKEY. There the sv is tainted or not, the hek has no flags. This 9e9ec088791b75fcaf5271a2ebb1c412f721ab2c needs to be backported.

rurban commented 8 years ago

With d58d80f40124ed394a72e5f41782c1ca86f26e7e found and fixed an important hsplit bug.

Detected with -DHv diffing and terminal broadcast input to two panes in parallel (iTerm2) in 2 gdb sessions. Very nice debugging this way to compare old vs new.

rurban commented 8 years ago

failing minitests with cperl-5.25.0-165-g0c508ac featurex/gh24-array_he

mpb t/op/local.t 108,229,232,285,294 t/op/lex_assign.t 281-304 t/op/smartmatch.t 134 ~~t/op/stash.t 5~~ (fixed with e3f6ad7161bdc4689b81cee207e5aa37a9653875) t/op/tie_fetch_count.t 75

../do-bench-mini: 2-9% faster

fixed with 868b47c6bb53b375089b52ab7a1c25b933335ffa fix deletion of first_entry

perl11 / cperl

new hash table #24

plan