perl11 / cperl

A perl5 with classes, types, compilable, company friendly, security
http://perl11.org/

small hash #102

Open rurban opened 8 years ago

rurban commented 8 years ago

Optimize hashes with <= 3-5 keys to a simple array of keys and values with linear lookup.

HvSMALL(hv) / XHvSMALL(xhv) either checks HvMAX < 7, or a flag. If a flag, the very first HE* entry needs to be a non-ptr tag (& 0x1). We'd need a flag with inlined HEs and overlong keys, to omit the HvSMALL optimizations for such long keys. We cannot use the hv_aux based HvFLAGS with normal HvSMALL hashes, esp. when inlined.
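
A rough sketch of the two variants described above (the macro names and exact encoding here are hypothetical; the real cperl definitions may differ):

```c
/* Sketch only: "small" decided by size vs. by a pointer tag.
 * HvMAX(hv) is the highest bucket index, so HvMAX(hv) < 7 means
 * at most 7 slots are allocated. */
#define HvSMALL_BY_SIZE(hv)   (HvMAX(hv) < 7)

/* Flag variant: aligned HE* pointers always have bit 0 clear, so a
 * set low bit in the very first entry can serve as a non-ptr tag
 * marking the hash as small. */
#define HvSMALL_BY_TAG(he0)   (((UV)(he0)) & 0x1)
```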

The best would be an HE-array-like inlined len/char/flags/val array, to be cache conscious (as in #24 feature/gh24-he-array). The len really should be run-length encoded, and then the flags needed for the hash comparison need to come first. However, at first we start with simple HE* arrays (an array of pointers, not values). The last array element needs to be a NULL sentinel, so we cannot use all 7 HEs, only 6.
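
A minimal sketch of that first step, a linear scan over an array of HE pointers terminated by a NULL sentinel (simplified standalone struct, not the real cperl HE/HEK layout):

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for HE; the real entry holds a HEK* with
 * precomputed hash, length and flags. */
typedef struct small_he {
    const char *key;
    size_t      klen;
    void       *val;
} small_he;

/* Linear lookup over an inlined array of up to 6 HE pointers; the
 * 7th slot stays NULL as the end-of-array sentinel. */
static void *small_hv_fetch(small_he *const *hep,
                            const char *key, size_t klen)
{
    for (; *hep != NULL; hep++)
        if ((*hep)->klen == klen && memcmp((*hep)->key, key, klen) == 0)
            return (*hep)->val;
    return NULL;
}
```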

But there are many more simple hash optimizations, which we will do first.

wollmers commented 8 years ago

I estimate you can use linear or serial (unsorted) lookup up to 100 keys or even more, depending on benchmarks.

In my port of LCS::BV (https://github.com/wollmers/LCS-BV/tree/master/lib/LCS) from Perl to C I began with the Bob Jenkins hash and ended the tuning using VLAs (variable length arrays) on the stack, with the array filled serially (\0 terminated). See llcs_seq_a() (https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L105) and the hash_setpos() and hash_getpos() it uses. With the Bob Jenkins hash I get 250 kHz (cases per second) on an i5@1500; with serial VLAs 7.5 MHz, thus a factor of 30x. The calloc variant llcs_seq() (https://github.com/wollmers/c-lcs-bv/blob/master/lcstest.c#L144) comes in at 4 MHz.

Of course in my example I can benefit from the known restrictions: maximum size, immutable key strings, typed values (uint64_t).
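
A simplified illustration of the serial-VLA idea (hypothetical set_pos/get_pos with single-char keys, not the actual c-lcs-bv code):

```c
#include <stdint.h>
#include <stddef.h>

/* One table slot: key byte plus typed value.  Keys must not be '\0',
 * since '\0' marks the end of the filled part of the table. */
typedef struct { char key; uint64_t val; } entry_t;

/* Insert or update by scanning to the key or to the '\0' sentinel. */
static void set_pos(entry_t *tab, char key, uint64_t val)
{
    size_t i = 0;
    while (tab[i].key != '\0' && tab[i].key != key)
        i++;
    if (tab[i].key == '\0')      /* new key: move the sentinel up */
        tab[i + 1].key = '\0';
    tab[i].key = key;
    tab[i].val = val;
}

/* Lookup: scan until the key or the sentinel is hit. */
static int get_pos(const entry_t *tab, char key, uint64_t *val)
{
    for (size_t i = 0; tab[i].key != '\0'; i++)
        if (tab[i].key == key) { *val = tab[i].val; return 1; }
    return 0;
}

/* Caller: entry_t tab[max_keys + 1]; tab[0].key = '\0';  (VLA on the stack) */
```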

rurban commented 8 years ago

So many? I thought I only wanted to fill one cache line, so just very few keys. But I'll benchmark it soon, when I have more time. Other languages tested 3-5, if I remember correctly.

wollmers commented 8 years ago

You should trust only numbers you benchmarked yourself;-)

Hash lookup is said to have complexity O(1). But as always it is really O(1·k), where k is the implementation factor.

Serial lookup has O((n/2)·k). A break-even point of n=4 between hash and serial would need k_hash = 2·k_serial, i.e. the hash algorithm may execute only twice the number of instructions of one iteration of the serial loop. My serial loop has 3 instructions (C operators) per iteration, including the conditions. So for a break-even at n=4, the hash function (locating the entry in the array) could only use 6 instructions.
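
Spelled out with the numbers above (the cost notation and the 30-instruction figure below are just illustrative):

$$
\mathrm{cost}_{\mathrm{serial}}(n) \approx \tfrac{n}{2}\,k_{\mathrm{serial}},\qquad
\mathrm{cost}_{\mathrm{hash}} \approx k_{\mathrm{hash}},\qquad
n^{*} = \frac{2\,k_{\mathrm{hash}}}{k_{\mathrm{serial}}}
$$

With $k_{\mathrm{serial}} = 3$, a break-even at $n^{*} = 4$ indeed requires $k_{\mathrm{hash}} = 6$; a hash costing, say, 30 instructions would only break even around $n^{*} = 20$.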

I didn't optimize for cache friendliness directly. Serial lookup just maps a nearly unbounded (sparse) alphabet to a minimal one (non-sparse) and keeps roughly the order of filling, which is memory and cache friendly. Hash algorithms (if not perfect hashes) map sparse to less sparse, but still sparse.

rurban commented 8 years ago

I went with 7 because this is the initial calloc'ed size. But it doesn't work yet, so I cannot benchmark it.

rurban commented 8 years ago

Merged the first 3 parts (ctz, hv_common_magical, pre-extend). In the end it was much faster than expected: on my Linux gcc-6.1/i5-2300 it was 8-14% faster in perlbench-run, on my Darwin gcc-6-lto/i7-4650U 6% faster.

Having the magical code separated and abstracted away will also help with future hash rewrites.