paulie-g opened this issue 5 years ago

I plugged your v2 code into demerphq/smhasher and, unlike v1, it's failing most of its test suite. Have you tested it with that smhasher fork by any chance? It doesn't make sense to me that this would happen to the extent it is, so it may be something to do with the integration; I'll check that next unless you have an idea of what's going on.

It's a bit late, so my apologies if I'm not making any sense. That particular fork of smhasher uses a different statistical technique for its avalanche tests (compared to the appleby and rurban versions), which discovers more flaws, and it also shows you graphically what those flaws look like. Mum v1 used to pass all of the avalanche tests; v2 does not. I've just tested adding back the second final mixing stage with the second finish prime, which you ifdef'ed out for v2, and the tests pass again, although it wipes out most of the performance improvement for len < 32 and all of it for len > 32.
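Concretely, the restored stage looks roughly like this (a sketch using the mum.h names - two dependent _mum() rounds xor-mixed into the state, which is the v1 finalization I describe further down; the exact upstream code may differ):

static inline uint64_t
_mum_final (uint64_t h) {
  /* v1-style finalization: two dependent _mum() rounds, each
     xor-mixed into the state.  Sketch only. */
  h ^= _mum(h, _mum_finish_prime1);
  h ^= _mum(h, _mum_finish_prime2);
  return h;
}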
Thank you for the info. I used appleby smhasher, the same version as for the first version of mum. In fact, I spent a lot of time running smhasher when I was trying to speed up the first mum-hash version.
I'll try demerphq/smhasher and will work on a new version that is faster than the first mum-hash version but still passes demerphq/smhasher.
Ah, that explains it. The appleby version is far behind the times, unfortunately, and there are lots of hashes that pass it but still have non-negligible flaws. People essentially over-optimize against the appleby/rurban smhasher versions for performance by finding improvements which introduce flaws those versions can't catch. Yves' (demerphq) smhasher fork is the most thorough standard currently available.
I've spent the last day tweaking mum v2 and managed to produce two tweaked versions that pass all avalanche tests. One keeps 90% of the performance gains of v2 at the cost of narrowly failing two collision tests on sparse keys (5 collisions above the expected 0 for keys of up to 16 and 20 bytes with only two bytes set - a failure I don't consider consequential for most use cases). The other keeps about 50% of the performance improvement and passes everything. I'm not submitting them as a pull request because I'm testing on Skylake-X now and all of the optimizations are likely heavily dependent on its various properties.
Thanks a lot for your work on mum, by the way - I've enjoyed playing with it immensely. If you're curious, with my opts it's the fastest thing on Skylake-X as measured by demerphq/smhasher, followed closely by Fanom64, although I expect a real-world test with variable-length keys messing with branch prediction to present surprises ;)
Out of curiosity, how do you generate your 'primes'? Is it by eye, by trial and error, or algorithmically?
> I've spent the last day tweaking mum v2 and managed to produce two tweaked versions that pass all avalanche tests. One keeps 90% of the performance gains of v2 at the cost of narrowly failing two collision tests on sparse keys (5 collisions above the expected 0 for keys of up to 16 and 20 bytes with only two bytes set - a failure I don't consider consequential for most use cases). The other keeps about 50% of the performance improvement and passes everything. I'm not submitting them as a pull request because I'm testing on Skylake-X now and all of the optimizations are likely heavily dependent on its various properties.
Making a hash function faster is a lot of fun. It would be interesting to see your Skylake-X optimizations. All my attempts to vectorize mum-hash about 2 years ago failed.
> Out of curiosity, how do you generate your 'primes'? Is it by eye, by trial and error, or algorithmically?
They are sequential PRNs generated by the glibc PRNG with some seed. It would probably be better to use a crypto-level PRNG, but that would complicate mum-hash too much.
Actually, to make collision attacks much harder, you can use mum_hash_randomize, which generates different PRNs for mum-hash. But using mum_hash_randomize makes mum-hash slower because it prevents the constant-propagation optimization.
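A minimal sketch of the idea (illustration only - glibc's random() yields 31 bits per call, so several calls are combined per constant; the actual seed and construction behind the constants in mum.h may differ):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustration only: build 64-bit constants from consecutive glibc
   random() calls.  Not necessarily how the mum.h constants were made. */
static uint64_t
next_prn (void) {
  return ((uint64_t) random () << 33)
         ^ ((uint64_t) random () << 16)
         ^ (uint64_t) random ();
}

int main (void) {
  srandom (24);  /* hypothetical seed */
  for (int i = 0; i < 4; i++)
    printf ("0x%016llxULL\n", (unsigned long long) next_prn ());
  return 0;
}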
> Making a hash function faster is a lot of fun. It would be interesting to see your Skylake-X optimizations. All my attempts to vectorize mum-hash about 2 years ago failed.
I'm not surprised - as I'm sure you know, vectorization is useful when you don't have dependency chains and can do a large enough amount of work to justify the set-up costs. There are lots of interesting and clever ways to use AVX2, AVX512 and AES-NI for hashing, but you'd have to design an algo for that specifically and it would still likely only be fast for rather long keys. Starting with mum, any successful use of SIMD would result in an algo that has nothing in common with what you started with :)
I think you might be a bit disappointed - my opts are mostly not me being particularly clever; it's more that I'm giving the compiler more information, allowing it to schedule insns optimally and avoid port contention. Compared to Ivy Bridge, which I had before, and even Haswell, Skylake-X has lots of interesting features under the hood that give the compiler a lot of scope for doing that (see the Skylake-X/SP uarch documentation if you're interested in the details). Beyond that, making dependency chains shorter while doing comparable work improves insn parallelism.
Everything I've done is in _mum_final(). In mum v1, you do two dependent rounds of _mum() + xor mix into the state. In mum v2, you replace that with a cyclical shift + xor mix-in and a dependent _mum() + xor mix-in. We know that the latter fails avalanche quite badly, so we clearly need another round of _mum(). However, we don't want to pay full price for it. The solution is to split the long dependency chain in two:
static inline uint64_t
_mum_final (uint64_t h) {
  uint64_t v1, v2;

  v1 = _mum(h, _mum_finish_prime1);
  v2 = _mum(h, _mum_finish_prime2);
  return h ^ _mum(v2, v1);
}
This still does a lot of work, but the two MULX ops are no longer dependent on each other, so we pay only ~1 cycle (throughput) for the second one rather than its full 4-cycle latency.
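(For reference, _mum here is the usual 64x64->128 multiply with the high and low halves of the product folded together - the same fold the explicit version below spells out. A minimal sketch of the __int128 path, which may differ in detail from the one in mum.h:)

static inline uint64_t
_mum (uint64_t v, uint64_t c) {
  __uint128_t r = (__uint128_t) v * (__uint128_t) c;

  /* Fold the 128-bit product back down to 64 bits.  */
  return (uint64_t) (r >> 64) + (uint64_t) r;
}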
It then gets a bit quicker if you do:
return h + _mum(v2, v1);
At this point, we're at ~18.xx cycles/hash for keylen < 32 in smhasher vs ~17.xx for mum v2. We no longer fail avalanche tests, but we still fail a couple of TwoBytes sparse key tests. On their own, I don't think they matter for most use cases, but I believe they might indicate this still isn't enough work. So we go to:
static inline uint64_t
_mum_final (uint64_t h) {
  __uint128_t v1, v2, v3, v4;

  /* Three multiplications, none dependent on another.  */
  v1 = (__uint128_t) h * (__uint128_t) _mum_finish_prime1;
  v2 = (__uint128_t) h * (__uint128_t) _mum_finish_prime2;
  v4 = (__uint128_t) h * (__uint128_t) _mum_tail_prime;
  /* Fold the first two products to 64 bits before the final multiply.  */
  v1 = (uint64_t) v1 + (uint64_t) (v1 >> 64);
  v2 = (uint64_t) v2 + (uint64_t) (v2 >> 64);
  v3 = (v1 * v2) + v4;  /* v4 is mixed in before the last fold.  */
  return (uint64_t) v3 + (uint64_t) (v3 >> 64);
}
Two things are happening here. Firstly, we've added another multiplication, but it's not dependent on anything else, so we pay at most 1 cycle for it, and it should overlap nicely with the hi + lo folds. On Skylake-X they don't compete for ports, whereas on Ivy Bridge they might have (don't quote me on that ;). Just adding this mul, without folding it separately, appears to be just enough additional work, and now everything passes.

Secondly, we're now spelling out the 128-bit arithmetic explicitly rather than calling the _mum() inline. I know the GCC docs enthuse about how static inlines aren't slower than macros, but in this case the explicit version gets us an extra 1 cycle/hash. I haven't investigated why; I suspect that GCC (8.2.1 here) does a better job with either insn scheduling or register allocation when it has all the info up front (again, I might be wrong since I haven't examined the asm output side by side), but the effect is there. Clang might not have this problem because it has less rigid optimization pass sequencing, but I haven't tested that. An additional factor might be that we're not reusing variables outside their dependency chain, which gives the compiler more info (as usual by now, I haven't confirmed this in this particular case, but it's good practice in general). All in all, this gets us to ~19.xx cycles/hash for keylen < 32, with no length taking more than 22 cycles.
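For comparison, the inline-call spelling I measured against looks roughly like this (a hypothetical reconstruction - note it isn't bit-identical to the explicit version above, because _mum() folds each product to 64 bits, so the tail-prime term here is folded before the final add rather than after):

static inline uint64_t
_mum_final_calls (uint64_t h) {
  uint64_t v1 = _mum(h, _mum_finish_prime1);
  uint64_t v2 = _mum(h, _mum_finish_prime2);
  uint64_t v4 = _mum(h, _mum_tail_prime);  /* folded, unlike in the explicit version */

  return _mum(v1, v2) + v4;
}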
Now I hope you can see why I didn't suggest including these changes. There's every chance they'll make things worse on a lot of consumer machines people use, they definitely clobber all pretense of portability and would ruin performance on machines/compilers without mulx/__int128_t. Most of this can be addressed in one way or another, but I haven't done it yet :)
Like I said, hardly earth-shattering or particularly clever. My intuition says that we're getting very close to the bottom threshold for the number of cycles it can possibly take to do enough work, so the returns from further tweaking will diminish pretty fast. I expect branch mispredictions in real-world use with variable-length keys to dominate any performance gains made beyond this, so I'll be putting together a bench to simulate that and tweaking to improve it.
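Something like this is what I have in mind for the bench (a rough sketch - N, MAX_LEN and the seed are arbitrary choices, and it assumes the mum_hash(key, len, seed) entry point from the repo's mum.h):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "mum.h"  /* assumed: mum_hash(key, len, seed) from the mum-hash repo */

#define N 1000000
#define MAX_LEN 64

int main (void) {
  static char buf[MAX_LEN];  /* contents don't matter for timing */
  static size_t lens[N];
  struct timespec start, end;
  uint64_t acc = 0;

  srand (1);
  /* Random lengths defeat the branch predictor in the length dispatch,
     unlike smhasher's fixed-length inner loops. */
  for (size_t i = 0; i < N; i++)
    lens[i] = (size_t) (rand () % MAX_LEN) + 1;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (size_t i = 0; i < N; i++)
    acc ^= mum_hash (buf, lens[i], 42);  /* acc keeps the calls from being optimized away */
  clock_gettime (CLOCK_MONOTONIC, &end);
  double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
  printf ("%.1f ns/hash (acc %llx)\n", ns / N, (unsigned long long) acc);
  return 0;
}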
However, there's still lots of interesting scope for improvement. Doing a whole-body optimization to improve insn parallelism and scheduling might do something. So might testing other values for _mum_finish_prime{1,2} - maybe, with the right values, we can replace the third multiplication and its mix-in via addition with something shorter. I seem to recall finding a permutation that only failed avalanche on bits 62 and 63, but I couldn't find a way to address that which beat the performance of the nearly-free third independent multiplication.

There are also interesting things to try by doing more work in the loop rather than in the finalization stage, as counter-intuitive as that might seem. The primes[i] and _mum_unroll_prime _mum() rounds also look like they have some juice left in them, provided they don't turn out to be ~free when I actually test it ;). I certainly look forward to trying it all out, although I'm playing with this stuff as a distraction when I procrastinate, so I'm not sure when that'll happen exactly.
> Out of curiosity, how do you generate your 'primes'? Is it by eye, by trial and error, or algorithmically?

> They are sequential PRNs generated by the glibc PRNG with some seed. It would probably be better to use a crypto-level PRNG, but that would complicate mum-hash too much.
The reason I ask is that I'm very curious to know whether you've seen much variability in avalanche tests depending on your choice of 'primes'. I know this choice is important to Metro/Fanom/etc-like hash funcs, but is it as important here? In other words, what are the chances of finding a set of finalization primes that would let us do less work in _mum_final()? I'm not keen on doing this by hand, and I don't want to put in the work to do it algorithmically if there's no good reason to expect it to work.
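If I do end up doing it algorithmically, the cheap first-pass filter would look something like this (a hypothetical harness, not anything I've run - it estimates the worst per-bit avalanche bias of candidate constants in the h + _mum(v2, v1) finalizer shape from above; real smhasher avalanche testing is far more thorough, so survivors would still need the full run):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t
mum_step (uint64_t v, uint64_t c) {
  __uint128_t r = (__uint128_t) v * (__uint128_t) c;
  return (uint64_t) (r >> 64) + (uint64_t) r;
}

/* The h + _mum(v2, v1) finalizer shape discussed earlier.  */
static uint64_t
finalize (uint64_t h, uint64_t p1, uint64_t p2) {
  return h + mum_step (mum_step (h, p2), mum_step (h, p1));
}

static uint64_t
rand64 (void) {
  return ((uint64_t) rand () << 33) ^ ((uint64_t) rand () << 11)
         ^ (uint64_t) rand ();
}

/* Worst |P(output bit j flips | input bit i flips) - 0.5| over all (i, j).  */
static double
worst_bias (uint64_t p1, uint64_t p2, int samples) {
  static int flips[64][64];
  double worst = 0.0;

  memset (flips, 0, sizeof flips);
  for (int s = 0; s < samples; s++) {
    uint64_t x = rand64 ();
    uint64_t base = finalize (x, p1, p2);
    for (int i = 0; i < 64; i++) {
      uint64_t d = base ^ finalize (x ^ (1ULL << i), p1, p2);
      for (int j = 0; j < 64; j++)
        flips[i][j] += (d >> j) & 1;
    }
  }
  for (int i = 0; i < 64; i++)
    for (int j = 0; j < 64; j++) {
      double b = (double) flips[i][j] / samples - 0.5;
      if (b < 0.0) b = -b;
      if (b > worst) worst = b;
    }
  return worst;  /* 0.0 is perfect avalanche, 0.5 is total failure */
}

int main (void) {
  uint64_t best_p1 = 0, best_p2 = 0;
  double best = 1.0;

  srand (1);
  for (int t = 0; t < 256; t++) {
    uint64_t p1 = rand64 () | 1, p2 = rand64 () | 1;
    double b = worst_bias (p1, p2, 1024);
    if (b < best) {
      best = b; best_p1 = p1; best_p2 = p2;
    }
  }
  printf ("best bias %.4f with 0x%016llx 0x%016llx\n", best,
          (unsigned long long) best_p1, (unsigned long long) best_p2);
  return 0;
}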
> Making a hash function faster is a lot of fun.
...and yes, I completely agree with you there. Although, in practical terms, mum, fanom and t1ha2_atonce are already 'good enough'. Improving them further is the definition of what we used to call 'доить/дрочить енота' ('milking the raccoon' - obsessively polishing something past the point of usefulness) in the fidonet era ;)