vstakhov opened this issue 8 years ago
Thank you for benchmarking MUM vs xxhash64. I am confused why the percentage of common shingles is different for MUM and xxhash64.
I see that for len==5 MUM speed in your benchmark program is worse. I've just tried MUM and xxhash64 on my benchmark program for this length and got 14.45s and 19.38s respectively for MUM and xxhash64. So it is hard for me to say why your benchmark program is slower with MUM for len==5 unless I have your benchmark program to play with.
MUM processes input in 8-byte blocks. Smaller input is first memcpy'd (it should be memmove) into an 8-byte block. xxhash64 does not use memcpy (it uses a 4-byte block access and at most 3 single-byte accesses). So the difference might come from the different access patterns for data < 8 bytes.
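The two tail-access strategies described above can be sketched roughly as follows. This is a simplified illustration with invented helper names, not the actual mum.h or xxhash.c code; the byte ordering in both helpers assumes a little-endian host.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* MUM-style tail load as described above: memcpy the short tail
   (len < 8) into a zeroed 8-byte block in one call, which compilers
   usually lower to a few register moves. */
static uint64_t tail_memcpy(const unsigned char *p, size_t len) {
    uint64_t v = 0;
    memcpy(&v, p, len);                  /* copy only the bytes we have */
    return v;
}

/* xxhash-style tail load: one 4-byte block access plus at most 3
   single-byte accesses (memcpy stands in for the raw word access
   to stay within strict-aliasing rules). */
static uint64_t tail_words(const unsigned char *p, size_t len) {
    uint64_t v = 0;
    size_t i = 0;
    if (len >= 4) {
        uint32_t w;
        memcpy(&w, p, 4);
        v = w;
        i = 4;
    }
    for (; i < len; i++)
        v |= (uint64_t)p[i] << (8 * i);  /* byte i -> bits 8i..8i+7 */
    return v;
}
```

On a little-endian machine both loads produce the same 64-bit value; the difference is purely in how many memory accesses the compiler emits for a given `len`.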
In any case, I'll try to improve MUM performance for len < 8 bytes next week.
Again thank you for your feedback.
Well, after that experiment, I tried copying all my data into an intermediate aligned buffer whose length is a multiple of 8 (filling the rest with zeroes). In that case, mumhash was indeed faster (though not as significantly as in your experiments). On the other hand, the intermediate copy killed all the benefits of faster hashing, so I decided to switch functions dynamically depending on conditions (between mumhash, xxhash and metrohash, or just metrohash/mumhash for target-independent hashing). You can see the code here: https://github.com/vstakhov/rspamd/blob/master/src/libcryptobox/cryptobox.c#L1458
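The padded-copy approach mentioned above can be sketched like this. It is a toy illustration under my own names; `hash_u64_blocks` is a placeholder mixer, not mumhash.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Placeholder block mixer standing in for a hash that consumes
   aligned 8-byte blocks (as mumhash does). */
static uint64_t hash_u64_blocks(const uint64_t *p, size_t nblocks) {
    uint64_t h = 0;
    for (size_t i = 0; i < nblocks; i++)
        h = (h ^ p[i]) * 0x9E3779B97F4A7C15ULL;  /* toy multiply-xor */
    return h;
}

/* Copy the key into an aligned, zero-filled buffer whose size is a
   multiple of 8, then hash whole blocks only.  The extra memcpy is
   exactly the overhead that ate the speed gain described above. */
static uint64_t hash_padded(const void *key, size_t len) {
    uint64_t buf[32] = {0};                    /* 256 bytes, zeroed */
    if (len > sizeof(buf))
        len = sizeof(buf);                     /* sketch: cap the input */
    memcpy(buf, key, len);
    return hash_u64_blocks(buf, (len + 7) / 8);
}
```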
I suppose that the difference in your benchmarks might come from inlining issues: what if the xxhash code is not inlined whilst mumhash is? If you'd like to test my case, you could clone the rspamd project, install the dependencies (as described at https://rspamd.com/downloads/ ), run `make rspamd-test`, and then run `test/rspamd-test -p /rspamd/shingles`. Unfortunately, I don't have an isolated test case so far.
I also did some perf measurements of mum-hash compared to xxhash for your bench code. In the mum-hash case the hot spot is the main function, meaning that everything is inlined:
0,04 │50: mov %rcx,%rax
5,45 │ mul %rbx
16,35 │ lea (%rdx,%rax,1),%rsi
5,87 │ xor %r11,%rsi
5,18 │ mov %rsi,%rax
0,01 │ mul %r10
22,25 │ add %rdx,%rax
5,44 │ xor %rsi,%rax
5,51 │ mov %rax,%rcx
│ mul %r9
21,32 │ add %rdx,%rax
6,17 │ xor %rcx,%rax
6,41 │ sub $0x1,%r8d
0,00 │ mov %rax,%rcx
For xxhash the situation is completely different:
51,31% a.out a.out [.] XXH64_digest
17,15% a.out a.out [.] XXH64_update
14,47% a.out libc-2.22.so [.] __memcpy_sse2_unaligned
9,49% a.out a.out [.] XXH64_reset
5,66% a.out a.out [.] main
1,89% a.out a.out [.] memcpy@plt
So I assume that the main difference is that mum-hash is inlined whilst xxhash is not.

In my experiments both functions are NOT inlined. BTW, I used `g++ -DSPEED2 -O3 -w -fpermissive -DMUM -I../ bench.c` for mum-hash and `g++ -DSPEED2 -O3 -w -fpermissive -DxxHash -g -I../ bench.c xxhash.c` for xxhash.
Inlining is important when you process small strings, as the call-overhead code is significant in this case. That is why I put all the code in a header file.
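The header-only point can be illustrated with a minimal sketch (my own example, not MUM code): a `static inline` definition visible in the header lets every caller inline the hash step, whereas an `extern` function in another translation unit cannot be inlined without LTO.

```c
#include <stdint.h>

/* Lives in a header, so every includer can inline it; for a 5-byte
   key the whole hash can then collapse into straight-line code with
   no call/return overhead. */
static inline uint64_t fast_mix(uint64_t x) {
    x *= 0xFF51AFD7ED558CCDULL;  /* multiply-xor step in the spirit of MUM */
    x ^= x >> 33;
    return x;
}
```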
It was stupid of me not to optimize the tail processing (less than 8 bytes). I've modified the code to minimize branches. The benchmark speed for 5-byte strings improved by 2x. I also added a test for 5-byte strings and put its results in the README.md file.
So you can try the new version.
Well, I completely agree that inlining is important, since a function call has its own overhead even when the frame pointer is omitted. However, it is not always possible, especially when you use some third-party library that just accepts a hash function.
Nevertheless, my point was that your comparison was a bit unfair to xxhash: you literally compare inlined mum-hash with non-inlined xxhash. I have fixed this issue and will send a pull request a bit later.
I'd like to thank you for the quick fix: mumhash now seems to be much better on arbitrary inputs. I'll continue investigating in the meantime. In fact, I'm thinking about replacing xxhash with mumhash in my libucl library; I suppose I can use the inlined version there.
By the way, after this change I get many warnings like these:
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:223:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint32_t *) str << 32;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:228:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint32_t *) str << 32;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:232:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint32_t *) str << 32;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:236:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint32_t *) str << 32;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:239:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint16_t *) str << 48;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:240:19: warning: shift count >= width of type [-Wshift-count-overflow]
u64 |= str[2] << 40;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:243:29: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *(uint16_t *) str << 48;
^ ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:246:16: warning: shift count >= width of type [-Wshift-count-overflow]
u64 = *str << 56;
Was it intended or not?
Hum, and hashing results are now different as well.
Was it intended or not?
No, it was not intended; it is my mistake. Strange that my GCC did not give such a warning. I'll fix it soon.
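For context, the warnings arise because `*(uint32_t *) str` has type `uint32_t`, and shifting a 32-bit value by 32 or more bits is undefined behaviour in C; the operand has to be widened to 64 bits before the shift. A minimal sketch of that kind of fix (my paraphrase, not the actual mum.h patch; the tested value assumes a little-endian host):

```c
#include <stdint.h>
#include <string.h>

/* Load 4 bytes and place them in the upper half of a 64-bit word.
   The cast to uint64_t BEFORE the shift is what makes `<< 32` well
   defined; memcpy also avoids the unaligned/aliasing pointer cast. */
static uint64_t load32_high(const unsigned char *str) {
    uint32_t lo;
    memcpy(&lo, str, sizeof lo);
    return (uint64_t)lo << 32;   /* promote first, then shift */
}
```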
> Hum, and hashing results are now different as well.
It does not matter for hash tables but as I am providing MUM_TARGET_INDEPENDENT_HASH I should process the tail in a consistent way. So I am going to fix this too.
> It does not matter for hash tables but as I am providing MUM_TARGET_INDEPENDENT_HASH I should process the tail in a consistent way.
I define this macro before including `mum.h`, so if you can make it really independent and consistent, that will be much appreciated :) This code is not yet in production, so I can still change the hashes now; afterwards, changing the hash format would be extremely painful.
> Nevertheless, my point was that your comparison was a bit unfair to xxhash: you literally compare inlined mum-hash with non-inlined xxhash. I have fixed this issue and will send a pull request a bit later.
Thank you, Vsevolod. I'll incorporate most of your changes into the repository if you don't mind. Of course, the ChangeLog entry will have your name.
I don't like the idea of changing the xxHash sources for the benchmark. If the xxHash author wants better results, he should use the corresponding interface. If you want a fair comparison, I think it is better to use GCC/LLVM LTO (which enables cross-file inlining). I'll try to implement it.
As for LLVM: as usual, it pretends to be GCC without providing the full GCC functionality. Using LLVM will hurt MUM performance, as it does not provide function multiversioning, but I'll include your LLVM-related changes too.
Thank you again for the patch.
By the way, the haswell target is only defined for gcc >= 4.9 as far as I can see. Hence, I've modified the avx2 guard to:
#if defined(__x86_64__) && defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9)) && !defined(__clang__)
Anyway, I prefer runtime dispatching to compile-time dispatching: compile the C code to assembly with some good compiler and select the exact version of the code at runtime. Of course, this also kills inlining as a side effect.
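The runtime-dispatch pattern can be sketched as follows. This is a stand-in illustration: the two variants and the placeholder mixer are mine, and `__builtin_cpu_supports` is the GCC >= 4.8 built-in, not necessarily what rspamd or MUM uses.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t (*hash_fn)(const void *key, size_t len, uint64_t seed);

/* Portable fallback; FNV-style placeholder, not a real candidate hash. */
static uint64_t hash_generic(const void *key, size_t len, uint64_t seed) {
    const unsigned char *p = key;
    uint64_t h = seed ^ 0xCBF29CE484222325ULL;
    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * 0x100000001B3ULL;
    return h;
}

/* Would hold the AVX2-tuned path; delegates in this sketch. */
static uint64_t hash_avx2(const void *key, size_t len, uint64_t seed) {
    return hash_generic(key, len, seed);
}

/* Decide once at startup which implementation to call through a
   function pointer; the price is that calls through the pointer
   can no longer be inlined, as noted above. */
static hash_fn select_hash(void) {
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__)
    if (__builtin_cpu_supports("avx2"))
        return hash_avx2;
#endif
    return hash_generic;
}
```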
> By the way, haswell target is only defined for gcc >= 4.9 as far as I see.
Thanks, Vsevolod. I modified the code in mum.h/mum512.h/mum-prng.h to deal with that. I've tested it on gcc 4.4, gcc 4.8, and gcc 4.9, and also on clang 3.5.
We could use -mavx2, as it was introduced in GCC 4.8, but the current code is better tuned for modern processors.
Hello, I've been interested in testing your benchmark program, but I am surprised: it seems to measure xxHash differently from the other algorithms. It invokes the heavier streaming API, requiring multiple calls:
static XXH64_state_t* state = NULL;

static void xxHash64_test(const void *key, int len, uint32_t seed, void *out) {
    if (!state) state = XXH64_createState();
    XXH64_reset(state, seed);
    XXH64_update(state, key, len);
    *(uint64_t*)out = XXH64_digest(state);
}
It could have been much simpler:

static void xxHash64_test(const void *key, int len, uint32_t seed, void *out) {
    *(uint64_t*)out = XXH64(key, len, seed);
}
This is surprising because "the short version" is the one used for all the other algorithms:

    *(uint64*)out = CityHash64WithSeed((const char *)key, len, seed);
    siphash(out, (const uint8_t *) key, len, (const uint8_t *) s);
    MetroHash64::Hash((const uint8_t *)key, len, (uint8_t *) out, seed);
    *(uint64_t *)out = mum_hash(key, len, seed);
The streaming API is meant for very long content; it carries the additional burden of state maintenance. Using it for short keys is overkill and just reduces performance.
Anyway, after fixing this difference, the results change quite a bit. Similar to @vstakhov, I find xxHash to be 2x-3x faster on short keys.
It seems I overlooked that xxHash has a simpler interface. Thank you for pointing this out.
I've modified the code and updated xxHash speed numbers in README.md file for x86-64 and ppc64 (I have no aarch64 machine available now therefore I did not change the numbers for this target).
Thanks. This seems in line with observations.
> I have no aarch64 machine available now therefore I did not change the numbers for this target
A quick word explaining why the performance is so bad on this specific target would be cleaner information (than just publishing wrong numbers).
JFYI, @bapt has tested mumhash in https://github.com/vstakhov/libucl on ARMv6 with FreeBSD, and he found that it is still faster than xxhash32. So I can say that mumhash is now the official hash used in the FreeBSD pkg tool :)
> A quick word explaining why the performance is so bad on this specific target would be cleaner information (than just publishing wrong numbers).
You are right. I ran the benchmark again on aarch64 and updated the results. As it is a slightly different X-Gene board/OS (a slower one), I updated the results for all the hash functions on aarch64.
Hello, I'd like to thank you for your hard work - I really like your algorithm.
However, I have a question: I tried mum-hash to calculate shingles, which implies a lot of hash calculations for individual words of some arbitrary text. It turns out that mum-hash performs quite badly in these conditions: it is slower than xxhash64:
mum-hash:
xxhash64:
As you can see, for inputs that are not 'good' for the hash, performance is reduced significantly. I think this is worth mentioning in the documentation.