Performance benchmarks with strings that are not multiple of 8 bytes

vstakhov commented 8 years ago

Hello, I'd like to thank you for your hard work - I really like your algorithm.

However, I have a sort of question: I tried mum-hash to calculate shingles which implies a lot of hash calculations for individual words in some arbitrary text. It turns out, that mum-hash performs quite bad in these conditions: it is slower than xxhash64:

mum-hash:

(50000 words of 5 len, 0.02 perm factor): percentage of common shingles: 0.968, generate time: 0.0533 sec
(50000 words of 16 len, 0.02 perm factor): percentage of common shingles: 0.937, generate time: 0.0351 sec

xxhash64:

(50000 words of 5 len, 0.02 perm factor): percentage of common shingles: 0.781, generate time: 0.0330 sec
(50000 words of 16 len, 0.02 perm factor): percentage of common shingles: 0.906, generate time: 0.0361 sec

As you can see, for inputs that are not 'good' for hash the performance is reduced significantly. I think this is worth to mention in the documentation.

vnmakarov commented 8 years ago

Thank you for for benchmarking MUM vs xxhash64. I am confused why the percent of common shingles are different for MUM and xxhash64.

I see that for len==5 MUM speed on your benchmark program is worse. I've just tried MUM and xxhash64 on my benchmark program for this length and I got 14.45s and 19.38s correspondingly for MUM and xxhash64. So it is hard for me to say why your benchmark program is slower with MUM for len==5 unless I have your benchmark program to play with.

MUM works with input by 8-byte blocks. The smaller input is memcopied (it should be memmove) into 8-byte block first. xxhash64 does not use mempcy (it use 4-byte block access and at most 3 access by 1 byte). So the difference might be because of different access for data < 8 bytes.

In any case I'll try to improve MUM performance for len < 8 byte on the next week.

Again thank you for your feedback.

vstakhov commented 8 years ago

Well, after that experiment, I've tried to copy all my data to an intermediate aligned buffer which length is multiple of 8 (filling the rest with zeroes). In that case, mumhash was faster indeed (not as significantly as in your experiments however). On the other hand, intermediate copying killed all benefits from faster hashing in that case so I've decided to dynamically switch functions depending on conditions (between mumhash, xxhash and metrohash or just metrohash/mumhash for target independent hashing). You could see the code here: https://github.com/vstakhov/rspamd/blob/master/src/libcryptobox/cryptobox.c#L1458

I suppose, that a difference in your benchmarks might come from inlining issues: what if xxhash code is not inlined whilst mumhash is? If you'd like to test my case, you could clone rspamd project, install dependencies (as described at https://rspamd.com/downloads/ ) and run make rspamd-test followed by test/rspamd-test -p /rspamd/shingles. Unfortunately, I don't have any isolated test case so far.

I also did some perf measurements of mum-hash comparing to xxhash for your bench code. For mum-hash case the hot place is main function meaning that all is inlined:

  0,04 │50:   mov    %rcx,%rax
  5,45 │      mul    %rbx
 16,35 │      lea    (%rdx,%rax,1),%rsi
  5,87 │      xor    %r11,%rsi
  5,18 │      mov    %rsi,%rax
  0,01 │      mul    %r10
 22,25 │      add    %rdx,%rax
  5,44 │      xor    %rsi,%rax
  5,51 │      mov    %rax,%rcx
       │      mul    %r9
 21,32 │      add    %rdx,%rax
  6,17 │      xor    %rcx,%rax
  6,41 │      sub    $0x1,%r8d
  0,00 │      mov    %rax,%rcx

For xxhash the situation is completely different:

  51,31%  a.out    a.out             [.] XXH64_digest
  17,15%  a.out    a.out             [.] XXH64_update
  14,47%  a.out    libc-2.22.so      [.] __memcpy_sse2_unaligned
   9,49%  a.out    a.out             [.] XXH64_reset
   5,66%  a.out    a.out             [.] main
   1,89%  a.out    a.out             [.] memcpy@plt

So I assume that the main difference is that mum-hash is inlined whilst xxhash is not. In my experiments both functions are NOT inlined.

vstakhov commented 8 years ago

I used g++ -DSPEED2 -O3 -w -fpermissive -DMUM -I../ bench.c for mum-hash and g++ -DSPEED2 -O3 -w -fpermissive -DxxHash -g -I../ bench.c xxhash.c for xxhash BTW.

vnmakarov commented 8 years ago

Inlining is important when you process small strings as call overhead code is significant in this case. That is why I put all code in a header file.

It was stupid of me not to optimize the tail processing (less 8 bytes). I've modified the code minimizing branches. The benchmark speed for 5-byte strings was improved 2 times. I also add the test for 5-bytes and put the results for it in README.md file.

So you can try the new version.

vstakhov commented 8 years ago

Well, I completely agree that inlining is important since function call has its own overhead even when omitting frame pointer. However, it is not always possible, especially when you use some 3rd party library which just accepts a hash function.

Nevertheless, my point was that your comparison was a bit unfair to xxhash: you literally compare inlined mum-hash with non-inlined xxhash. I have fixed this issue and will send a pull request a bit later.

I'd like to thank you for a quick fix: mumhash now seems to be much better on arbitrary inputs. I'll continue investigations in the meantime. In fact, I think about replacing xxhash with mumhash in my libucl library - I can use inlined version there I suppose.

vstakhov commented 8 years ago

By the way, after this change I have many warnings like this one:

/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:223:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint32_t *) str << 32;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:228:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint32_t *) str << 32;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:232:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint32_t *) str << 32;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:236:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint32_t *) str << 32;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:239:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint16_t *) str << 48;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:240:19: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 |= str[2] << 40;
                  ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:243:29: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *(uint16_t *) str << 48;
                            ^  ~~
/Users/vstakhov/rspamd/src/libcryptobox/../../contrib/mumhash/mum.h:246:16: warning: shift count >= width of type [-Wshift-count-overflow]
    u64 = *str << 56;

Was it intended or not?

vstakhov commented 8 years ago

Hum, and hashing results are now different as well.

vnmakarov commented 8 years ago

Was it intended or not?

No, it was not intended. It is my mistake. Strange that my GCC did not give such warning. I'll fix it soon.

vnmakarov commented 8 years ago

Hum, and hashing results are now different as well.

It does not matter for hash tables but as I am providing MUM_TARGET_INDEPENDENT_HASH I should process the tail in a consistent way. So I am going to fix this too.

vstakhov commented 8 years ago

It does not matter for hash tables but as I am providing MUM_TARGET_INDEPENDENT_HASH I should process the tail in a consistent way.

I define this macro before including mum.h, so if you can make it really independent and consitent that will be much appreciated by me :) This code is not yet in production, so I can add changes to hashes now. Afterwards, hashes format changing would be extremely painful.

vnmakarov commented 8 years ago

Nevertheless, my point was that your comparison was a bit unfair to xxhash: you literally compare inlined mum-hash with non-inlined xxhash. I have fixed this issue and will send a pull request a bit later.

Thank you, Vsevolod. I'll incorporate most of your changes into the repository if you don't mind. Of course, ChangeLog entry will have your name.

I don't like an idea of changing xxHash sources for the benchmark. If xxHash author want better results, he should use the corresponding interface. If you want a fare comparison, i think it is better to use GCC/LLVM LTO optimizations (it will make cross file inlining). I'll try to implement it.

As for LLVM, as usually it pretends to be GCC without providing full GCC functionality. Using LLVM will hurt mum performance as it is not providing function mutiversioning but I'll include your LLVM related changes too.

Thank you again for the patch.

vstakhov commented 8 years ago

By the way, haswell target is only defined for gcc >= 4.9 as far as I see. Hence, I've modified the avx2 guard to:

#if defined(__x86_64__) && defined(__GNUC__) && (__GNUC__ >= 4) &&  (__GNUC_MINOR__ >= 9) && !defined(__clang__)

Anyway, I prefer runtime dispatching to compile time dispatching by compiling C code with some good compiler to assembly and selecting the exact version of code in runtime. Of course, it also kills inlining as side effect.

vnmakarov commented 8 years ago

By the way, haswell target is only defined for gcc >= 4.9 as far as I see.

Thanks, Vsevolod. I modified the code mum.h/mum512.h/mum-prng.h to deal with that. I've tested it on gcc4.4, gcc4.8, and gcc4.9 and also clang 3.5.

We could use -mavx2 as it was introduced in GCC-4.8. But the current code is better tuned for modern processors.

HashTang commented 8 years ago

Hello, I've been interested in testing your benchmark program. But I am surprised, because it seems to measure xxHash differently from other algorithms.

It invokes the heavier streaming API, requiring multiple calls :

static XXH64_state_t* state = NULL;

static void xxHash64_test(const void *key, int len, uint32_t seed, void *out) {
  if (! state) state = XXH64_createState ();
  XXH64_reset (state, seed);
  XXH64_update (state, key, len);
  *(uint64_t*)out = XXH64_digest (state);
}

It could have been much simpler :

static void xxHash64_test(const void *key, int len, uint32_t seed, void *out) {
  *(uint64_t*)out = XXH64(key, len, seed);
}

This is surprising because "the short version" is the one used for all other algorithms :

*(uint64*)out = CityHash64WithSeed((const char *)key,len,seed);
siphash(out, (const uint8_t *) key, len, (const uint8_t *) s);
MetroHash64::Hash ((const uint8_t *)key, len, (uint8_t *) out, seed);
*(uint64_t *)out = mum_hash (key, len, seed);

The streaming API is meant for very long content. It has an additional burden with state maintenance. Using it for short keys is overkill and just reduces performance.

Anyway, fixing this difference, the results are quite impacted. Similar to @vstakhov , I find xxHash to be 2x - 3x faster on short keys.

vnmakarov commented 8 years ago

It seems I overlooked that xxHash has a simpler interface. Thank you for pointing this out.

I've modified the code and updated xxHash speed numbers in README.md file for x86-64 and ppc64 (I have no aarch64 machine available now therefore I did not change the numbers for this target).

HashTang commented 8 years ago

Thanks. This seems in line with observations.

I have no aarch64 machine available now therefore I did not change the numbers for this target

A quick word explaining why the performance is so bad on this specific target would be a cleaner information (than just publishing wrong numbers).

vstakhov commented 8 years ago

JFYI, @bapt has tested mumhash in https://github.com/vstakhov/libucl on arm v6 with FreeBSD and he found that it is still faster than xxhash32. So I can say that mumhash is now the official hash used in FreeBSD pkg tool :)

vnmakarov commented 8 years ago

A quick word explaining why the performance is so bad on this specific target would be a cleaner information (than just publishing wrong numbers).

You are right. I ran the benchmark again on aarch64 and updated the results. As it is a bit different X-gene board/OS (a slower one), I updated the results for all hash functions for aarch64.

vnmakarov / mum-hash

Performance benchmarks with strings that are not multiple of 8 bytes #3