martinus / robin-hood-hashing

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20
https://gitter.im/martinus/robin-hood-hashing
MIT License
1.5k stars 142 forks source link

Use tzcnt instead of bsf for better performance on ZEN/ZEN2. #74

Closed wolfpld closed 4 years ago

wolfpld commented 4 years ago

Details in #73.

wolfpld commented 4 years ago

Should be ready for merge.

martinus commented 4 years ago

thanks!

martinus commented 4 years ago

I just noticed a regression on my Intel i7-8700 with that PR. The first benchmark executed with -tc=bench_quick_overall_map_flat, which does mostly iterating, has degraded from 11.33 op/sec to 7.62 op/sec. So it seems intel is slowed down quite a bit with _tzcnt_u64.

Is there a way to detect ZEN architecture with a macro?

wolfpld commented 4 years ago

That's quite a hit :( I don't think there is a way to determine target arch using macros. Maybe let's make something that user can define themselves?

Where can you find this benchmark?

martinus commented 4 years ago

I compiled this repository with this:

cmake -G Ninja -DCMAKE_CXX_FLAGS="-O3 -march=native" -DRH_cxx_standard=17 -DCMAKE_BUILD_TYPE=Release ../..
ninja
./rh -ns -tc=bench_quick_overall_map_flat

This executes a few benchmarks, the first one relies heavily on the CLZ operation., which is done in the fastForward()

wolfpld commented 4 years ago

These are my results on Ryzen 3900X:

gcc on wsl:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,786,200.00 |               18.94 |    0.2% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,493,200.00 |               11.98 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       48,770,500.00 |               20.50 |    0.2% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,677,300.00 |               17.64 |    0.1% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,000,500.00 |                2.25 |    0.1% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      370,457,500.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.112282

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       81,490,500.00 |               12.27 |    0.1% |      0.90 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       83,758,900.00 |               11.94 |    0.2% |      0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       49,213,700.00 |               20.32 |    0.1% |      0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       88,954,600.00 |               11.24 |    0.2% |      0.98 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      452,119,900.00 |                2.21 |    0.1% |      4.97 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      371,017,100.00 |                2.70 |    0.1% |      4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.13082

MSVC:

bsf:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|      104,920,500.00 |                9.53 |    0.2% |      1.15 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       77,855,300.00 |               12.84 |    0.1% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,434,500.00 |               16.02 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|      109,994,600.00 |                9.09 |    0.1% |      1.21 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      587,536,800.00 |                1.70 |    1.0% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      410,914,400.00 |                2.43 |    0.3% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.154391

tzcnt:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       86,302,500.00 |               11.59 |    0.1% |      0.95 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|       78,075,200.00 |               12.81 |    0.2% |      0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       62,216,300.00 |               16.07 |    0.1% |      0.68 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       90,891,800.00 |               11.00 |    0.2% |      1.00 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      602,306,000.00 |                1.66 |    1.0% |      6.60 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      408,184,100.00 |                2.45 |    0.0% |      4.49 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.145191

This is very interesting. I was originally looking only at the MSVC results (this is what I can currently profile). It seems that __builtin_ctzll is doing something completely different, which is worthy of investigation and maybe backporting for MSVC.

martinus commented 4 years ago

Can you update and try this benchmark again? I've fixed this regression for me. Most of the time the whole 8 bytes are empty, so I don't always need to do the CLZ.

wolfpld commented 4 years ago

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       52,687,700.00 |               18.98 |    0.1% |      0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|      130,305,400.00 |                7.67 |    0.2% |      1.43 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       62,951,900.00 |               15.89 |    0.1% |      0.69 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       56,682,300.00 |               17.64 |    0.2% |      0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      444,239,600.00 |                2.25 |    0.0% |      4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      365,043,000.00 |                2.74 |    0.3% |      4.02 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.125849

MSVC:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       61,015,900.00 |               16.39 |    0.2% |      0.67 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
|      135,066,500.00 |                7.40 |    0.9% |      1.49 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
|       89,823,200.00 |               11.13 |    0.1% |      0.99 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
|       65,740,500.00 |               15.21 |    0.2% |      0.72 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      586,923,200.00 |                1.70 |    0.6% |      6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      409,865,900.00 |                2.44 |    0.1% |      4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.150686

Is the speed decrease in other tests caused by the new hash?

martinus commented 4 years ago

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency? How did you compile with gcc?

wolfpld commented 4 years ago

Some more thoughts on that. I have checked fastForward() in compiler explorer and the code generated on gcc/clang is identical, regardless if _tzcnt_u64, or __builtin_ctzll is called. On MSVC the code differs.

While this is true when the function is looked at in isolation, putting things into more open context may open up some more optimization possibilities. clang is very good at it, MSVC is very bad at it. I think that doing the zero check beforehand could be one of these optimizations, which the compiler would do in __builtin_ctzll case, but didn't in the _tzcnt_u64 one. Overall the change is good, as it improves performance on MSVC.

Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency?

No. Didn't see a significant difference over multiple runs.

How did you compile with gcc?

Using the exact commands you pasted above.

martinus commented 4 years ago

I've again replaced the hash, this time with hardware CRC. On my PC it's about the same speed as the old hash. I think the benchmarks are not optimal though.

wolfpld commented 4 years ago

I think you forgot to push the changes.

Also, have you considered using xxh3? There's a very nice discussion of speed and latency of the algorithm, even with small inputs: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

martinus commented 4 years ago

xxh3 excellent, but pretty slow for just mixing 64bit values. I've done some work on finding a fast mixer here: https://github.com/martinus/better-faster-stronger-mixer

wolfpld commented 4 years ago

gcc:

|               ns/op |                op/s |    err% |     total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
|       56,421,100.00 |               17.72 |    0.3% |      0.62 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
|       93,140,400.00 |               10.74 |    0.1% |      1.03 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
|       60,295,500.00 |               16.58 |    0.2% |      0.66 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
|       57,060,000.00 |               17.53 |    0.3% |      0.63 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
|      454,988,800.00 |                2.20 |    0.2% |      5.01 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
|      369,313,600.00 |                2.71 |    0.2% |      4.06 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.120346

MSVC doesn't compile.

__SSE4_2__ is not defined by MSVC.