Closed wolfpld closed 4 years ago
Should be ready for merge.
thanks!
I just noticed a regression on my Intel i7-8700 with that PR. The first benchmark executed with -tc=bench_quick_overall_map_flat
, which does mostly iterating, has degraded from 11.33 op/sec to 7.62 op/sec. So it seems intel is slowed down quite a bit with _tzcnt_u64
.
Is there a way to detect ZEN architecture with a macro?
That's quite a hit :( I don't think there is a way to determine target arch using macros. Maybe let's make something that user can define themselves?
Where can you find this benchmark?
I compiled this repository with this:
cmake -G Ninja -DCMAKE_CXX_FLAGS="-O3 -march=native" -DRH_cxx_standard=17 -DCMAKE_BUILD_TYPE=Release ../..
ninja
./rh -ns -tc=bench_quick_overall_map_flat
This executes a few benchmarks, the first one relies heavily on the CLZ operation., which is done in the fastForward()
These are my results on Ryzen 3900X:
gcc on wsl:
bsf:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 52,786,200.00 | 18.94 | 0.2% | 0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
| 83,493,200.00 | 11.98 | 0.2% | 0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
| 48,770,500.00 | 20.50 | 0.2% | 0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
| 56,677,300.00 | 17.64 | 0.1% | 0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 444,000,500.00 | 2.25 | 0.1% | 4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 370,457,500.00 | 2.70 | 0.1% | 4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.112282
tzcnt:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 81,490,500.00 | 12.27 | 0.1% | 0.90 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
| 83,758,900.00 | 11.94 | 0.2% | 0.92 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
| 49,213,700.00 | 20.32 | 0.1% | 0.54 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
| 88,954,600.00 | 11.24 | 0.2% | 0.98 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 452,119,900.00 | 2.21 | 0.1% | 4.97 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 371,017,100.00 | 2.70 | 0.1% | 4.08 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.13082
MSVC:
bsf:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 104,920,500.00 | 9.53 | 0.2% | 1.15 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
| 77,855,300.00 | 12.84 | 0.1% | 0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
| 62,434,500.00 | 16.02 | 0.1% | 0.69 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
| 109,994,600.00 | 9.09 | 0.1% | 1.21 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 587,536,800.00 | 1.70 | 1.0% | 6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 410,914,400.00 | 2.43 | 0.3% | 4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.154391
tzcnt:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 86,302,500.00 | 11.59 | 0.1% | 0.95 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
| 78,075,200.00 | 12.81 | 0.2% | 0.86 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
| 62,216,300.00 | 16.07 | 0.1% | 0.68 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
| 90,891,800.00 | 11.00 | 0.2% | 1.00 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 602,306,000.00 | 1.66 | 1.0% | 6.60 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 408,184,100.00 | 2.45 | 0.0% | 4.49 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.145191
This is very interesting. I was originally looking only at the MSVC results (this is what I can currently profile). It seems that __builtin_ctzll
is doing something completely different, which is worthy of investigation and maybe backporting for MSVC.
Can you update and try this benchmark again? I've fixed this regression for me. Most of the time the whole 8 bytes are empty, so I don't always need to do the CLZ.
gcc:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 52,687,700.00 | 18.98 | 0.1% | 0.58 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
| 130,305,400.00 | 7.67 | 0.2% | 1.43 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
| 62,951,900.00 | 15.89 | 0.1% | 0.69 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
| 56,682,300.00 | 17.64 | 0.2% | 0.62 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 444,239,600.00 | 2.25 | 0.0% | 4.88 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 365,043,000.00 | 2.74 | 0.3% | 4.02 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.125849
MSVC:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 61,015,900.00 | 16.39 | 0.2% | 0.67 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> iterate while adding then removing`
| 135,066,500.00 | 7.40 | 0.9% | 1.49 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> random insert erase`
| 89,823,200.00 | 11.13 | 0.1% | 0.99 | `<robin_hood::unordered_flat_map<uint64_t, uint64_t>> 50% probability to find`
| 65,740,500.00 | 15.21 | 0.2% | 0.72 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 586,923,200.00 | 1.70 | 0.6% | 6.46 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 409,865,900.00 | 2.44 | 0.1% | 4.51 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.150686
Is the speed decrease in other tests caused by the new hash?
Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency? How did you compile with gcc?
Some more thoughts on that. I have checked fastForward()
in compiler explorer and the code generated on gcc/clang is identical, regardless if _tzcnt_u64
, or __builtin_ctzll
is called. On MSVC the code differs.
While this is true when the function is looked at in isolation, putting things into more open context may open up some more optimization possibilities. clang is very good at it, MSVC is very bad at it. I think that doing the zero check beforehand could be one of these optimizations, which the compiler would do in __builtin_ctzll
case, but didn't in the _tzcnt_u64
one. Overall the change is good, as it improves performance on MSVC.
Yes, but it made far less of a difference on my machine... Did you lock the cpu to a fixed frequency?
No. Didn't see a significant difference over multiple runs.
How did you compile with gcc?
Using the exact commands you pasted above.
I've again replaced the hash, this time with hardware CRC. On my PC it's about the same speed as the old hash. I think the benchmarks are not optimal though.
I think you forgot to push the changes.
Also, have you considered using xxh3? There's a very nice discussion of speed and latency of the algorithm, even with small inputs: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
xxh3 excellent, but pretty slow for just mixing 64bit values. I've done some work on finding a fast mixer here: https://github.com/martinus/better-faster-stronger-mixer
gcc:
| ns/op | op/s | err% | total | benchmarking
|--------------------:|--------------------:|--------:|----------:|:-------------
| 56,421,100.00 | 17.72 | 0.3% | 0.62 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> iterate while adding then removing`
| 93,140,400.00 | 10.74 | 0.1% | 1.03 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> random insert erase`
| 60,295,500.00 | 16.58 | 0.2% | 0.66 | `<robin_hood::unordered_flat_map<uint64_t, size_t>> 50% probability to find`
| 57,060,000.00 | 17.53 | 0.3% | 0.63 | `<robin_hood::unordered_flat_map<std::string, size_t>> iterate while adding then removing`
| 454,988,800.00 | 2.20 | 0.2% | 5.01 | `<robin_hood::unordered_flat_map<std::string, size_t>> random insert erase`
| 369,313,600.00 | 2.71 | 0.2% | 4.06 | `<robin_hood::unordered_flat_map<std::string, size_t>> 50% probability to find`
0.120346
MSVC doesn't compile.
__SSE4_2__
is not defined by MSVC.
Details in #73.