Open AlexGuteniev opened 3 years ago
https://github.com/microsoft/STL/pull/686#issuecomment-609261700 mentions SipHash-2-4
I did some benchmarks based on https://github.com/rurban/smhasher
The test involves putting numbers 0 to 127 into a std::unordered_map<uint32_t, uint32_t>
, choosing 128 random keys and calling find()
for at least 100ms for each of those and computing the average time per random key. The table below contains the mean/median/stddev between those 128 randomly chosen keys.
Benchmark | Time | CPU |
---|---|---|
FNV1a mean median stddev |
1.660 ns 1.650 ns 0.035 ns |
1.670 ns 1.560 ns 0.168 ns |
Murmur3F finalizer mean median stddev |
1.640 ns 1.480 ns 0.274 ns |
1.640 ns 1.400 ns 0.314 ns |
XXH3 mean median stddev |
2.340 ns 2.040 ns 0.411 ns |
2.340 ns 2.090 ns 0.500 ns |
SipHash-2-4 mean median stddev |
8.810 ns 8.550 ns 0.488 ns |
8.820 ns 9.340 ns 1.020 ns |
SipHash-1-3 mean median stddev |
7.030 ns 6.680 ns 0.524 ns |
7.030 ns 6.250 ns 0.980 ns |
While SipHash seems promising in preventing hash flood attacks, the actual paper describes it as fast in relation to other MACs and not other PRFs. The claim that it's fast for short messages in general appears to be an overstatement by easily 4x. Considering things like VSO-653642 (complaint about 2x std::hash
performance regression) it might be difficult to introduce SipHash into a language like C++.
Thanks - the SipHash perf data is definitely news to me, so we'll need to consider the available options carefully.
I would consider also looking at SipHash13 https://github.com/rust-lang/rust/issues/29754#issuecomment-156073946
@BillyONeal also mentioned a security consideration:
the hash values are not the same in successive runs of the program to make unordered_Xxx safe to use with untrusted input out of the box.
But I'm not sure if this can be implemented with insignificant perf impact.
I would argue that the default, user doesn't know any better, option, should be the not exploitable one, even at a performance cost. Moreover, the resulting strength of the hash may help some programs by reducing collisions vs. other hash functions; although it will not for very small keys in relatively small containers like 127 int32s.
If you want the absolute fastest of juggling razor blades fast, unordered_map
is not your port of call in the first place.
If you want the absolute fastest of juggling razor blades fast,
unordered_map
is not your port of call in the first place.
std::hash
is not neccessarily for unordered_map
, a proper hash table may use it either
@cbezault For short inputs the performance difference between SipHash-2-4 and 1-3 is minimal. I've updated the table above.
@BillyONeal Good point. I agree that a better, but slower hash function should be preferred as the default one for the stdlib. However SipHash isn't a stronger hash function than alternatives and it's rate of collisions is the same as for any other one that more or less passes the smhasher suite.
But personally I'm still against using SipHash as the default for this STL so far - at least according to what I know at the moment. I believe the absolute majority of code written with this STL are end-user, desktop applications and not publicly exposed web servers. In fact I'd explicitly argue that the number of web servers written in C++ and running on Windows is quite low. While the safety promises of SipHash are nice, it doesn't excuse the increase in computing power consumption for applications that don't require any such safety at all.
Before using a new hash function I believe there's a large number of alternatives we could first consider. smhasher has useful reference code, results and benchmarks on lower end hardware, which could be used as a point of reference for us.
I agree that we'll need to explore the increasingly large space of alternatives (when vNext work resumes). On the STL Discord, I observed that C++'s concerns are somewhat different from Python's; notably, we don't use dictionaries as our universal data structure (so the impact of a hash function choice is dramatically smaller), and map
is available as an alternative to unordered_map
and although map
is typically somewhat slower, it also offers absolute immunity against hash-denial-of-service attacks.
@lhecker observes FNV1a is bad for integers.
https://discord.com/channels/737189251069771789/737189251497721887/913600700846596147
vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.
Also tracked by VSO-371547 / AB#371547 .