microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
10.25k stars 1.51k forks source link

`<functional>`: better hash function #2360

Open AlexGuteniev opened 3 years ago

AlexGuteniev commented 3 years ago

@lhecker observes FNV1a is bad for integers.

https://discord.com/channels/737189251069771789/737189251497721887/913600700846596147

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.

Also tracked by VSO-371547 / AB#371547 .

AlexGuteniev commented 3 years ago

https://github.com/microsoft/STL/pull/686#issuecomment-609261700 mentions SipHash-2-4

lhecker commented 3 years ago

I did some benchmarks based on https://github.com/rurban/smhasher The test involves putting numbers 0 to 127 into a std::unordered_map<uint32_t, uint32_t>, choosing 128 random keys and calling find() for at least 100ms for each of those and computing the average time per random key. The table below contains the mean/median/stddev between those 128 randomly chosen keys.

Benchmark Time CPU
FNV1a
mean
median
stddev

1.660 ns
1.650 ns
0.035 ns

1.670 ns
1.560 ns
0.168 ns
Murmur3F finalizer
mean
median
stddev

1.640 ns
1.480 ns
0.274 ns

1.640 ns
1.400 ns
0.314 ns
XXH3
mean
median
stddev

2.340 ns
2.040 ns
0.411 ns

2.340 ns
2.090 ns
0.500 ns
SipHash-2-4
mean
median
stddev

8.810 ns
8.550 ns
0.488 ns

8.820 ns
9.340 ns
1.020 ns
SipHash-1-3
mean
median
stddev

7.030 ns
6.680 ns
0.524 ns

7.030 ns
6.250 ns
0.980 ns

While SipHash seems promising in preventing hash flood attacks, the actual paper describes it as fast in relation to other MACs and not other PRFs. The claim that it's fast for short messages in general appears to be an overstatement by easily 4x. Considering things like VSO-653642 (complaint about 2x std::hash performance regression) it might be difficult to introduce SipHash into a language like C++.

StephanTLavavej commented 3 years ago

Thanks - the SipHash perf data is definitely news to me, so we'll need to consider the available options carefully.

cbezault commented 3 years ago

I would consider also looking at SipHash13 https://github.com/rust-lang/rust/issues/29754#issuecomment-156073946

AlexGuteniev commented 3 years ago

@BillyONeal also mentioned a security consideration:

the hash values are not the same in successive runs of the program to make unordered_Xxx safe to use with untrusted input out of the box.

But I'm not sure if this can be implemented with insignificant perf impact.

BillyONeal commented 3 years ago

I would argue that the default, user doesn't know any better, option, should be the not exploitable one, even at a performance cost. Moreover, the resulting strength of the hash may help some programs by reducing collisions vs. other hash functions; although it will not for very small keys in relatively small containers like 127 int32s.

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

AlexGuteniev commented 3 years ago

If you want the absolute fastest of juggling razor blades fast, unordered_map is not your port of call in the first place.

std::hash is not neccessarily for unordered_map, a proper hash table may use it either

lhecker commented 3 years ago

@cbezault For short inputs the performance difference between SipHash-2-4 and 1-3 is minimal. I've updated the table above.

@BillyONeal Good point. I agree that a better, but slower hash function should be preferred as the default one for the stdlib. However SipHash isn't a stronger hash function than alternatives and it's rate of collisions is the same as for any other one that more or less passes the smhasher suite.

But personally I'm still against using SipHash as the default for this STL so far - at least according to what I know at the moment. I believe the absolute majority of code written with this STL are end-user, desktop applications and not publicly exposed web servers. In fact I'd explicitly argue that the number of web servers written in C++ and running on Windows is quite low. While the safety promises of SipHash are nice, it doesn't excuse the increase in computing power consumption for applications that don't require any such safety at all.

Before using a new hash function I believe there's a large number of alternatives we could first consider. smhasher has useful reference code, results and benchmarks on lower end hardware, which could be used as a point of reference for us.

StephanTLavavej commented 2 years ago

I agree that we'll need to explore the increasingly large space of alternatives (when vNext work resumes). On the STL Discord, I observed that C++'s concerns are somewhat different from Python's; notably, we don't use dictionaries as our universal data structure (so the impact of a hash function choice is dramatically smaller), and map is available as an alternative to unordered_map and although map is typically somewhat slower, it also offers absolute immunity against hash-denial-of-service attacks.