python / cpython

The Python programming language
https://www.python.org/

Hash function is not randomized properly #58826

Closed e7d76618-f6e4-4ba6-b68b-4fee5239ab01 closed 10 years ago

e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago
BPO 14621
Nosy @malemburg, @arigo, @gpshead, @mdickinson, @ncoghlan, @vstinner, @tiran, @benjaminp, @alex, @davidmalcolm, @PaulMcMillan, @serhiy-storchaka, @phmc, @dstufft
Files
  • find_hash_collision.py
  • hash.py
  • hash-attack-3.patch
  • _Py_Hash_Sip24.S

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    The fix for http://bugs.python.org/issue13703 is broken.

    tl;dr: There are effectively only 256 different hash functions (compare that to the size of the _Py_HashSecret prefix and suffix), and whether two keys collide or not depends only on the last 8 bits of the prefix.

    The problem with the current hash randomization is the following: the suffix does not influence whether two keys have the same hash (it is XOR-ed in after everything else), and everything except the last 8 bits of the prefix has no influence either. Try adding 0x474200 to the prefix and see what happens: it simply adds 0x474200 to the resulting hash.

    To mount a DoS attack, an attacker must do the following: generate a set of colliding keys for each of the 256 possible values of the last 8 bits, then try each set against the server. One of them will show a significantly longer response time; keep repeating that one.
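    That probing step might look something like the sketch below (everything here is hypothetical: the URL, the request format, and the pre-generated key_sets are placeholders, and generating the colliding key sets themselves is the hard part and is not shown):

        import time
        import urllib.request

        def slowest_bucket(url, key_sets):
            # key_sets[i] holds keys built to collide when prefix & 0xff == i.
            timings = []
            for low_byte, keys in enumerate(key_sets):
                body = "&".join("%s=1" % k for k in keys).encode()
                t0 = time.time()
                urllib.request.urlopen(url, body).read()
                timings.append((time.time() - t0, low_byte))
            # The one matching set stands out with a much longer response
            # time; that is the set to keep re-sending.
            return max(timings)[1]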

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    E.g. these strings collide for every prefix ending in 0xcd: 0x27fd5a18, 0x26fe78fa

    davidmalcolm commented 12 years ago

    Thanks for filing this bug report.

    I'm not seeing the equal hashes you describe.

    I'm using this recipe to hardcode a specific prefix and print the hashes using it:

        $ gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=0xcdcdcdcd" --eval-command="print _Py_HashSecret" --eval-command="continue" --eval-command="continue" --args python -c "a='\x27\xfd\x5a\x18'; b='\x26\xfe\x78\xfa'; print(hash(a)); print(hash(b))"

    On a 32-bit build of Python 2.7.3 (i686), if I set _Py_HashSecret.prefix=0xcdcdcdcd, I get non-equal hashes for the data you specify (output trimmed somewhat for conciseness):

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = -842150451, suffix = 0}
        Continuing.
        -121255142
        -1199906326

    Similarly, on a 64-bit build of Python 2.7.3 (x86_64), I get non-equal hashes:

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 3452816845, suffix = 0}
        -3992804574342296806
        -8147489705433570838

    Did I misunderstand the report? Thanks.

    vstinner commented 12 years ago

    I don't understand this issue: can you write a short script to test a collision?

    "E.g this strings collide for every prefix ending on 0xcd"

    Do you mean that prefix & 0xff == 0xcd?

    "0x27fd5a18, 0x26fe78fa"

    Is it a byte string or a Unicode string? b'\x27\xfd\x5a\x18' and b'\x26\xfe\x78\xfa'?

    --

    Using PYTHONHASHSEED environment variable, it's easy to find two values generating the same _Py_HashSecret. Just one example:

    PYTHONHASHSEED=3035016679:
    * _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}
    PYTHONHASHSEED=4108758503:
    * _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}

    --

    I wrote find_hash_collision.py to try to compute a collision, but the program fails with:

        ---
        Fail to generate a new seed!
        # seeds = 65298
        ---

    So it fails to generate a new random seed after testing 65298 different seeds. I ran the script with a seed-generating function constrained so that the prefix ends with the byte 0xCD.

    See the attached program: it generates a random seed. Uncomment "seed = generate_seed_0xCD()" if the prefix must end with the 0xCD byte.

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    My bad (I checked only the function in C++, not the result in Python). This should work on 32-bit:

    Prefix: anything ending in 0x00
    Suffix: anything
    Strings: "\x00\xcf\x0b\x96\x19", "\x00\x6d\x29\x45\x18"

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    For example, take this script (on 32-bit):

        ha = hash("\x00\xcf\x0b\x96\x19")
        hb = hash("\x00\x6d\x29\x45\x18")
        if ha == hb:
            print "collision"

    And run the following:

        for i in `seq 0 25`; do echo $i; for j in `seq 0 100`; do ./python -R x.py; done; done

    It gives a collision too many times (around 9 out of 2500 runs).

    davidmalcolm commented 12 years ago
        $ gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=0xcdcdcd00" --eval-command="print _Py_HashSecret" --eval-command="continue" --eval-command="continue" --args python -c 'a="\x00\xcf\x0b\x96\x19"; b="\x00\x6d\x29\x45\x18"; print(hash(a)); print(hash(b))'

    On 32-bit:

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = -842150656, suffix = 0}
        1220138288
        1220138288

    On 64-bit:

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 3452816640, suffix = 0}
        Continuing.
        4087671194599937328
        -1679444439011306192

    vstinner commented 12 years ago

    > For example, take this script (on 32bit): (...) It gives a collision too many times (around 9 out of 2500).

    I tried this script on 32-bit and 64-bit Linux: I didn't see any collision. What is your operating system, and which version, please?

    e3156d00-7a75-44fe-afea-f8b13a04d3b9 commented 12 years ago

    @dmalcolm: As for the gdb example, you need to add --eval-command="set _Py_HashSecret_Initialized=1", otherwise _Py_HashSecret will get overwritten immediately after it is set by gdb, either to 0 if run without the -R switch, or to a random value.

    With the fresh pair of values Vlado provided, I managed to reproduce the collisions on Python 2.7.3, i686 (output trimmed like you did):

    for __PREFIX in 0x0 0x4700 0xdead00 0xcafe00; do gdb --eval-command="break _PyRandom_Init" --eval-command="run" --eval-command="print _Py_HashSecret" --eval-command="set _Py_HashSecret.prefix=${__PREFIX}" --eval-command="set _Py_HashSecret_Initialized=1" --eval-command="print _Py_HashSecret" --eval-command="continue" --eval-command="continue" --args ./python -c "a='\x00\xcf\x0b\x96\x19'; b='\x00\x6d\x29\x45\x18'; print(hash(a)); print(hash(b))"; done

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 0, suffix = 0}
        Continuing.
        1220138288
        1220138288

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 18176, suffix = 0}
        Continuing.
        -1483212240
        -1483212240

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 14593280, suffix = 0}
        Continuing.
        -972665808
        -972665808

        $1 = {prefix = 0, suffix = 0}
        $2 = {prefix = 13303296, suffix = 0}
        Continuing.
        1003122480
        1003122480

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    > I tried this script on 32-bit and 64-bit Linux: I didn't see any collision. What is your operating system, and which version, please?

        $ uname -a
        Linux 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

    Anyway, you should be able to do the following (on 32 bits):

    vstinner commented 12 years ago

    hash.py: a Python implementation of the 32-bit version of hash(bytes). OK, I now see that only the lower 8 bits of the prefix are really mixed with the input string.
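    For reference, here is a minimal Python model of the randomized 32-bit string hash under discussion (a sketch reconstructed from CPython 2.7's string_hash(); this is not the attached hash.py). It reproduces the collisions confirmed by the gdb sessions above for every prefix whose low byte is 0x00:

        _MASK = 0xFFFFFFFF

        def hash32(s, prefix, suffix):
            # x = prefix; x ^= s[0] << 7; then x = (1000003*x) ^ c per byte;
            # finally x ^= len(s); x ^= suffix.  (The "x == -1 becomes -2"
            # special case is omitted; it does not affect collision checks.)
            x = (prefix ^ (ord(s[0]) << 7)) & _MASK
            for c in s:
                x = ((1000003 * x) ^ ord(c)) & _MASK
            return (x ^ len(s) ^ suffix) & _MASK

        a, b = "\x00\xcf\x0b\x96\x19", "\x00\x6d\x29\x45\x18"
        # The pair collides whenever prefix & 0xff == 0x00, regardless of
        # the upper 24 bits of the prefix or of the suffix:
        for prefix in (0x0, 0x4700, 0xdead00, 0xcafe00):
            assert hash32(a, prefix, 0) == hash32(b, prefix, 0)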

    e7d76618-f6e4-4ba6-b68b-4fee5239ab01 commented 12 years ago

    One possible fix: look at StringHasher in the Google V8 code (http://code.google.com/p/v8/source/search?q=stringhasher&origq=stringhasher&btnG=Search+Trunk). The main loop looks like this:

        raw_runninghash += c;
        raw_runninghash += (raw_runninghash << 10);
        raw_runninghash ^= (raw_runninghash >> 6);

    It does not seem to produce the same collisions across many different hash seeds.

    vstinner commented 12 years ago

    One possible fix: ... Main loop looks like this: ..

    Well, it was decided not to impact performance to work around one specific class of attack, given that there are many other ways to DoS Python. So we chose to keep the main loop unchanged. Randomizing the hash is not a fix for the hash DoS; it only makes the attack harder.

    vstinner commented 12 years ago

    Oops, I attached the wrong "hash.py" file.

    benjaminp commented 12 years ago

    We should see if more security can be gained without sacrificing speed.

    vstinner commented 12 years ago

    > The problem with the current hash randomization is the following: the suffix does not influence whether two keys have the same hash (it is XOR-ed in after everything else).

    Yes, the suffix is used to "protect" the secret. Without the suffix, it would be too simple to compute the prefix: getting a single hash value of a known string would leak the prefix (see the sketch below).

    > The suffix does not influence whether two keys have the same hash (...). Neither does anything except the last 8 bits of the prefix.

    I don't know if we can do better and/or if it is a critical issue.
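    To make that concrete, here is a sketch of the prefix recovery, reusing the hash32() model and _MASK from above and assuming the suffix were zero (pow(..., -1, ...) needs Python 3.8+):

        M = 1000003
        M_INV = pow(M, -1, 2 ** 32)   # M is odd, so it is invertible mod 2**32

        def recover_prefix(h, s):
            # Run hash32() backwards for a known string s, assuming suffix == 0.
            x = h ^ len(s)
            for c in reversed(s):
                x = ((x ^ ord(c)) * M_INV) & _MASK
            return x ^ (ord(s[0]) << 7)

        assert recover_prefix(hash32("known", 0xdeadbeef, 0), "known") == 0xdeadbeef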

    tiran commented 11 years ago

    I've modified unicodeobject's unicode_hash() function. V8's algorithm is about 55% slower for an 800 MB ASCII string on my box.

    Python's current hash algorithm for bytes and unicode:

        while (--len >= 0)
            x = (_PyHASH_MULTIPLIER * x) ^ (Py_uhash_t) *P++;

    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"
    10 loops, best of 3: 94.1 msec per loop

    V8's algorithm:

        while (--len >= 0) {
            x += (Py_uhash_t) *P++;
            x += ((x + (Py_uhash_t)len) << 10);
            x ^= (x >> 6);
        }
    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"
    10 loops, best of 3: 164 msec per loop
    7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 11 years ago

    Just to make it extra clear: Vlado showed that the "-R" switch of Python can easily be made fully pointless, with only a bit of extra work. Here is how:

    It's interesting to note how this whole -R discussion made very long threads on python-dev, and python-dev has subsequently ignored (for the past 6 months!) the fact that their "fix" can be worked around in a matter of minutes.

    (For information, I'm sure that if the algorithm is improved to depend on all 32 or 64 bits of the prefix, it would still be easy to crack it. You don't actually need to send 2**32 or 2**64 requests to the web server: with careful design you can send only 32 or 64 requests that each leak one bit of information. Doing that requires a bit more research, but once the recipe is known, it can be freely reused, which seems to defeat the original point.)

    7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 11 years ago

    For reference, the above means that we can implement -R support for PyPy as a dummy ignored flag, and get "security" that is very close to CPython's. :-)

    benjaminp commented 11 years ago

    That doesn't make it an easy CPython issue. :)

    tiran commented 11 years ago

    As far as I understand it, the issue can't be solved with our current hash algorithm. We'd have to use a cryptographic hash function, or at least a hash algorithm with a stronger avalanche effect on the output. The current hash algorithm is designed and optimized for speed, not for security. Any other algorithm is going to slow down hashing.

    Small strings and strings with lots of NUL bytes may leak too much information, too.

    vstinner commented 11 years ago

    It's interesting to note how this whole -R discussion made very long threads on python-dev, and python-dev has subsequently ignored (for the past 6 months!) the fact that their "fix" can be worked around in a matter of minutes.

    No, this issue has not been ignored. Nobody has proposed anything to fix it, but we are still working on it (sometimes in private).

    In my opinion, we cannot solve this issue without slowing down Python; or at least, I don't yet know of a fast and secure hash algorithm. I don't really want to slow down Python for one specific issue when there are so many other ways to DoS a (web) server.

    How do other languages (say Perl, Ruby, PHP, Javascript) handle this issue?

    serhiy-storchaka commented 11 years ago

    $ ./python -m timeit -s "t = 'abcdefgh' * int(1E8)" "hash(t)"

    I got different numbers (32-bit Linux, AMD Athlon 64 X2 4600+).

    Python's current hash algorithm: 10 loops, best of 3: 343 msec per loop

    V8's algorithm: 10 loops, best of 3: 244 msec per loop

    malemburg commented 11 years ago

    On 21.10.2012 23:42, STINNER Victor wrote:

    STINNER Victor added the comment:

    > It's interesting to note how this whole -R discussion made very long threads on python-dev, and python-dev has subsequently ignored (for the past 6 months!) the fact that their "fix" can be worked around in a matter of minutes.

    > No, this issue has not been ignored. Nobody has proposed anything to fix it, but we are still working on it (sometimes in private).

    > In my opinion, we cannot solve this issue without slowing down Python; or at least, I don't yet know of a fast and secure hash algorithm. I don't really want to slow down Python for one specific issue when there are so many other ways to DoS a (web) server.

    Well, I did propose a different approach to the whole problem: counting collisions. That would have avoided the usability issues you get with the randomization approach, made it possible for the application to detect the attack, and not have introduced any significant runtime overhead for applications not under attack.

    The proposal was shot down with the argument that it wouldn't fix the problem.

    It should also be noted that the randomization only applies to strings/bytes; dictionaries with other colliding keys are not protected at all.

    Perhaps it's time to revisit the collision counting idea?

    It would work in much the same way as the stack recursion limit we have in Python.

    -- Marc-Andre Lemburg eGenix.com


    7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 11 years ago

    Benjamin: oops, sorry. I don't remember setting the "easy" keyword; my mistake.

    Fwiw I'm +1 on Marc-Andre's solution. Make it a tunable setting, e.g. with sys.setcollisionlimit(). Defaults to sys.maxint on existing Pythons and some smaller value (70?) on new Pythons. It has the same benefits as the recursion limit: it's theoretically bad, but most of the time very useful.

    It would also crash on bad usages of custom __hash__() methods: e.g. if you put a lot of keys in a dict, all with a custom __hash__() that returns 42. I imagine it can be considered a good thing to raise in this case rather than silently degrade performance forever.
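    A toy model of the collision-limit idea (purely illustrative; this is not MAL's actual patch, which counts hash and slot collisions inside the real dict implementation, and 200 is just a value discussed later in the thread):

        class LimitedDict(dict):
            """Toy dict that raises instead of degrading when too many
            distinct keys share one hash value.  Only direct item
            assignment is guarded in this sketch."""
            COLLISION_LIMIT = 200

            def __init__(self):
                dict.__init__(self)
                self._per_hash = {}   # hash value -> number of distinct keys

            def __setitem__(self, key, value):
                if key not in self:
                    h = hash(key)
                    n = self._per_hash.get(h, 0) + 1
                    if n > self.COLLISION_LIMIT:
                        raise RuntimeError("too many hash collisions")
                    self._per_hash[h] = n
                dict.__setitem__(self, key, value)

    Filling such a dict with keys whose __hash__() always returns 42 then raises after 200 insertions instead of silently going quadratic.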

    benjaminp commented 11 years ago

    Here's the message that helped convince us to go against collision counting originally: http://mail.python.org/pipermail/python-dev/2012-January/115726.html

    66ad9e9a-270d-4f37-b73f-1d9f3df7701e commented 11 years ago

    How about using a secure hash algorithm that's implemented in hardware when available? It doesn't eliminate the issue on systems that lack this support, but at least it limits the scope of the problem.

    Of course some testing would need to be done to make sure the hardware hashing doesn't have a significant impact on performance.

    tiran commented 11 years ago

    Our hash randomization will always leak some information about the randomization keys. The only way to properly secure our secrets is a cryptographically secure algorithm, for example a cryptographic hash function in combination with a message authentication code like HMAC. I don't have to explain how that is going to hurt performance ...

    We can try to make it harder to guess the secret parts with a slightly modified algorithm, e.g. V8's hash, but that's never going to be 100% secure. It might, however, be secure enough to make an attack impractical.

    tiran commented 11 years ago

    I deem hash randomization and collision counting a poor man's workaround for the actual issue. Perhaps we shouldn't try too hard to fix an unsuitable data type. Hash maps have a known worst-case complexity of O(n); a structure with O(log n) lookups should be used for potentially malicious key/value pairs.

    How about Python grows an additional btree implementation in its collections module? I know that it's not going to fix existing code. However, in the long run it's the best safeguard against hash collision attacks. I'm thinking of a simple, self-balancing btree such as a red-black tree. A quick search on Wikipedia also revealed Scapegoat and Splay trees, which have interesting properties.
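    For illustration, a minimal comparison-based mapping (a sketch of the hash-independence property only, not a btree: lookups are O(log n) no matter what the keys hash to, while insertion stays O(n) because of list shifting, which is exactly what a real btree would fix):

        import bisect

        class SortedMap:
            """Toy comparison-based mapping: immune to hash collisions."""

            def __init__(self):
                self._keys = []
                self._values = []

            def __setitem__(self, key, value):
                i = bisect.bisect_left(self._keys, key)
                if i < len(self._keys) and self._keys[i] == key:
                    self._values[i] = value           # overwrite existing key
                else:
                    self._keys.insert(i, key)         # O(n) shift; a btree
                    self._values.insert(i, value)     # would make this O(log n)

            def __getitem__(self, key):
                i = bisect.bisect_left(self._keys, key)
                if i < len(self._keys) and self._keys[i] == key:
                    return self._values[i]
                raise KeyError(key)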

    8fda13ee-b3df-42fe-9071-0320b47e29e6 commented 11 years ago

    Christian, there are good semi-crypto hash functions that don't leak as badly as Python's own modified FNV hash, without going all the way to HMAC.

    SipHash has very good collision resistance and doesn't leak anything: https://www.131002.net/siphash/ (notice: they distribute a python program to recover python's seed)

    It's obviously slower than Python's FNV, but it's hard to beat a sum+multiplication per character.
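    For readers of the thread, a pure-Python transcription of the published SipHash-2-4 algorithm (only a reference sketch, not any CPython patch, and orders of magnitude slower than the C version benchmarked below):

        def _rotl(x, b):
            return ((x << b) | (x >> (64 - b))) & 0xFFFFFFFFFFFFFFFF

        def siphash24(k0, k1, data):
            mask = 0xFFFFFFFFFFFFFFFF
            v0 = k0 ^ 0x736f6d6570736575   # "somepseu"
            v1 = k1 ^ 0x646f72616e646f6d   # "dorandom"
            v2 = k0 ^ 0x6c7967656e657261   # "lygenera"
            v3 = k1 ^ 0x7465646279746573   # "tedbytes"

            def sipround():
                nonlocal v0, v1, v2, v3
                v0 = (v0 + v1) & mask; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
                v2 = (v2 + v3) & mask; v3 = _rotl(v3, 16); v3 ^= v2
                v0 = (v0 + v3) & mask; v3 = _rotl(v3, 21); v3 ^= v0
                v2 = (v2 + v1) & mask; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

            # Compression: 2 SipRounds per 8-byte little-endian word.
            w = len(data) // 8
            for i in range(w):
                m = int.from_bytes(data[8 * i:8 * i + 8], "little")
                v3 ^= m
                sipround(); sipround()
                v0 ^= m
            # Last block: remaining bytes plus the input length in the top byte.
            m = int.from_bytes(data[8 * w:], "little") | ((len(data) & 0xff) << 56)
            v3 ^= m
            sipround(); sipround()
            v0 ^= m
            # Finalization: 4 SipRounds.
            v2 ^= 0xff
            sipround(); sipround(); sipround(); sipround()
            return (v0 ^ v1 ^ v2 ^ v3) & mask

        # Test vector from the reference implementation: empty input,
        # key bytes 00 01 ... 0f.
        assert siphash24(0x0706050403020100, 0x0f0e0d0c0b0a0908, b"") == 0x726fdb47dd0e0e31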

    tiran commented 11 years ago

    Thanks!

    SipHash looks interesting. It's using an XOR + ROT approach with a seed. And it's written by DJB, which is usually a good sign: he writes secure, high-quality code. Just his coding style tends to be ... unique. :)

    I'm going to try the algorithm.

    tiran commented 11 years ago

    I modified crypto_auth() a bit:

    Py_uhash_t crypto_auth(const unsigned char *in, unsigned long long inlen)
      ...
      u64 k0 = _Py_HashSecret.prefix;
      u64 k1 = _Py_HashSecret.suffix;
      ...
      return (Py_uhash_t)b;

    and replaced the loop in _Py_HashBytes() with a call to crypto_auth(). For large strings SipHash is about as fast as our current algorithm on my 64-bit box. That was to be expected, as SipHash works on blocks of 8 bytes while the default algorithm can't be optimized with SIMD instructions.

    Current hashing algorithm:

        $ ./python -m timeit -s "x = b'a' * int(1E7)" "hash(x)"
        1000000 loops, best of 3: 0.39 usec per loop

    SipHash:

        $ ./python -m timeit -s "x = b'a' * int(1E7)" "hash(x)"
        1000000 loops, best of 3: 0.381 usec per loop

    8fda13ee-b3df-42fe-9071-0320b47e29e6 commented 11 years ago

    For short strings, you might want to have a look at the way you fetch the final partial word from memory.

    If the string is >= 8 bytes, you can fetch the last partial word as an unaligned memory fetch followed by a shift, instead of using a switch like in the reference code.

    tiran commented 11 years ago

    We can explore the various optimization options later. Also, unaligned memory access is not allowed on some architectures, like SPARC.

    If somebody likes to play with the algorithm: http://hg.python.org/sandbox/cheimes/shortlog/2cb7e97ca8d0

    serhiy-storchaka commented 11 years ago
    $ ./python -m timeit -s "x = b'a' * int(1E7)" "hash(x)"

    Note that the hash is calculated only once; later calls return the cached value. Add the -n 1 option and use larger data.

    If somebody likes to play with the algorithm:

    $ ./python -m timeit -n 1 -s "t = 'abcdefgh' * 10**8" "hash(t)"
    1 loops, best of 3: 4.86 sec per loop

    The current hash algorithm runs in 3.43 sec, V8's algorithm in 2.44 sec.

    With a simple optimization I got 3.62 sec, only 6% slower than the current algorithm:

      #define U8TO64_LE(p) ((u64)((u32 *)(p))[0] | ((u64)((u32 *)(p))[1] << 32))
    tiran commented 11 years ago

    Thanks to Snakebit I was able to test the code on a 32-bit BSD installation with GCC 4.2. ASCII unicode and bytes performance is about 8% slower, UCS2 unicode about 37% slower. There might be room for improvement, though.

        % ./python -m timeit -r20 -n1000000 -s "h = hash; x = 'a' * 10**7" -- "h(x)"
        Current: 1000000 loops, best of 20: 0.109 usec per loop
        SipHash: 1000000 loops, best of 20: 0.118 usec per loop

        % ./python -m timeit -r20 -n1000000 -s "h = hash; x = 'ä' * 10**7" -- "h(x)"
        Current: 1000000 loops, best of 20: 0.119 usec per loop
        SipHash: 1000000 loops, best of 20: 0.163 usec per loop

    tiran commented 11 years ago

    Serhiy's trick

    #define U8TO64_LE(p) ((u64)((u32 *)(p))[0] | ((u64)((u32 *)(p))[1] << 32))

    gives a nice speedup. Now UCS2 is down to 0.133 usec (12% slower than the current algorithm) and ASCII down to 0.105 usec (3% faster).

    serhiy-storchaka commented 11 years ago

    % ./python -m timeit -r20 -n1000000 -s "h = hash; x = 'a' * 10**7" -- "h(x)"

    This measures only one hash calculation and 999999 cached calls.

    serhiy-storchaka commented 11 years ago

    I tested different kinds of strings.

    $ ./python -m timeit -n 1 -s "t = b'a' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = 'a' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\u0100' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\U00010000' * 10**8"  "hash(t)"
                current     SipHash
        bytes   181 msec    453 msec    2.5x
        UCS1    429 msec    453 msec    1.06x
        UCS2    179 msec    897 msec    5x
        UCS4    183 msec    1.79 sec    9.8x

    malemburg commented 11 years ago

    Here's a demo patch (against Python 2.7) which counts hash value collisions and slot collisions. I had posted that in the original ticket where we discussed the hash problem (http://bugs.python.org/issue14621).

    This avoids issues like attack 1 mentioned in http://mail.python.org/pipermail/python-dev/2012-January/115726.html

    Attack 2 in that email can easily be worked around by reducing the collision limit to a smaller number.

    Even better: An application could even dynamically adjust the maximum collision counts by catching the exception and setting a new upper limit depending on its knowledge of the field of application - warning the sysadmin of a potential problem and allowing her to take action. That way the application could start with a low safe maximum collision number of say 100 and then raise the limit in a controlled way.

    BTW: When trying out new hash functions, you need to look not only at the performance of the hash function, but also (and more importantly) at the effect on dictionaries.

    Just as reminder: The integer key problem is still open. Using the demo script http://bugs.python.org/file24300/integercollision.py, it's easy to keep Python going for minutes without any major effort.

    I don't understand why we are only trying to fix the string problem and completely ignore other key types. Strings are easy to send to a web server, yes, but there are other applications out there which take input data from other sources/formats as well (e.g. csv files). And it's not unusual to convert input strings to integers to use them as dictionary keys, say item IDs or counts. So while the string keys may not cause a problem, the integer keys still might.

    malemburg commented 11 years ago

    On 07.11.2012 09:34, Marc-Andre Lemburg wrote:

    Here's a demo patch (against Python 2.7) which counts hash value collisions and slot collisions. I had posted that in the original ticket where we discussed the hash problem (http://bugs.python.org/issue14621).

    Sorry, wrong URL. The correct one is http://bugs.python.org/issue13703

    -- Marc-Andre Lemburg eGenix.com


    8fda13ee-b3df-42fe-9071-0320b47e29e6 commented 11 years ago

    Until it's broken with a yet-unknown attack, SipHash is a pseudo-random function and as such it uniformly distributes values across the output space and never leaks any information about the key (the randomized seed). Being designed by cryptographers, it is unlikely to turn out to be a "fail" like the solution that was just released (no offense intended, but it's been a large-scale PR failure).

    As long as we don't introduce bias while reducing SipHash's output to fit the hash table size (so, for instance, use of the modulus operator is not appropriate), the hash function should behave very well.

    Any data type can be supplied to SipHash, including numbers; you just need to take their (platform-dependent) memory representation and feed it to SipHash. Obviously it will be much, much slower than the current function, which used to be hash(x) = x (before randomization), but that's the price to pay to avoid security issues.
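    For instance, a hypothetical helper on top of the siphash24() sketch above ('<q' packs a little-endian signed 64-bit integer, so this only covers ints below 2**63 in magnitude):

        import struct

        def hash_int64(k0, k1, n):
            # Feed the integer's fixed-width memory representation to SipHash.
            return siphash24(k0, k1, struct.pack("<q", n))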

    7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 11 years ago

    Marc-André: estimating the risks of giving up on a valid query for a truly random hash, at an overestimated one billion queries per second, in a 2/3 full dictionary:

    • for 1000: 4E159 years between mistakes

    • for 100: 12.9 years between mistakes

    • for 150: 8E9 years between mistakes

    • for 200: 5E18 years between mistakes

    So while it seems that 100 might be a bit too small, using 150 to 200 is perfectly safe (and that's "perfect" in the sense that a computer will encounter random hardware errors at a higher rate than that).
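    These figures can be reproduced with a back-of-the-envelope model (assuming each probe in a 2/3-full open-addressed table hits an occupied slot with probability 2/3, so a probe chain of length >= limit has probability (2/3)**limit):

        queries_per_sec = 1e9

        for limit in (100, 150, 200, 1000):
            p = (2.0 / 3.0) ** limit             # chance one query trips the limit
            years = 1.0 / (queries_per_sec * p) / (3600 * 24 * 365.25)
            print("limit %4d: %.3g years between mistakes" % (limit, years))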

    malemburg commented 11 years ago

    On 07.11.2012 12:06, Armin Rigo wrote:

    Armin Rigo added the comment:

    Marc-André: estimating the risks of giving up on a valid query for a truly random hash, at an overestimated one billion queries per second, in a 2/3 full dictionary:

    • for 1000: 4E159 years between mistakes

    • for 100: 12.9 years between mistakes

    • for 150: 8E9 years between mistakes

    • for 200: 5E18 years between mistakes

    So while it seems that 100 might be a bit too small, using 150 to 200 is perfectly safe (and that's "perfect" in the sense that a computer will encounter random hardware errors at a higher rate than that).

    I used the 1000 limit only as an example. In tests Victor and I ran (see the original ticket from a few months ago), 200 turned out to be a reasonable number for the default maximum hash collision value.

    I'm not sure about the slot collision limit. We'd have to run more tests on those.

    8fda13ee-b3df-42fe-9071-0320b47e29e6 commented 11 years ago

    On 07 Nov 2012, at 08:40, Serhiy Storchaka <report@bugs.python.org> wrote:

    Serhiy Storchaka added the comment:

    I tested different kind of strings.

    $ ./python -m timeit -n 1 -s "t = b'a' * 10**8" "hash(t)"
    $ ./python -m timeit -n 1 -s "t = 'a' * 10**8" "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\u0100' * 10**8" "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\U00010000' * 10**8" "hash(t)"

                current     SipHash
        bytes   181 msec    453 msec    2.5x
        UCS1    429 msec    453 msec    1.06x
        UCS2    179 msec    897 msec    5x
        UCS4    183 msec    1.79 sec    9.8x

    Hi Serhiy,

    can you please attach the generated assembly code for the siphash function with your compiler and your optimization flags (that is, the one that produces the above results)?

    Thanks! -- Giovanni Bajo

    serhiy-storchaka commented 11 years ago

    can you please attach the generated assembly code for the siphash function with your compiler and your optimization flags (that is, the one that produces the above results)?

    GCC (Ubuntu 4.4.3-4ubuntu5.1) options:

    -pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I. -IInclude -I./Include -DPy_BUILD_CORE

    32-bit Linux on AMD Athlon(tm) 64 X2 Dual Core Processor 4600+.

    tiran commented 11 years ago

    Serhiy, the performance of hash() for long strings isn't very relevant to the general performance of a Python program; short strings dominate. I've modified the timeit call to create a new string object every time.

    for I in 5 10 15 20 30 40 50 60; do echo -ne "$I\t"; ./python -m timeit -n100000 -r30 -s "h = hash; x = 'ä' * $I" -- "h(x + 'a')" | awk '{print $6}' ; done

    ASCII:

        #    SIP     FNV
        5    0.112   0.0979
        10   0.115   0.103
        15   0.12    0.107
        20   0.124   0.112
        30   0.126   0.127
        40   0.136   0.142
        50   0.142   0.147
        60   0.146   0.159

    UCS-2:

        #    SIP     FNV
        5    0.114   0.0977
        10   0.117   0.0988
        15   0.12    0.11
        20   0.126   0.109
        30   0.13    0.122
        40   0.14    0.132
        50   0.144   0.147
        60   0.152   0.157

    For short strings the additional round and setup costs make hash() about 10% slower. For long strings SIP is faster as it processes 8 bytes at once instead of 1 to 4 bytes.

    mdickinson commented 11 years ago

    [MAL]

    I don't understand why we are only trying to fix the string problem and completely ignore other key types.

    [Armin]

    estimating the risks of giving up on a valid query for a truly random hash, at an overestimated one billion queries per second ...

    That's fine in principle, but if this gets extended to integers, note that our current integer hash is about as far from 'truly random' as you can get:

        Python 3.4.0a0 (default:f02555353544, Nov  4 2012, 11:50:12) 
        [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
        Type "help", "copyright", "credits" or "license" for more information.
        >>> [hash(i) for i in range(20)]
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

    Moreover, it's going to be very hard to change the int hash while preserving the x == y implies hash(x) == hash(y) invariant across all the numeric types (int, float, complex, Decimal, Fraction, 3rd-party types that need to remain compatible).
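    Concretely, the invariant Mark refers to ties int hashing to every other numeric type:

        from decimal import Decimal
        from fractions import Fraction

        # x == y must imply hash(x) == hash(y), across numeric types:
        assert hash(1) == hash(1.0) == hash(Decimal(1)) == hash(Fraction(1, 1))
        # Changing hash(int) therefore silently breaks float, Decimal,
        # Fraction, and third-party numeric types that mirror the scheme.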

    malemburg commented 11 years ago
    On 07.11.2012 12:55, Mark Dickinson wrote:
    > 
    > Mark Dickinson added the comment:
    > 
    > [MAL]
    >> I don't understand why we are only trying to fix the string problem
    >> and completely ignore other key types.
    > 
    > [Armin]
    >> estimating the risks of giving up on a valid query for a truly random
    >> hash, at an overestimated one billion queries per second ...
    > 
    > That's fine in principle, but if this gets extended to integers, note that our current integer hash is about as far from 'truly random' as you can get:
    > 
    >     Python 3.4.0a0 (default:f02555353544, Nov  4 2012, 11:50:12) 
    >     [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
    >     Type "help", "copyright", "credits" or "license" for more information.
    >     >>> [hash(i) for i in range(20)]
    >     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    > 
    > Moreover, it's going to be *very* hard to change the int hash while preserving the `x == y implies hash(x) == hash(y)` invariant across all the numeric types (int, float, complex, Decimal, Fraction, 3rd-party types that need to remain compatible).

    Exactly. And that's why trying to find secure hash functions isn't going to solve the problem. Together with randomization they may make things better for strings, but they are no solution for numeric types, and they also don't allow detecting possible attacks on your systems.

    But yeah, I'm repeating myself :-)

    mdickinson commented 11 years ago

    And I'm probably repeating myself too, but: the predictability of (and difficulty of changing) hashing for numeric types is why I'm strongly opposed to hash collision / slot collision limits: they'd end up disallowing reasonably natural-looking Python numeric sets (e.g. {2**k for k in range(n)} for smallish n). I don't think core Python should be solving this issue at all; I think that's a job for the web frameworks. Christian's idea of providing more suitable types in the standard library sounds like the right direction to me.