python / cpython

The Python programming language
https://www.python.org

Weakness in tuple hash #40190

Closed cd4845df-51b1-4bf1-b9fe-a1fc87fd16f1 closed 20 years ago

cd4845df-51b1-4bf1-b9fe-a1fc87fd16f1 commented 20 years ago
BPO 942952
Nosy @tim-one, @arigo, @rhettinger

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at =
created_at =
labels = ['interpreter-core']
title = 'Weakness in tuple hash'
updated_at =
user = 'https://bugs.python.org/smst'
```

bugs.python.org fields:
```python
activity =
actor = 'rhettinger'
assignee = 'none'
closed = True
closed_date = None
closer = None
components = ['Interpreter Core']
creation =
creator = 'smst'
dependencies = []
files = []
hgrepos = []
issue_num = 942952
keywords = []
message_count = 18.0
messages = ['20593', '20594', '20595', '20596', '20597', '20598', '20599', '20600', '20601', '20602', '20603', '20604', '20605', '20606', '20607', '20608', '20609', '20610']
nosy_count = 6.0
nosy_names = ['tim.peters', 'arigo', 'rhettinger', 'jimjjewett', 'smst', 'ygale']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue942952'
versions = []
```

cd4845df-51b1-4bf1-b9fe-a1fc87fd16f1 commented 20 years ago

I've encountered some performance problems when constructing dictionaries with keys of a particular form, and have pinned the problem down to the hashing function. I've reproduced the effect in Python 1.5.2, Python 2.2 and Python 2.3.3. I came across this when loading a marshalled dictionary with about 40000 entries. Loading other dictionaries of this size has always been fast, but in this case I killed the interpreter after a few minutes of CPU thrashing. The performance problem was caused because every key in the dictionary had the same hash value.

The problem is as follows: for hashable values X and Y, hash( (X, (X, Y)) ) == hash(Y). This is always true (except in the corner case where hash((X, Y)) is internally calculated to be -1, the error value, in which case -2 is forced as the return value). With data in this form, where X varies more than Y (Y is constant, or chosen from relatively few values compared to X), the chances of collision become high, as X is effectively ignored.

The hash algorithm for tuples starts with a seed value, then generates a new value for each item in the tuple by multiplying the iteration's starting value by a constant (keeping the lowest 32 bits) and XORing with the hash of the item. The final value is then XORed with the tuple's length. In Python (ignoring the careful business with -1):

    # assume 'my_mul' would multiply two numbers and return the low 32 bits
    value = seed
    for item in tpl:
        value = my_mul(const, value) ^ hash(item)
    value = value ^ len(tpl)

The tuple (X, Y) therefore has hash value:

my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash(Y) ^ 2

...and the tuple (X, (X, Y)) has hash value:

my_mul(const, my_mul(const, seed) ^ hash(X)) ^ hash((X, Y)) ^ 2

The outer multiplication is repeated, and is XORed with itself (cancelling both of them). The XORed 2s cancel also, leaving just hash(Y). Note that this cancellation property also means that the same hash value is shared by (X, (X, (X, (X, Y)))), and (X, (X, (X, (X, (X, (X, Y)))))), and so on, and (X, Z, (X, Z, Y)) and so on.
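A minimal sketch of that cancellation, using a stand-alone model of the recurrence above (my_mul, const and seed here are stand-ins rather than the interpreter's actual values; the identity holds for any choice):

```python
def my_mul(a, b):
    # keep only the low 32 bits of the product
    return (a * b) & 0xffffffff

def tuple_hash(tpl, const=1000003, seed=0x345678):
    # pre-fix recurrence: multiply, then XOR in the item's hash
    value = seed
    for item in tpl:
        h = tuple_hash(item, const, seed) if isinstance(item, tuple) else hash(item)
        value = my_mul(const, value) ^ h
    return value ^ len(tpl)

X, Y = 12345, 67890
print(tuple_hash((X, (X, Y))) == hash(Y))   # True: X drops out entirely
```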

I realise that choosing a hash function is a difficult task, so it may be that the behaviour seen here is a tradeoff against other desirable properties -- I don't have the expertise to tell. My naive suggestion would be that an extra multiplication is necessary, which presumably has a performance impact (multiplication being more expensive than XOR) but would reduce the possibility of cancellation. On the other hand, perhaps this particular case is rare enough that it's not worth the effort.

For my own application I'm fortunate in that I can probably rearrange the data structure to avoid this case.

cd4845df-51b1-4bf1-b9fe-a1fc87fd16f1 commented 20 years ago

Logged In: YES user_id=42335

I'll repeat the tuple hashing algorithm in a fixed-width font (with the first line following "for..." being the loop body):

# assume 'my_mul' would multiply two numbers and return the low 32 bits
value = seed
for item in tpl:
    value = my_mul(const, value) ^ hash(item)
value = value ^ len(tpl)

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 20 years ago

Logged In: YES user_id=764593

Any hash will have some collisions. If there is going to be predictably bad data, this is probably a good place to have it.

The obvious alternatives are a more complicated hash (slows everything down), a different hash for embedded tuples (bad, since hash can't be cached then) or ignoring some elements when determining the hash (bad in the normal case of different data).

I would also expect your workaround of data rearrangement to be sensible almost any time (X, (X, Y)) is really a common case. (The intuitive meaning for me is "X - then map X to Y", which could be done as (X, Y) or at least (X, (None, Y)), or perhaps d[X]=(X,Y).)

rhettinger commented 20 years ago

Logged In: YES user_id=80475

The OP was not referring to "some collisions"; his app collapsed all entries to a single hash value. Changing XOR to + would partially eliminate the self cancelling property of this hash function.

Also, I am concerned about the tuple hash using the same multiplier as the hash for other objects. In sets.py, a naive combination of the component hash values caused many distinct sets to collapse to a handful of possibilities -- while tuples do not have an identical issue, it does highlight the risks involved.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 20 years ago

Logged In: YES user_id=764593

Wouldn't that just shift the pathological case from (x, (x, y)) to (x, (-x, y))? My point was that any hash function will act badly on *some* pattern of input, and if a pattern must be penalized, this might be a good pattern to choose.

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

I suggest leaving the function as is, but replacing the constant with a value that varies as we step through the tuple. Take the values from a fixed table. When we reach the end of the table, start again from the beginning and mung the values slightly (e.g., increment them).

True, there will always be the possibilities of collisions, but this will make it very unlikely except in very weird cases. Using a table of constants is a standard technique for hashes.

74c4563b-ab1c-43d8-9219-30c4eca796bc commented 20 years ago

Logged In: YES user_id=764593

Note that in this case, the problem was because of nested tuples. X was the first element of both tuples, and both tuples had the same length -- so the X's would still have been multiplied by the same constant. (Not the same constant as Y and hash((X, Y)), but the same constant as each other.) A non-linear function might work, but would do bad things to other data.

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

Hmm, you are correct. This appears to be an off-by-one problem: the original seed always gets multiplied by the constant (which is silly), and the last item in the tuple does not get multiplied (which causes the bug).

The correct solution is to change:

value = my_mul(const, value) ^ hash(item)

in Steve's pseudo-code to:

value = my_mul(const, value ^ hash(item))

Of course, you still get a lot more robustness for almost no cost if you vary "const" across the tuple via a table.
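A minimal sketch of that reordering, with the same stand-in my_mul, const and seed as in the original report (my own illustration, not code from the thread), showing that the (X, (X, Y)) cancellation no longer goes through:

```python
def my_mul(a, b):
    # keep only the low 32 bits of the product
    return (a * b) & 0xffffffff

def tuple_hash_v2(tpl, const=1000003, seed=0x345678):
    # suggested fix: fold the item's hash in with XOR *before* multiplying
    value = seed
    for item in tpl:
        h = tuple_hash_v2(item, const, seed) if isinstance(item, tuple) else hash(item)
        value = my_mul(const, value ^ h)
    return value ^ len(tpl)

X, Y = 12345, 67890
print(tuple_hash_v2((X, (X, Y))) == hash(Y))   # now False: no systematic cancellation
```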

tim-one commented 20 years ago

Logged In: YES user_id=31435

Noting that

(X, (Y, Z))

and

(Y, (X, Z))

hash to the same thing today too (modulo the -1 endcases). For example,

>>> hash((3, (4, 8))) == hash((4, (3, 8)))
True
>>>
tim-one commented 20 years ago

Logged In: YES user_id=31435

I can't make more time for this, so I've unassigned myself.

This seems to make a good set of test inputs:

N = 50    # or whatever; the figures below are for N == 50
base = range(N)
xp = [(i, j) for i in base for j in base]
inps = base + [(i, j) for i in base for j in xp] + \
             [(i, j) for i in xp for j in base]

When N == 50, inps contains 250,050 distinct elements, and 66,966 of them generate a duplicated hash code. Most simple changes to the current algorithm don't do significantly better, and some do much worse. For example, just replacing the xor with an add zooms it to 119,666 duplicated hash codes across that set.
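A hedged sketch of the bookkeeping behind those duplicate counts (my reconstruction, not code from the thread; reproducing the 66,966 figure requires an interpreter that still has the pre-fix tuple hash):

```python
from collections import Counter

def count_duplicated(hash_func, items):
    # number of items whose hash code is shared with at least one other item
    counts = Counter(hash_func(item) for item in items)
    return sum(c for c in counts.values() if c > 1)

# count_duplicated(hash, inps)  ->  66966 under the pre-fix tuple hash (per above)
# count_duplicated(hx, inps)    ->  0 with the hx function defined below
```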

Here's a hash function that yields no collisions across that set:

def hx(x):
    if isinstance(x, tuple):
        result = 0x345678L
        mult = 1000003
        for elt in x:
            y = hx(elt)
            result = (((result + y) & 0xffffffffL) * mult) & 0xffffffffL
            mult = (mult * 69069) & 0xffffffffL
        result ^= len(x)
        if result == -1:
            result = -2
    else:
        result = hash(x)
    return result

In C, none of the "& 0xffffffffL" clauses are needed; in Python code, they're needed to cut results back to 32 bits. 69069 is a well-known multiplier for a better-than-most pure multiplicative PRNG.

This does add a multiply per loop trip over what we have today, but it can proceed in parallel with the existing multiply. Unlike a canned table, it won't repeat multipliers (unless the tuple has more than a billion elements).

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

Tim, your test fixture is very nice.

I verified your result that there are no collisions over the fixture set for the algorithm you specified. Based on that result, I would recommend that we NOT use your algorithm.

The probability of no collisions is near zero for 250000 random samples taken from a set of 2^32 elements. So the result shows that your algorithm is far from a good hash function, though it is constructed to be great for that specific fixture set.

I ran my algorithm on your fixture set using the table of 64 constants taken from MD5. I got 13 collisions, a reasonable result.

It is not true that I repeat multipliers (very often): note that I increment the elements of the table each time around, so the sequence repeats itself only after 2^38 tuple elements. Also, my algorithm runs at essentially the same speed as the current one: no additional multiplies, only a few adds and increments and such.

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

Here is the Python code for the hash function that I am describing:

def hx(x):
    if isinstance(x, tuple):
        result = 0x345678L
        for n, elt in enumerate(x):
            y = (tbl[n & 0x3f] + (n >> 6)) & 0xffffffffL
            result = (y * (result ^ hx(elt))) & 0xffffffffL
        result ^= len(x)
    else:
        result = hash(x)
    return result

# 64 constants taken from MD5
# (see Modules/md5c.c in the Python sources)
tbl = (
    -0x28955b88, -0x173848aa,  0x242070db, -0x3e423112,
    -0xa83f051,   0x4787c62a, -0x57cfb9ed, -0x2b96aff,
     0x698098d8, -0x74bb0851, -0xa44f,     -0x76a32842,
     0x6b901122, -0x2678e6d,  -0x5986bc72,  0x49b40821,
    -0x9e1da9e,  -0x3fbf4cc0,  0x265e5a51, -0x16493856,
    -0x29d0efa3,  0x2441453,  -0x275e197f, -0x182c0438,
     0x21e1cde6, -0x3cc8f82a, -0xb2af279,   0x455a14ed,
    -0x561c16fb, -0x3105c08,   0x676f02d9, -0x72d5b376,
    -0x5c6be,    -0x788e097f,  0x6d9d6122, -0x21ac7f4,
    -0x5b4115bc,  0x4bdecfa9, -0x944b4a0,  -0x41404390,
     0x289b7ec6, -0x155ed806, -0x2b10cf7b,  0x4881d05,
    -0x262b2fc7, -0x1924661b,  0x1fa27cf8, -0x3b53a99b,
    -0xbd6ddbc,   0x432aff97, -0x546bdc59, -0x36c5fc7,
     0x655b59c3, -0x70f3336e, -0x100b83,   -0x7a7ba22f,
     0x6fa87e4f, -0x1d31920,  -0x5cfebcec,  0x4e0811a1,
    -0x8ac817e,  -0x42c50dcb,  0x2ad7d2bb, -0x14792c6f
    )
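
A possible quick check of this sketch against the family from the original report (my own illustrative snippet, assuming the hx and tbl definitions above; it needs Python 2 because of the L-suffixed literals):

```python
# Under the old built-in tuple hash, every member of this family collided
# on hash(8); hx should spread them over (close to) 1000 distinct values.
samples = [(x, (x, 8)) for x in range(1000)]
distinct = len(set(hx(s) for s in samples))
```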
tim-one commented 20 years ago

Logged In: YES user_id=31435

Nothing wrong with being "better than random" -- it's not the purpose of Python's hash() to randomize, but to minimize hash collisions. That's why, e.g., hash(int) returns int unchanged. Then when indexing a dict by any contiguous range of ints (excepting -1), there are no collisions at all. That's a good thing.

There's nothing in the structure of my example function geared toward the specific test instances used. It *should* do a good job of "spreading out" inputs that are "close together", and that's an important case, and the test inputs have a lot of that. It's a good thing that it avoids all collisions across them.

About speed, the only answer is to time both, and across several major platforms. An extra mult may or may not cost more than blowing extra cache lines to suck up a table of ints.

About the table of ints, it's always a dubious idea to multiply by an even number. Because the high bits of the product are thrown away, multiplying by an even number throws away information needlessly (it throws away as many useful bits as there are trailing zero bits in the multiplier). odd_number * 69069**i is always odd, so the multiplier in the table-free way is never even (or, worse, divisible by 4, 8, 16, etc).
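A throwaway illustration of that point (not from the thread): once the high bits of the product are discarded, an even multiplier can never produce an odd result, so distinct inputs get squeezed into fewer residues.

```python
def low_byte_spread(mult, samples=10000):
    # how many distinct low-order bytes the truncated products can reach
    return len({(mult * x) & 0xff for x in range(samples)})

print(low_byte_spread(69069))   # odd multiplier: all 256 values appear
print(low_byte_spread(69070))   # even multiplier: only 128 (low bit is always 0)
```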

7aa6e20b-8983-474f-b2ae-de7eff1caa04 commented 20 years ago

Logged In: YES user_id=4771

Tim:

> Most simple changes to the current algorithm don't do significantly better, and some do much worse.

The simple change suggested by ygale (put the xor before the multiply) does a quite reasonable job in your test suite. The number of collisions drop from 66966 to 1466. It also solves the original problem with (x, (x,y)), which should not be ignored because it could naturally occur if someone is using d.items() to compute a hash out of a dict built by d[x] = x,y.

With this change it seems more difficult to find examples of regular families of tuples that all turn out to have the same hash value. Such examples do exist: (x, (1684537936,x)) yields 2000004 for any x. But the value 1684537936 has been carefully constructed for this purpose, it's the only number to misbehave like this :-)

tim-one commented 20 years ago

Logged In: YES user_id=31435

If 250,050 balls are tossed into 2**32 bins at random, the mean number of occupied bins is ~250042.7 (== expected number of collisions is ~7.3), with std deviation ~2.7. While we're happy to get "better than random" distributions, "astronomically worse than random" distributions usually predict more of the same kind of problem reported later.

That said, I certainly agree that would be a major, easy, and cheap improvement over the status quo.

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

I uploaded my current diff and some unit tests to patch bpo-957122.

Please look over the unit tests carefully. I am sure there is much more we can do.

Note that besides the addition of the table containing the constants, only three lines of code are changed from the original function. By inspection, this code should run at essentially the same speed as the old code on all platforms (tests anyone?)

Based on the results of the unit tests, I had to do some tweaking.

First of all, Tim is correct: constants that are divisible by any small prime (e.g., 2) are a problem. To be on the safe side, I replaced each md5 constant with a nearby prime.

I ran into a problem with empty tuples nested in tuples. The seed kept getting xored with itself. It turns out that no matter how nice the md5 constants are, they all produce the same value when multiplied by zero. Find "++x" in the code to see the fix for this.

Finally, I moved the xor with the tuple length from the end of the calculation to the beginning. That amplifies the effect very nicely.

c7e4e841-dd04-48f6-a6d3-913b4c8fa9d1 commented 20 years ago

Logged In: YES user_id=1033539

Re Tim's concern about cache hits:

The trade-off is a slight risk of slight performance degradation when hashing very long tuples, versus the risk of complete app failure in certain rare cases like Steve's. I think it is worthwhile to use more constants.

Anecdotally, I did not experience any qualitative slowdown while exercising these things with the unit tests. OTOH, I didn't time anything, and anyway I am using a creaky old Celeron.

Another alternative: use two constants:

 x = (len & 0x1 ? 0xd76aa471 : 0xe8c7b75f) * (++x ^ y);

That passes all of the current unit tests, including Tim's set. However, with that I am less confident that this bug won't come up again in the future.

rhettinger commented 20 years ago

Logged In: YES user_id=80475

Applied a small, slightly cheaper (no second multiply) variation on Tim's original. It is not as scientific looking, but it does assure changing, odd multipliers for each position. When Tim moved the XOR before the multiply, that fixed the OP's original problem.

The new variation survives Tim's torture test which I augmented a bit and put in the test suite for the benefit of future generations.

See: Objects\tupleobject.c 2.91 Lib\test\test_tuple.py 1.3
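
For future readers, a rough Python sketch of the shape of the committed variation described above (XOR folded in before the multiply, with a changing, always-odd multiplier per position and no second multiply in the loop; the exact code is in Objects\tupleobject.c 2.91, and the increment and final constant used here are assumptions on my part):

```python
def tuple_hash_committed_sketch(tpl):
    # sketch only: mirrors the described structure, not the exact C code
    mult = 1000003
    x = 0x345678
    n = len(tpl)
    for i, item in enumerate(tpl):
        y = hash(item)
        x = ((x ^ y) * mult) & 0xffffffff
        # bump the multiplier for the next position; adding an even amount
        # keeps it odd (82520 and the remaining-length term are assumed here)
        mult = (mult + 82520 + 2 * (n - i - 1)) & 0xffffffff
    x = (x + 97531) & 0xffffffff    # final additive constant also assumed
    if x == 0xffffffff:             # would be -1 as a signed C long
        x = 0xfffffffe              # i.e. -2; -1 is reserved as the error value
    return x
```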