Hash collisions for tuples #78932

Closed jdemeyer closed 5 years ago

jdemeyer commented 5 years ago
BPO 34751
Nosy @tim-one, @rhettinger, @mdickinson, @ericvsmith, @jdemeyer, @sir-sigurd
PRs
  • python/cpython#9471
  • python/cpython#9534
Files
  • htest.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    jdemeyer commented 5 years ago

    It's not hard to come up with a hash collision for tuples:

    >>> hash( (1,0,0) )
    2528505496374819208
    >>> hash( (1,-2,-2) )
    2528505496374819208

    The underlying reason is that the hashing code mixes ^ and *. This can create collisions because, for odd x, we have x ^ (-2) == -x and this minus operation commutes with multiplication.

    This can be fixed by replacing ^ with +. On top of that, the hashing code for tuples looks needlessly complicated. A simple Bernstein hash suffices.
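
    To make the mechanism concrete, here is a minimal pure-Python model of the xor-and-multiply loop being described (the seed is a placeholder and the multiplier update is simplified, so this is a sketch of the structure rather than CPython's exact code); it shows how the x ^ (-2) == -x identity for odd x propagates through the loop and reproduces the reported collision:

        MASK = (1 << 64) - 1      # emulate 64-bit unsigned wraparound

        def model_tuple_hash(t, seed=0x345678, mult=1000003):
            # Core of the scheme under discussion: x = (x ^ hash(item)) * mult.
            # Small ints hash to themselves, so the items are used directly here.
            x = seed
            for item in t:
                x = ((x ^ (item & MASK)) * mult) & MASK
                # The real code also varies the multiplier per position; adding an
                # even constant keeps it odd, which is all that matters for this demo.
                mult = (mult + 82520) & MASK
            return x

        assert model_tuple_hash((1, 0, 0)) == model_tuple_hash((1, -2, -2))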

    rhettinger commented 5 years ago

    It's not hard to come up with a hash collision for tuples:

    Nor is it hard to come up with collisions for most simple hash functions. For typical use cases of tuples, what we have works out fine (it just combines the effects of the underlying hashes).

    The underlying reason is that the hashing code mixes ^ and *.

    Addition also has a relationship to multiplication. I think the XOR is fine.

    A simple Bernstein hash suffices.

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    On top of that, the hashing code for tuples looks needlessly complicated.

    The important thing is that it has passed our testing (see the comment above) and that it compiles nicely (see the tight disassembly below).

    ----------------------

    L82:
        xorq    %r14, %rax
        imulq   %rbp, %rax
        leaq    82518(%rbp,%r12,2), %rbp
        movq    %r13, %r12
        movq    %rax, %r14
        leaq    -1(%r13), %rax
        cmpq    $-1, %rax
        je      L81
        movq    %rax, %r13
    L73:
        addq    $8, %rbx
        movq    -8(%rbx), %rdi
        call    _PyObject_Hash
        cmpq    $-1, %rax
        jne     L82

    jdemeyer commented 5 years ago

    Nor is it hard to come up with collisions for most simple hash functions.

    The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough.

    Addition also has a relationship to multiplication.

    Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement.

    That is more suited to character data and small hash ranges (as opposed to 64-bit).

    Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths.

    The important thing is that it has passed our testing

    That's a silly argument. If it passed testing, it is only because the testing is insufficient.

    But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?

    rhettinger commented 5 years ago

    ISTM, you're being somewhat aggressive and condescending. I'll hand you off to Tim who is far better at wrestling over the details :-)

    From my point of view, you've made a bold and unsubstantiated claim that Python's existing tuple hash has collision issues with real code. Only a contrived and trivial integer example was shown.

    On my end, I made an honest effort to evaluate the suggestion. I read your post, looked up Bernstein hash, revisited the existing C code and found that everything it is doing makes sense to me. There was a comment indicating that empirical tests were run to get the best possible constants. Also, I disassembled the code to verify that it is somewhat efficiently compiled.

    You responded that I'm being "silly" and then you reversed my decision to close the issue (because there is no evidence that the hash is broken). Accordingly, I'm going to hand it over to Tim who I'm sure will know the history of the existing code, can compare and contrast the various options, evaluate them in the context of typical Python use cases, and assess whether some change is warranted.

    tim-one commented 5 years ago

    @jdemeyer, please define exactly what you mean by "Bernstein hash". Bernstein has authored many hashes, and none on his current hash page could possibly be called "simple":

    https://cr.yp.to/hash.html

    If you're talking about the teensy

        hash = hash * 33 + c;

    thing from now-ancient C folklore, Bernstein is claimed to have later replaced the addition with xor:

    http://www.cse.yorku.ca/~oz/hash.html

    In any case, Python _used_ to do plain

        x = (1000003 * x) ^ y;
    
    but changed to the heart of the current scheme 14 years ago to address Source Forge bug python/cpython#40190:
        https://github.com/python/cpython/commit/41bd02256f5a2348d2de3d6e5fdfcaeb2fcaaebc#diff-cde4fb3109c243b2c2badb10eeab7fcd

    The Source Forge bug reports no longer appear to exist. If you care enough, dig through the messages with subject:

    [ python-Bugs-942952 ] Weakness in tuple hash

    here:

    https://mail.python.org/pipermail/python-bugs-list/2004-May/subject.html

    The thrust was that the simpler approach caused, in real life, hashes of nested tuples of the form

    (X, (X, Y))

    to return hash(Y) - the hash(X) parts across calls magically xor'ed out.

    It was also systematically the case that

    (X, (Y, Z))

    and

    (Y, (X, Z))

    hashed to the same thing.
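
    A quick pure-Python model of that pre-2004 scheme (using the x = (1000003 * x) ^ y recurrence quoted above, with an assumed seed and the final length/constant mixing omitted, so this is a sketch of the structure rather than the exact historical code) makes the cancellation visible:

        M = 1000003
        MASK = (1 << 64) - 1

        def old_style_hash(obj, seed=0x345678):
            # pre-2004 structure: x = (M * x) ^ hash(item), applied recursively to tuples
            if not isinstance(obj, tuple):
                return hash(obj) & MASK
            x = seed
            for item in obj:
                x = ((M * x) & MASK) ^ old_style_hash(item)
            return x

        X, Y, Z = 123, 456, 789
        # the hash(X) contributions xor each other out, leaving only hash(Y)
        assert old_style_hash((X, (X, Y))) == old_style_hash(Y)
        # and swapping X and Y across nesting levels changes nothing
        assert old_style_hash((X, (Y, Z))) == old_style_hash((Y, (X, Z)))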

    So the complications were added for a reason, to address the showed-up-in-real-life pathological cases discussed in the bug report already linked to. Since I don't believe we've had other reports of real-life pathologies since, nobody is going to change this again without truly compelling reasons. Which I haven't seen here, so I'm closing this report again.

    BTW, as also explained in that report, the goal of Python's hash functions is to minimize collisions in dicts in real life. We generally couldn't care less whether they "act random", or are hard to out-guess. For example,

     assert all(hash(i) == i for i in range(1000000))

    has always succeeded.

    ned-deily commented 5 years ago

    (FWIW, I believe the Source Forge bug reports were forward-ported to bugs.python.org. This looks like the issue in question: https://bugs.python.org/issue942952 )

    tim-one commented 5 years ago

    Ah! I see that the original SourceForge bug report got duplicated on this tracker, as bpo-942952. So clicking on that is a lot easier than digging thru the mail archive.

    One message there noted that replacing xor with addition made collision statistics much worse for the cases at issue.

    jdemeyer commented 5 years ago

    ISTM, you're being somewhat aggressive and condescending.

    Interestingly, I felt the same with your reply, which explains my rapid reversion of your closing of the ticket. The way I see it: I pointed out a bug in tuple hashing and you just closed that, basically claiming that it's not a bug.

    I still don't understand why you are closing this so quickly. To put it differently: if I can improve the hash function, removing that obvious collision, extending the tests to catch this case, while not making it slower, why shouldn't this be done?

    jdemeyer commented 5 years ago

    Since I don't believe we've had other reports of real-life pathologies since

    You just did in this bug report. Why doesn't that count?

    tim-one commented 5 years ago

    @jdemeyer, you didn't submit a patch, or give any hint that you _might_. It _looked_ like you wanted other people to do all the work, based on a contrived example and a vague suggestion.

    And we already knew from history that "a simple Bernstein hash suffices" is false for tuple hashing (for one possible meaning of "a simple Bernstein hash" - we had one of those 14 years ago).

    Neither Raymond nor I want to spend time on this ourselves, so in the absence of somebody else volunteering to do the work, there's no reason to keep the report open uselessly.

    If you're volunteering to do the work ... then do the work ;-)

    jdemeyer commented 5 years ago

    To come back to the topic of hashing, with "Bernstein hash" for tuples I did indeed mean the simple (in Python pseudocode):

    # t is a tuple
    h = INITIAL_VALUE
    for x in t:
        z = hash(x)
        h = (h * MULTIPLIER) + z
    return h

    I actually started working on the code and there is indeed an issue with the hashes of (X, (Y, Z)) and (Y, (X, Z)) being equal that way. But that can be solved by replacing "hash(x)" in the loop by something slightly more complicated. After thinking about it, I decided to go for shifting higher to lower bits, so replacing z by z^(z >> 32) or something like that.

    This also has the benefit of mixing the high-order bits into the low-order bits, which is an additional improvement over the current code.

    Since I already wrote the code anyway, I'll create a PR for it.
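
    For reference, a runnable sketch of the scheme described in this comment - DJBX33A-style accumulation applied to a pre-mixed per-item hash - with placeholder constants (not necessarily the ones the PR ended up using). The z pre-mixing is what is meant to break the (X, (Y, Z)) vs (Y, (X, Z)) symmetry mentioned above:

        MASK = (1 << 64) - 1
        INITIAL_VALUE = 0x345678       # placeholder seed
        MULTIPLIER = 1000003           # placeholder multiplier

        def proposed_tuple_hash(t):
            h = INITIAL_VALUE
            for x in t:
                z = hash(x) & MASK
                z ^= z >> 32           # fold high-order bits into the low-order bits
                h = ((h * MULTIPLIER) + z) & MASK
            return h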

    tim-one commented 5 years ago

    You said it yourself: "It's not hard to come up with ...". That's not what "real life" means. Here:

    >>> len(set(hash(1 << i) for i in range(100_000)))
    61

    Wow! Only 61 hash codes across 100 thousand distinct integers?!

    Yup - and I don't care. I contrived the case, just like you apparently contrived yours.

    jdemeyer commented 5 years ago

    For the record: my collision is not contrived. It came up in actual code where I was hashing some custom class using tuples and found an unexpected high number of collisions, which I eventually traced back to the collision I reported here.

    By the way, I think the worst real-life hash collision is

    >>> hash(-1) == hash(-2)
    True

    ericvsmith commented 5 years ago

    I agree with Raymond and Tim here: while it's inevitably true that there are many possible hash collisions, we'd need to see where the current algorithm caused real problems in real code. Pointing out a few examples where there was a collision isn't very helpful.

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    jdemeyer commented 5 years ago

    So, jdemeyer, if it's possible to show (or describe) to us an example of a problem you had, such that we could repeat it, that would be helpful (and necessary) to evaluate any proposed changes. What were the inputs to hash() that caused a problem, and how did that problem manifest itself?

    In all honesty, I don't remember. This was a while ago and at that time I didn't care enough to put up a CPython bug report. Still, this is a collision for tuples of short length (3) containing small integers (0 and -2). Why do you find that contrived?

    What prompted me to report this bug now anyway is that I discovered bad hashing practices for other classes too. For example, meth_hash in Objects/methodobject.c simply XORs two hashes and XOR tends to suffer from catastrophic cancellation. So I was hoping to fix tuple hashing as an example for other hash functions to follow.

    jdemeyer commented 5 years ago

    I should also make it clear that the collision hash((1,0,0)) == hash((1,-2,-2)) that I reported is due to the algorithm, it's not due to some bad luck that 2 numbers happen to be the same. There are also many more similar hash collisions for tuples (all involving negative numbers) due to the same underlying mathematical reason.

    I still don't understand why you don't consider it a problem. It may be a tiny problem, not really affecting the correct functioning of CPython. But, if we can fix this tiny problem, why shouldn't we?

    ericvsmith commented 5 years ago

    I'm not one to judge the effectiveness of a new hash algorithm. But my personal concern here is making something else worse: an unintended consequence. IMO the bar for changing this would be very high, and at the least it would need to be addressing a known, not theoretical, problem.

    That said, I'll let the experts debate your proposed change.

    rhettinger commented 5 years ago

    ISTM the collision of "hash((1,0,0)) == hash((1,-2,-2))" also stems from integers hashing to themselves and that two's-complement arithmetic relation, -x == ~x + 1. Further, that example specifically exploits that "hash(-1) == hash(-2)", due to how error codes are handled. In a way, tuple hashing is incidental.

    Occasional collisions aren't really a big deal -- it is normal for dicts to have some collisions and to resolve them. I would be more concerned if non-contrived datasets had many elements that gave exactly the same hash value, resulting in a pile-up in one place.

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

    The current hash is in the pretty-darned-good category and so we should be disinclined to change battle-tested code that is known to work well across a broad range of use cases.

    jdemeyer commented 5 years ago

    Further, that example specifically exploits that "hash(-1) == hash(-2)"

    No, it does not. There is no hashing of -1 involved in the example hash((1,0,0)) == hash((1,-2,-2)).

    tim-one commented 5 years ago

    For me, it's largely because you make raw assertions with extreme confidence that the first thing you think of off the top of your head can't possibly make anything else worse. When it turns out it does make some things worse, you're equally confident that the second thing you think of is (also) perfect.

    So far we don't even have a real-world test case, let alone a coherent characterization of "the problem":

    """ all involving negative numbers) due to the same underlying mathematical reason. """

    is a raw assertion, not a characterization. The only "mathematical reason" you gave before is that "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m". Which is true, and a good observation, but doesn't generalize as-is beyond -2.

    Stuff like this from the PR doesn't inspire confidence either:

    """ MULTIPLIER = (1000003)**3 + 2 = 1000009000027000029: the multiplier should be big enough and the original 20-bit number is too small for a 64-bit hash. So I took the third power. Unfortunately, this caused a sporadic failure of the testsuite on 32-bit systems. So I added 2 which solved that problem. """

    Why do you claim the original was "too small"? Too small for what purpose? As before, we don't care whether Python hashes "act random". Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat? What was _the cause_ of the "sporadic failure" (whatever that means), and why did adding 2 fix it? Why isn't there a single word in _the code_ about where the mystery numbers came from?

    You're creating at least as many mysteries as you're claiming to solve.

    We're not going to change such heavily used code on a whim.

    That said, you could have easily enough _demonstrated_ that there's potentially a real problem with a mix of small integers of both signs:

    >>> from itertools import product
    >>> cands = list(range(-10, -1)) + list(range(9))
    >>> len(cands)
    18
    >>> _ ** 4
    104976
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    35380

    And that this isn't limited to -2 being in the mix (and noting that -1 wasn't in the mix to begin with):

    >>> cands.remove(-2)
    >>> len(cands) ** 4
    83521
    >>> len(set(hash(t) for t in product(cands, repeat=4)))
    33323

    If that's "the problem", then - sure - it _may_ be worth addressing. Which we would normally do by looking for a minimal change to code that's been working well for over a decade, not by replacing the whole thing "just because".

    BTW, continuing the last example:

    >>> from collections import Counter
    >>> c1 = Counter(hash(t) for t in product(cands, repeat=4))
    >>> Counter(c1.values())
    Counter({1: 11539, 2: 10964, 4: 5332, 3: 2370, 8: 1576, 6: 1298, 5: 244})

    So despite that there were many collisions, the max number of times any single hash code appeared was 8. That's unfortunate, but not catastrophic.

    Still, if a small change could repair that, fine by me.

    tim-one commented 5 years ago

    Oops!

    """ "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m" """

    The tail end should say "m*(j^(-2)) == -m*j" instead.

    jdemeyer commented 5 years ago

    FWIW, the current algorithm also has some advantages that we don't want to lose. In particular, the combination of the int hash and tuple hash are good at producing distinct values for non-negative numbers:

        >>> from itertools import product
        >>> len(set(map(hash, product(range(100), repeat=4))))
        100000000

    FWIW: the same is true for my new hash function

    jdemeyer commented 5 years ago

    Concerning the MULTIPLIER:

    Why do you claim the original was "too small"? Too small for what purpose?

    If the multiplier is too small, then the resulting hash values are small too. This causes collisions to appear for smaller numbers:

    BEFORE:
    >>> L = [(a,b) for a in range(2) for b in range(2**21)]
    >>> len(L), len(set(hash(x) for x in L))
    (4194304, 2097152)
    
    AFTER:
    >>> L = [(a,b) for a in range(2) for b in range(2**21)]
    >>> len(L), len(set(hash(x) for x in L))
    (4194304, 4194304)

    Again, not a huge problem, but at least there is room for improvement.

    Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat? What was _the cause_ of the "sporadic failure" (whatever that means)

    With "sporadic failure", I mean a hash collision not due to the algorithm but due to two numbers which happen to be the same. This might not even affect actual usage, but it did cause the testsuite to fail on 32 bit machines (48 collisions if I recall correctly while only 20 are allowed).

    and why did adding 2 fix it?

    Because it's really just random chance. Some constants just happen to produce more collisions on the specific testsuite input. It should also be said that, due to the structure of that input, collisions are not independent. For example, if hash((a,b,c)) == hash((d,e,f)), then also hash((a,b,c,x)) == hash((d,e,f,x)).

    Why isn't there a single word in _the code_ about where the mystery numbers came from?

    Ideally, it should be just a random number. I can fix that by using a "Nothing up my sleeve number".

    jdemeyer commented 5 years ago

    I pushed a minor update to the PR, changing the MULTIPLIER and explaining the chosen constants.

    tim-one commented 5 years ago

    Why do you claim the original was "too small"? Too small for what purpose?

    If the multiplier is too small, then the resulting hash values are small too. This causes collisions to appear for smaller numbers:

    All right! An actual reason ;-) So there are two things so far I clearly agree with, but they're independent of "the problem" this report was opened about:

    1. On a 64-bit box life would be better if we picked a bigger multiplier. This should be done via preprocessor #if selection, though, not via the inscrutable rules C uses to cut back an integer literal out of range for the type of the variable it's assigned to.

    2. A raw integer hash of -1 should be changed to something other than -2. It was always a Poor Idea to pick an integer of small magnitude for the substitute. Some newer types didn't repeat this mistake. For example, frozensets replace a raw hash of -1 with 590923713.

    Alas, this is hard to change now, because hashable objects of any type that compare equal to -1 also need to return the same substitute value (e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2). So I'm afraid it's been hard-coded into user-defined numeric classes too :-( There would be no obvious problem in changing the _tuple_ hash to use a different substitute value, although it's not clear that would buy anything either. hash(-1) == -2 was the original sin.

    [about multipliers]

    Because it's really just random chance. ... Ideally, it should be just a random number.

    How do you know this? I'm not aware of any research papers about picking multipliers in this context, but would love to see one.

    I am aware of much research about multipliers in the somewhat related contexts of linear congruential pseudo-random number generators, and of "multiplicative hashing", and "any random multiplier is fine" couldn't be further from the truth _in_ those contexts.

    Guido picked a prime to begin with (actually, at first it was 3(!)), and comments in the current code sure say that whoever added this part was keen on primes too:

    """ /* The addend 82520, was selected from the range(0, 1000000) for generating the greatest number of prime multipliers for tuples up to length eight: """

    I don't know that primes are important here, but neither do I know that they're _not_ important here. Widely used production code isn't the place to conduct experiments, so the status quo is favored in the absence of truly compelling reasons. Just repeating the raw assertion "random is fine" isn't compelling. If it were true, we could, e.g., multiply by a large power of 2 via shifting instead, and save a few cycles.

    For my best guess at what "the problem" here actually is, in the

    >> cands = list(range(-10, -1)) + list(range(9))
    >> len(set(hash(t) for t in product(cands, repeat=4)))

    example, which currently (on my 64-bit box) generates 35,380 distinct hash codes from the 104,976 inputs, ALL the collisions go away merely by changing the character "^" to "+" in the current code.

    Which was your original suggestion. Which you appear to be opposed to now? I'm unclear about why, if so.

    That's a change I could probably live with, although I have no strong reason to believe it couldn't cause problems for applications that currently work fine. I don't have more time to think about that now, though.

    jdemeyer commented 5 years ago

    I don't know that primes are important here, but neither do I know that they're _not_ important here.

    Hashes are effectively computed modulo 2**N. "Primes" are meaningless in that setting (except for the prime 2 which does have a meaning). For example, 1000003 is prime but 1000003 + 2**64 is not. But these represent the same number modulo 2**64. Also, multiplication by any odd number is a permutation modulo 2**N, so every odd number is invertible.
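
    That invertibility is easy to check directly; a small sanity check (assuming Python 3.8+ for pow with a negative exponent):

        N = 2 ** 64
        m = 1000003                    # any odd multiplier
        m_inv = pow(m, -1, N)          # exists because gcd(m, 2**64) == 1
        assert (m * m_inv) % N == 1
        # so x -> (x * m) % N is a bijection: multiplying by m_inv undoes it
        assert (12345 * m * m_inv) % N == 12345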

    jdemeyer commented 5 years ago

    Which was your original suggestion. Which you appear to be opposed to now? I'm unclear about why, if so.

    I'm not strictly opposed to that. It's just that I have less confidence in the current ad-hoc hash compared to something following the DJBX33A scheme.

    jdemeyer commented 5 years ago

    I'm not aware of any research papers about picking multipliers in this context, but would love to see one.

    The only real condition that I can think of is that the order should be large: we do not want MULTIPLIER**n = 1 (mod 2**N) for a small number n.

    Other than that, we could pick the multiplier to guarantee no hash collisions on some chosen subset of inputs. A bit like your product(range(100), repeat=4) example but then for more inputs.

    If you agree with this approach, I could try to find a good multiplier this way.
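
    As a sketch of how the order condition above could be checked: for N >= 3 the group of odd residues modulo 2**N is isomorphic to Z/2 x Z/2**(N-2), so the order of any odd multiplier is a power of two and repeated squaring finds it:

        def order_mod_2N(m, N=64):
            # multiplicative order of an odd m modulo 2**N (always a power of two)
            assert m % 2 == 1
            mod = 1 << N
            order, mk = 1, m % mod
            while mk != 1:
                mk = (mk * mk) % mod
                order *= 2
            return order

        # a multiplier whose order is tiny would be a poor choice
        print(order_mod_2N(1000003))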

    tim-one commented 5 years ago

    So you don't know of any directly relevant research either. "Offhand I can't see anything wrong" is better than nothing, but very far from "and we know it will be OK because [see references 1 and 2]".

    That Bernstein's DJBX33A has been widely used inspires no confidence whatsoever. As already explained, Python _did_ use DJBX33X (with a different multiplier), and it systematically screwed up catastrophically in some nested tuple cases already spelled out. Bernstein himself moved to DJBX33X (using xor instead of addition), and that would have screwed up too on a mix of smallish integers of both signs.

    What Python does now, except for changing the multiplier, is essentially version 1a of the _also_ very widely used - but more recent - Fowler/Noll/Vo hash family[1]:

        # Version 1a, recommended over version 1 (which does * before ^).
        hash = offset_basis
        for each octet_of_data to be hashed
            hash = hash xor octet_of_data
            hash = hash * FNV_prime
        return hash

    They did extensive studies, and very deliberately use a prime multiplier subject to a number of other constraints. Not based on "offhand I can't see why not", but based on analysis and running extensive tests.

    But, same as with Bernstein's variants (which predate FNV's work on picking "good" multipliers), they were developed in the context of hashing sequences of unsigned 8-bit integers. There's no a priori reason I can see to expect them to work well in contexts other than that. Indeed, FNV puts special constraints on the last 8 bits of the primes they pick, presumably because they're only concerned with hashing sequences of 8-bit values.

    FYI, for 32-bit hashes they use multiplier 16777619, and for 64-bit 1099511628211.
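
    For concreteness, FNV-1a in its original context (hashing a sequence of octets) can be written out in a few lines of Python; the offset-basis constants below are the published ones as far as I know, and the function name is just for illustration:

        def fnv1a(data: bytes, bits=64):
            if bits == 32:
                h, prime, mask = 2166136261, 16777619, (1 << 32) - 1
            else:
                h, prime, mask = 14695981039346656037, 1099511628211, (1 << 64) - 1
            for octet in data:
                h ^= octet                 # xor in the next byte...
                h = (h * prime) & mask     # ...then multiply by the FNV prime
            return h

        print(hex(fnv1a(b"hello", bits=32)), hex(fnv1a(b"hello")))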

    In the absence of coherent analysis directly relevant to what Python actually needs here, we're all flying blind. What we do have is over a decade of positive real-world experience with the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for. Which made it close to FNV-1a, but also in a context _that_ wasn't designed for.

    However, it _is_ structurally close to FNV-1a, and using prime multipliers was a big deal to them. "But offhand I don't see why" isn't a good enough reason to abandon that. To the contrary, in the absence of _proof_ that it doesn't actually matter, I'd be happiest using the same multipliers (given above) and initial values "the standard" FNV-1a algorithms use.

    [1] http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a

    jdemeyer commented 5 years ago

    Thanks for the reference, I never heard of the FNV hash. Unfortunately, I haven't seen a reference for the rationale of how they pick their multiplier.

    It's not clear what you are suggesting though. Keep using the FNV-ish hash somehow? Or using a DJBX33A hash with the FNV multiplier?

    jdemeyer commented 5 years ago

    the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for.

    But we know why the DJBX33A hash didn't work (nested tuples), so we can think of the best way to solve that. Python messes with the multiplier, which makes it quite a different hash. Surely, if you believe that the precise choice of multiplier matters a lot, then you should also agree that arbitrarily changing the multiplier in the loop is a bad idea.

    My proposal instead is to keep the structure of the DJBX33A hash but change the hash of the individual items to be hashed. That's a much less invasive change to the known algorithm.

    Finally, something that I haven't mentioned here: an additional advantage of my approach is that high-order bits become more important:

    BEFORE:
    >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
    500000
    
    AFTER:
    >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
    1000000

    Again, I'm not claiming that this is a major issue. Just additional evidence that maybe my new hash might actually be slightly better than the existing hash.

    rhettinger commented 5 years ago

    I don't think any change should be made unless we agree that there is a real problem to be solved rather than because the OP is brazenly insistent on forcing through some change. We shouldn't feel shoved into altering something that we don't agree is broken and into replacing it with something that we're less sure about. Also, I'm concerned that the patch drops features like the post-addition of a constant and the incorporation of the length signature -- it is as if we're starting from scratch rather than keeping the incremental improvements made over decades. AFAICT, the only motivation for change is that the OP is relentless; otherwise, none of us would be giving this further thought. I echo Eric's concern that it is easy to inadvertently make Python worse.

    tim-one commented 5 years ago

    I strive not to believe anything in the absence of evidence ;-)

    FNV-1a supplanted Bernstein's scheme in many projects because it works better. Indeed, Python itself used FNV for string hashing before the security wonks got exercised over collision attacks. It worked great. But "better" depends on what you value. For example, FNV-1a has far better "avalanche" behavior than Bernstein[1].

    Please note that I don't _object_ to your code! It may work fine. Or it may not in some cases. The problem is that we have no way to tell. There's no theory here, just misleading appeals to experience with the algorithms in contexts they were _intended_ to be used in. Nobody expected some patterns of nested tuples to fail catastrophically before, and nobody expected mixed-sign integers to lead to poor (but not catastrophic) behavior after the former was "repaired". Nobody now expects the next 10 catastrophes either.

    We can only rely on contrived tests and bug reports. Python's current scheme is unbeatable on that count, because it's the only scheme for which over a decade of experience _in context_ exists.

    Which experience says there are no known catastrophically bad real cases. Which is worth a whole lot in a project that's so widely used now.

    That said, the "repair" before was at least as much pure hackery as your proposed hackery, and - I agree - completely undercut the _theory FNV was based on (although, to be fair, I doubt anyone else _knew they were re-inventing a damaged FNV-1a at the time).

    So ... since FNV-1a is in fact better in its intended context than the Bernstein one-liners in the same context, how about adding your

    t += (t >> (4 * sizeof(t)));

    hack to the _current_ code (& delete the code changing the multiplier)? Then we could switch to the "official" FNV 32-bit and 64-bit multipliers too, and know at least that "almost everything" about the resulting algorithm is known to work exceedingly well in its original context.

    [1] https://sites.google.com/site/murmurhash/avalanche

    tim-one commented 5 years ago

    Raymond, I share your concerns. There's no reason at all to make gratuitous changes (like dropping the "post-addition of a constant and incorporating length signature"), apart from that there's no apparent reason for them existing to begin with ;-)

    Unintended consequences can abound.

    Still, the last repair was pretty hacky, and a very well-known and highly respected algorithm (FNV-1a) _is_ hiding under it. I would like to pursue seeing whether a more direct form of FNV-1a can be rehabilitated to worm around the known problematic cases. That would also, if successful, give us a principled way to pick a better multiplier for 64-bit boxes.

    jdemeyer commented 5 years ago

    We shouldn't feel shoved into altering something that we don't agree is broken

    It *is* broken. You are just denying the reality.

    That's also the reason that I'm insisting so much: I don't want to push my personal fix (despite that it may seem so), I want to fix a bug.

    tim-one commented 5 years ago

    Oh, I don't agree that it's "broken" either. There's still no real-world test case here demonstrating catastrophic behavior, neither even a contrived test case demonstrating that, nor a coherent characterization of what "the problem" is.

    I'm nevertheless open to making improvements, but keeping in mind foremost that _all_ changes are potentially damaging to patterns of data we know nothing about precisely because they've been working fine for many years. So, e.g., there's no chance that gratuitous changes will be accepted. For example, don't _like_ adding 97531UL at the end? Tough luck - it's staying, and it's not worth one word of argument.

    To my eyes, the _strongest_ case in all these messages is for boosting the multiplier size on 64-bit boxes. That's principled and well-motivated. Python itself changed the multiplier from 3 to 1000003 long ago for the same reasons. But that apparently has nothing to do with "the problem" this report was opened about ;-)

    rhettinger commented 5 years ago

    To my eyes, the _strongest_ case in all these messages is for boosting the multiplier size on 64-bit boxes. That's principled and well-motivated.

    Yes, that would be perfectly reasonable (though to some extent the objects in the tuple also share some of the responsibility for getting all bits into play).

    tim-one commented 5 years ago

    Has anyone figured out the real source of the degeneration when mixing in negative integers? I have not. XOR always permutes the hash range - it's one-to-one. No possible outputs are lost, and XOR with a negative int isn't "obviously degenerate" in general:

    >>> for i in range(-16, 17):
    ...    print(i, i ^ -5)
    -16 11
    -15 10
    -14 9
    -13 8
    -12 15
    -11 14
    -10 13
    -9 12
    -8 3
    -7 2
    -6 1
    -5 0
    -4 7
    -3 6
    -2 5
    -1 4
    0 -5
    1 -6
    2 -7
    3 -8
    4 -1
    5 -2
    6 -3
    7 -4
    8 -13
    9 -14
    10 -15
    11 -16
    12 -9
    13 -10
    14 -11
    15 -12
    16 -21

    Clear enough? "Distinct integers in, distinct integers out", but with the twist that the sign bit flips. That's _seemingly_ minor to me. Yet it has massive effects on tuple hash distribution. Here's a function to show that:

        def doit(cands, repeat=4):
            from collections import Counter
            from itertools import product
            print(len(cands), "cands **", repeat, "=", len(cands)**repeat)
            c1 = Counter(map(hash, product(cands, repeat=repeat)))
            print(len(c1), "distinct hash codes")
            c2 = Counter(c1.values())
            for dups in sorted(c2):
                print(dups, c2[dups])

    Then an example we're proud of:

    >>> doit(range(100))
    100 cands ** 4 = 100000000
    100000000 distinct hash codes
    1 100000000

    Much the same, mixing in negative ints, but _excluding_ the problematic -1 and -2 inputs:

    >>> cands = list(range(-50, 51))
    >>> cands.remove(-1) # hashes to -2
    >>> cands.remove(-2) # for j odd, j ^ -2 == -j
    >>> doit(cands)
    99 cands ** 4 = 96059601
    15736247 distinct hash codes
    1 70827
    2 1005882
    3 228578
    4 5000424
    5 19728
    6 1047762
    8 8363046

    What accounts for such a massive difference? It's not merely that we're using negative ints:

    >>> doit(range(-101, -1)) # leave -2 in, for a change
    100 cands ** 4 = 100000000
    100000000 distinct hash codes
    1 100000000

    So, on their own, negative ints are as spectacularly well-behaved in this range as positive ints, and including -2 creates no problems.

    I suspect it's related to that x ^ -(x + 1) == -1, but not in a way obvious enough to be ... well, obvious ;-)

    tim-one commented 5 years ago

    [Raymond, on boosting the multiplier on 64-bit boxes]

    Yes, that would be perfectly reasonable (though to some extent the objects in the tuple also share some of the responsibility for getting all bits into play).

    It's of value independent of that. Tuples of ints used as keys and set elements are very important, and I doubt we'll ever give up on that hash(i) == i for the typically "not huge" ints used in such contexts. Jeroen gave a reasonable example of how boosting the multiplier can help in a case of that above:

    https://bugs.python.org/msg326032

    tim-one commented 5 years ago

    FYI, using this for the guts of the tuple hash works well on everything we've discussed. In particular, no collisions in the current test_tuple hash test, and none either in the cases mixing negative and positive little ints. This all remains so using the current multiplier, or "the standard" FNV-1a multipliers 16777619UL (32 bit) and 1099511628211UL (64 bit). I've done no other tests:

    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        Py_uhash_t t = (Py_uhash_t)y;
        t ^= t << 7;
        x = (x ^ t) * mult;
    }

    Note that the multiplier doesn't change anymore. The twist is adding

        t ^= t << 7;

    to _permute_ the native hash space (stuff like "t += t >> 16" is a many-to-one function, not a permutation). Other than that, it's exactly FNV-1a. I don't know that 7 is especially magical - it's just the first thing I tried.

    In the absence of a real analysis, the intuition is simply that "t ^= t << 7" will clear masses of leading sign bits when hashing "small" negative integers. For whatever reason(s), they appear to be the cause of the troubles.

    However, since that line of code permutes the space, exactly the same statistics can be provoked by some other inputs.
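
    Both halves of that claim are easy to check exhaustively on a scaled-down word size (16 bits here, purely to keep the check cheap): the xor-shift is a permutation, while an add-shift of the t += t >> k kind mentioned above is not:

        BITS = 16
        MASK = (1 << BITS) - 1

        xorshift = {(t ^ (t << 7)) & MASK for t in range(1 << BITS)}
        addshift = {(t + (t >> 7)) & MASK for t in range(1 << BITS)}

        print(len(xorshift))   # 65536: t ^= t << 7 is one-to-one (a permutation)
        print(len(addshift))   # fewer: t += t >> 7 maps several inputs to one output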

    tim-one commented 5 years ago

    BTW, those tests were all done under a 64-bit build. Some differences in a 32-bit build:

    1. The test_tuple hash test started with 6 collisions. With the change, it went down to 4. Also changing to the FNV-1a 32-bit multiplier boosted it to 8. The test passes in all those cases.

    2. For 4-tuples taken from range(-50, 51) omitting -1 and -2: massive number of collisions under the current code. Died with MemoryError with the change - ran out of memory to build the Counter recording all the distinct hash codes.

    Changing it to 3-tuples instead: somewhere around 140 collisions with the change, and also changing to the FNV-1a 32-bit multiplier eliminated all collisions.

    All of this was done under Win 10, using Visual Studio 2017.

    jdemeyer commented 5 years ago

    Has anyone figured out the real source of the degeneration when mixing in negative integers?

    The underlying reason for the collisions is the following mathematical relation:

    x ^ -x = -2 << i

    where i is the index of the smallest set bit of x. In particular, x ^ -x = -2 for odd x.

    Now consider two consecutive hash iterations:

    y = (x ^ a) * m1
    z = (y ^ b) * m2

    and suppose that x ^ a is odd. Now if we replace a by a ^ -2, then x ^ a will be replaced by -(x ^ a) and y will be replaced by -y. If we also replace b by b ^ -2, then y ^ b is unchanged: since y is odd (the multiplier is odd), -y == y ^ -2, so (-y) ^ (b ^ -2) == y ^ b. In other words, we have a collision.

    This kind of bad interaction between ^ and * only happens with the FNV-style hashing. The Bernstein hash using + instead of ^ does not suffer from this problem. That makes it a better choice in my opinion.
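
    That two-step cancellation is easy to verify numerically; a quick check of the argument above, with arbitrary odd multipliers and everything reduced modulo 2**64:

        MASK = (1 << 64) - 1

        def two_steps(x, a, b, m1, m2):
            y = ((x ^ a) * m1) & MASK
            return ((y ^ b) * m2) & MASK

        x, a, b = 0x345678, 17, 123456    # chosen so that x ^ a is odd
        m1, m2 = 1000003, 1082527         # any odd multipliers
        NEG2 = -2 & MASK

        assert two_steps(x, a, b, m1, m2) == two_steps(x, a ^ NEG2, b ^ NEG2, m1, m2)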

    jdemeyer commented 5 years ago

    While writing up the analysis above, it occurred to me that collisions already happen for 2-tuples:

    >>> hash((3, -2)) == hash((-3, 0))
    True

    These kinds of 2-tuples of small integers don't look contrived at all. I can easily see them appearing in mathematical applications, for example.

    As for real-world usage: the only thing that I can say is that I discovered these hash collisions a while ago, while working on SageMath. I was testing the hash for a custom class and I found collisions, which I traced back to collisions for tuples.

    In any case, it is hard to find real-world problems where a bad hash really matters, since Python works fine with a broken hash too.

    jdemeyer commented 5 years ago

    stuff like "t += t >> 16" is a many-to-one function, not a permutation

    Yes, I am aware of that. However, the number of collisions here is really quite small. It's very unlikely to hit one by accident.

    I also chose >> over << for two reasons:

    1. It brings the high-order bits into play: https://bugs.python.org/msg326117

    2. It avoids collisions on the low-order bits: when you do t ^= t << 7, then you are not changing the lower 7 bits at all. So applications using hash(x) % 128 will still see all the problems that we are trying to fix.

    jdemeyer commented 5 years ago

    For example, FNV-1a has far better "avalanche" behavior than Bernstein

    We don't care about that though. We want to have no collisions, not a random output.

    jdemeyer commented 5 years ago

    In the absence of a real analysis, the intuition is simply that "t ^= t << 7" will clear masses of leading sign bits when hashing "small" negative integers.

    That's a clever solution. If you want to go that route, I would rather suggest t ^= t << 1, which is essentially the same fix but probably has fewer collisions on the low-order bits.

    jdemeyer commented 5 years ago

    I created a new PR based on Tim's t ^= t << 7 idea, except that I'm using << 1 instead, to have more mixing in the lower bits.

    With the standard FNV multiplier on 64 bits, I did get collisions while testing. I haven't figured out exactly why these occurred, but it's probably due to the high number of 0 bits. Instead, I chose 3**41 as multiplier. But of course, there are still plenty of bikeshedding opportunities for the multiplier...

    tim-one commented 5 years ago

    when you do t ^= t << 7, then you are not changing the lower 7 bits at all.

    I want to leave low-order hash bits alone. That's deliberate.

    The most important tuple component types, for tuples that are hashable, are strings and contiguous ranges of "not huge" ints. The current string hash works hard to "act randomly" in every bit position - there's no point in trying to "improve" anything about the hashes it already produces.

    In contrast, the hashes of any contiguous range of N not-huge integers (excluding -1) are already, by construction, guaranteed to have no collisions at all in their low-order (roughly) log2(N) bits. They can't be improved in this respect, because they're already optimal in this respect.

    So, if anything, I'd look at increasing the left shift rather than reducing it. The low bits are already un-improvable in the most important cases.

    So applications using hash(x) % 128 will still see all the problems that we are trying to fix.

    ? The only "problem" I've seen identified was in mixing positive and negative integers. Here on a 32-build with the FNV-1a 32-bit multiplier:

    >>> from itertools import product
    >>> from collections import Counter
    >>> cands = list(range(-50, 51))
    >>> cands.remove(-1)
    >>> c = Counter()
    >>> for t in product(cands, repeat=4):
    ...     c[hash(t) & 0x7f] += 1
    ...
    >>> len(c)
    128

    So all 128 lower 7-bit patterns showed up. And they're quite evenly distributed:

    >>> c2 = Counter(c.values())
    >>> for k in sorted(c2):
    ...     print(k, c2[k])
    ...
    781202 1
    781207 1
    781209 2
    781212 1
    781213 2
    781214 1
    781215 4
    781221 3
    781222 1
    781225 3
    781226 1
    781227 3
    781228 2
    781229 2
    781230 1
    781231 1
    781232 2
    781233 3
    781234 1
    781235 4
    781236 2
    781237 1
    781238 2
    781240 5
    781241 6
    781242 1
    781243 1
    781244 1
    781245 1
    781247 1
    781248 2
    781251 2
    781252 4
    781253 3
    781254 5
    781255 2
    781256 2
    781257 3
    781259 2
    781260 1
    781261 1
    781262 1
    781263 2
    781264 4
    781265 2
    781266 1
    781267 1
    781268 4
    781269 1
    781270 1
    781271 2
    781272 1
    781274 2
    781275 1
    781276 1
    781278 1
    781280 1
    781281 2
    781282 2
    781285 1
    781286 2
    781288 1
    781295 1
    781297 2
    781301 1
    781302 1
    781304 1
    781307 1

    With the standard FNV multiplier on 64 bits, I did get collisions while testing.

    When testing what, specifically? And the standard 32-bit FNV multiplier, or the standard 64-bit FNV multiplier?

    Instead, I chose 3**41 as multiplier. But of course, there are still plenty of bikeshedding opportunities for the multiplier...

    Not for me. If the universally used 64-bit FNV multiplier can't be used in the context of Python's tuple hash fiddled to use an almost-pure form of FNV-1a, then that approach is dead to me. Needing to pick multipliers out of thin air instead implies the theory underlying FNV-1a doesn't transfer to this context.

    About which I have no current opinion. It may or may not. Since we're flying blind, I'm just willing to _assume_ it does until proven otherwise by testing.

    tim-one commented 5 years ago

    Jeroen, I understood the part about -2 from your initial report ;-) That's why the last code I posted didn't use -2 at all (neither -1, which hashes to -2). None of the very many colliding tuples contained -2 in any form. For example, these 8 tuples all have the same hash now:

        (-4, -4, -4,  40)
        (-4, -4,  4, -48)
        ( 4,  4, -4,  40)
        ( 4,  4,  4, -48)
        (-4, 28, -28, -48)
        (-4, 28,  28,  40)
        ( 4, -28, -28, -48)
        ( 4, -28,  28,  40)

    Your last example (with (3, -2) and (-3, 0)) also implicitly relies on that:

    j is even implies (j ^ -3) == -(j ^ 3)

    There are apparently piles of similar identities :-(

    I appreciate that

    a*M + C = b*M + C (mod N)

    implies

        a = b (mod N)

    when M is coprime to N, and also that the theory of linear combinations modulo 2**K is far better known than the well-hidden theory FNV developed.

    tim-one commented 5 years ago

    advantage of my approach is that high-order bits become more important:

    I don't much care about high-order bits, beyond that we don't systematically _lose_ them. The dict and set lookup routines have their own strategies for incorporating high-order bits into the probe sequence, and those routines are the primary reason hash() exists in Python.

    So there's scant reason to try to let high bits influence low bits (but good reason _not_ to, because - as explained before - the low-order bits are already in great shape for the important tuple component types).

    Something else appears to be going on in this specific example, which has a very touchy set of hash codes:

    BEFORE:
    >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
    500000

    Which is what I get too on (at least) a 64-bit build.

    AFTER:
    >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
    1000000

    Ditto. However, I get exactly the same from the code I posted before, which has no right shifts at all.

    More, if I take the code exactly as it exists today, and merely comment out the

    mult += (Py_hash_t)(82520UL + len + len);

    line, I also get a million distinct hash codes.

    Half of the hash codes have bit 2**60 set, and apart from those all the hash codes have no bits set higher than 2**6. Any number of accidents could cause the 2**60 bits to wipe each other out, and merely changing the sequence of multipliers was demonstrably enough to stop that.

    Indeed, merely changing 82520 to 82524 in the current code is enough to get a million unique hash codes in this case.