larryhastings opened 2 years ago
+1. Changing to a slightly better algorithm than "xor everything together" may reduce some collisions as well, especially since some of the things code objects deal with (counts of different argument types) are small integers. For instance:
```pycon
>>> f, g, h = (lambda x: None), (lambda x, /: None), (lambda *, x: None)
>>> hash(f.__code__), hash(g.__code__), hash(h.__code__)  # not all the same, but pretty close
(4013288845865138821, 4013288845865138820, 4013288845865138821)
```
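For illustration, here's roughly what one such approach could look like at the Python level. This is a sketch of one possible mixing strategy (hashing a tuple of the components), not the exact algorithm CPython ended up with:

```python
def tuple_style_code_hash(code):
    # Hashing a tuple of the fields lets CPython's well-mixed tuple
    # hash do the combining, so small integers such as the argument
    # counts no longer differ in only a bit or two of the result,
    # the way they do when the fields are xor-ed together directly.
    return hash((
        code.co_name, code.co_consts, code.co_names,
        code.co_varnames, code.co_freevars, code.co_cellvars,
        code.co_argcount, code.co_posonlyargcount,
        code.co_kwonlyargcount, code.co_flags,
    ))
```

With a combiner like this, the three lambdas above hash far apart, because changing co_posonlyargcount from 0 to 1 perturbs the entire tuple hash rather than flipping a single low bit.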
We dropped co_code from the hash computation in https://github.com/python/cpython/pull/31888. It should be easy enough to restore it:

```c
hash ^= PyObject_Hash(PyCode_GetCode(co));  /* Pretend there is error checking here */
```
Do we care about code mutation from `add_load_fast_null_checks`? (no longer exists)
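As a quick sanity check at the Python level (assuming, as I understand it, that `co_code`, like `PyCode_GetCode`, reads back the original unquickened bytecode), specialization shouldn't make a bytecode-based hash unstable:

```python
def f(x):
    return x + 1

before = f.__code__.co_code
for i in range(10_000):   # give the specializing interpreter time to quicken f
    f(i)
after = f.__code__.co_code
assert before == after    # co_code is stable even after specialization
```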
It looks like https://github.com/python/cpython/commit/c7e5bbaee88a71dc6e633e3cd451ed1798436382 helped things, but there is still a decent probability for collisions.
The following code runs into quadratic compile times from hash collisions:
```python
# Inspired by https://github.com/sympy/sympy/blob/9ddfb0600679c2f5ccfb809b6827bf35eeea2c7c/sympy/printing/mathematica.py#L16-L23
funcs = {
    # Everything is the same but the co_firstlineno
    'func0001': lambda: 1,
    'func0002': lambda: 1,
    'func0003': lambda: 1,
    'func0004': lambda: 1,
    ...
}
```
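To reproduce without writing the entries out by hand, a sketch like the following generates the module source and times `compile()` at a few sizes; if the code objects all collide, the timings grow superlinearly instead of roughly doubling (the helper name `build_source` is just for this illustration):

```python
import time

def build_source(n):
    # n dict entries whose lambda code objects differ only in co_firstlineno
    lines = ["funcs = {"]
    lines += [f"    'func{i:04}': lambda: 1," for i in range(n)]
    lines.append("}")
    return "\n".join(lines)

for n in (1_000, 2_000, 4_000, 8_000):
    src = build_source(n)
    start = time.perf_counter()
    compile(src, "<generated>", "exec")
    print(n, f"{time.perf_counter() - start:.3f}s")
```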
If we included the filename (`co_filename`) in `code_hash`, then we'd also need to include it in `code_richcompare`.
Commit a98d9ea56e7b473af54438ecc487a6bf1b4d6530 included `co_firstlineno` and the bytecode. Is it worth backporting to 3.11?
Thanks for picking up this project @sweeneyde!
+1 for including `co_filename` in hashing and equality. This has actually bitten me in one of my projects (specialist), where code objects seemed to be disappearing when thrown into a set. Sort of an esoteric use case, but a real one nonetheless (I have a post-it note on my desk to open an issue for this, but never got around to it).
-1 for backporting anything that changes the rules of code object equality.
In Python 3.11 and 3.12, the hash function for a `PyCodeObject` (`code_hash()`) no longer hashes the bytecode. I assume this is because the specializing adaptive interpreter can change the bytecode however it likes, which means the bytecode is no longer immutable, which means you can't rely on it not changing, which means you can't use it as part of calculating the hash. Fair enough.

But this means that the hash of a code object no longer factors in the most unique part of a code object. Currently in 3.11 and 3.12 the hash of a code object is calculated from the code object's name, its constants, its symbol names (globals, locals, free and cell variables), and its argument counts and flags.
Which means it's not hard to construct code objects with identical hashes but different bytecode. For example:
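A minimal illustration (three methods that agree on every hashed field and differ only in one operator, i.e. only in bytecode):

```python
class A:
    def method(self):
        return self.x + self.y

class B:
    def method(self):
        return self.x - self.y

class C:
    def method(self):
        return self.x * self.y

print(hash(A.method.__code__), hash(B.method.__code__), hash(C.method.__code__))
```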
The hashes for `A.method.__code__`, `B.method.__code__`, and `C.method.__code__` are different in Python 3.10, but identical in Python 3.11b3, and presumably in trunk as well.

Is this scenario realistic? I don't know. Certainly I've never seen it. But it's at least plausible: a base class could have a tiny function that only does a little work, and a subclass could override that function and tweak the work done slightly, without relying on additional external names / closures or employing a different list of locals. It's certainly not impossible.
Obviously this is low priority; there's very little code that hashes code objects. It might cause some collisions and probing when marshal dumps a module exhibiting this behavior, because marshal maintains a hash table of the objects it writes. That's the worst side-effect I can think of.
We could mitigate this slightly by factoring more values into the code object's hash. Three come immediately to mind:

- the filename (`co_filename`)
- the first line number (`co_firstlineno`)
- the position information (`co_positions`, somehow)

With only the first two, you'd still have hash collisions from code objects defined using lambdas all on the same line:
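For instance (an illustrative sketch; any bodies that compile to different bytecode on one line would do):

```python
# Same file, same line, same name/consts/names/varnames/arg counts;
# only the bytecode differs, so co_filename and co_firstlineno can't help.
f, g, h = (lambda x: x + x), (lambda x: x - x), (lambda x: x * x)
print(hash(f.__code__), hash(g.__code__), hash(h.__code__))
```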
As unlikely as it is that someone would stumble over this scenario, it's even less likely that it would cause a problem. But decreasing the likelihood of hash value collisions seems so wholesome and good, I think it's worth pursuing.
Linked PRs