larryhastings opened 2 years ago
+1. Changing to a slightly better algorithm than "xor everything together" may reduce some collisions as well, especially since some of the things code objects deal with (counts of different argument types) are small integers. For instance:
```pycon
>>> f, g, h = (lambda x: None), (lambda x, /: None), (lambda *, x: None)
>>> hash(f.__code__), hash(g.__code__), hash(h.__code__)  # not all the same, but pretty close
(4013288845865138821, 4013288845865138820, 4013288845865138821)
```
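For illustration, here's roughly what one such approach could look like at the Python level. This is a sketch of one possible mixing strategy (hashing a tuple of the components), not the exact algorithm CPython ended up with:

```python
def tuple_style_code_hash(code):
    # Hashing a tuple of the fields lets CPython's well-mixed tuple
    # hash do the combining, so small integers such as the argument
    # counts no longer differ in only a bit or two of the result,
    # the way they do when the fields are xor-ed together directly.
    return hash((
        code.co_name, code.co_consts, code.co_names,
        code.co_varnames, code.co_freevars, code.co_cellvars,
        code.co_argcount, code.co_posonlyargcount,
        code.co_kwonlyargcount, code.co_flags,
    ))
```

With a combiner like this, the three lambdas above hash far apart, because changing co_posonlyargcount from 0 to 1 perturbs the entire tuple hash rather than flipping a single low bit.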
We dropped co_code from the hash computation in https://github.com/python/cpython/pull/31888. It should be easy enough to restore it:

```c
hash ^= PyObject_Hash(PyCode_GetCode(co));  /* Pretend there is error checking here */
```
Do we care about code mutation from `add_load_fast_null_checks`? (no longer exists)
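As a quick sanity check at the Python level (assuming, as I understand it, that `co_code`, like `PyCode_GetCode`, reads back the original unquickened bytecode), specialization shouldn't make a bytecode-based hash unstable:

```python
def f(x):
    return x + 1

before = f.__code__.co_code
for i in range(10_000):   # give the specializing interpreter time to quicken f
    f(i)
after = f.__code__.co_code
assert before == after    # co_code is stable even after specialization
```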
It looks like https://github.com/python/cpython/commit/c7e5bbaee88a71dc6e633e3cd451ed1798436382 helped things, but there is still a decent probability for collisions.
The following code runs into quadratic compile times from hash collisions:
```python
# Inspired by https://github.com/sympy/sympy/blob/9ddfb0600679c2f5ccfb809b6827bf35eeea2c7c/sympy/printing/mathematica.py#L16-L23
funcs = {
    # Everything is the same but the co_firstlineno
    'func0001': lambda: 1,
    'func0002': lambda: 1,
    'func0003': lambda: 1,
    'func0004': lambda: 1,
    ...
}
```
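To reproduce without writing the entries out by hand, a sketch like the following generates the module source and times `compile()` at a few sizes; if the code objects all collide, the timings grow superlinearly instead of roughly doubling (the helper name `build_source` is just for this illustration):

```python
import time

def build_source(n):
    # n dict entries whose lambda code objects differ only in co_firstlineno
    lines = ["funcs = {"]
    lines += [f"    'func{i:04}': lambda: 1," for i in range(n)]
    lines.append("}")
    return "\n".join(lines)

for n in (1_000, 2_000, 4_000, 8_000):
    src = build_source(n)
    start = time.perf_counter()
    compile(src, "<generated>", "exec")
    print(n, f"{time.perf_counter() - start:.3f}s")
```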
If we included the filename (`co_filename`) in `code_hash`, then we'd also need to include it in `code_richcompare`.
Commit a98d9ea56e7b473af54438ecc487a6bf1b4d6530 included `co_firstlineno` and the bytecode. Is it worth backporting to 3.11?
Thanks for picking up this project @sweeneyde!
+1 for including `co_filename` in hashing and equality. This has actually bitten me in one of my projects (specialist), where code objects seemed to be disappearing when thrown into a set. Sort of an esoteric use case, but a real one nonetheless (I have a post-it note on my desk to open an issue for this, but never got around to it).
-1 for backporting anything that changes the rules of code object equality.
In Python 3.11 and 3.12, the hash function for a `PyCodeObject` (`code_hash()`) no longer hashes the bytecode. I assume this is because the specializing adaptive interpreter can change the bytecode however it likes, which means the bytecode is no longer immutable, which means you can't rely on it not changing, which means you can't use it as part of calculating the hash. Fair enough.

But this means that the hash of a code object no longer factors in the most unique part of a code object. Currently in 3.11 and 3.12 the hash of a code object is calculated from the code object's name, its constants, its symbol names (globals, locals, free and cell variables), and its argument counts and flags.
Which means it's not hard to construct code objects with identical hashes but different bytecode. For example:
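A minimal illustration (three methods that agree on every hashed field and differ only in one operator, i.e. only in bytecode):

```python
class A:
    def method(self):
        return self.x + self.y

class B:
    def method(self):
        return self.x - self.y

class C:
    def method(self):
        return self.x * self.y

print(hash(A.method.__code__), hash(B.method.__code__), hash(C.method.__code__))
```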
The hashes for `A.method.__code__`, `B.method.__code__`, and `C.method.__code__` are different in Python 3.10, but identical in Python 3.11b3, and presumably in trunk as well.

Is this scenario realistic? I don't know. Certainly I've never seen it. But it's at least plausible: a base class could have a tiny function that only does a little work, and a subclass could override that function and tweak the work done slightly, without relying on additional external names / closures or employing a different list of locals. It's certainly not impossible.
Obviously this is low priority; there's very little code that hashes code objects. It might cause some collisions and probing when marshal dumps a module exhibiting this behavior, because marshal maintains a hash table of the objects it writes. That's the worst side-effect I can think of.
We could mitigate this slightly by factoring more values into the code object's hash. Three come immediately to mind:

- the filename (`co_filename`)
- the first line number (`co_firstlineno`)
- the position information (`co_positions`, somehow)

With only the first two, you'd still have hash collisions from code objects defined using lambdas all on the same line:
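For instance (an illustrative sketch; any bodies that compile to different bytecode on one line would do):

```python
# Same file, same line, same name/consts/names/varnames/arg counts;
# only the bytecode differs, so co_filename and co_firstlineno can't help.
f, g, h = (lambda x: x + x), (lambda x: x - x), (lambda x: x * x)
print(hash(f.__code__), hash(g.__code__), hash(h.__code__))
```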
As unlikely as it is that someone would stumble over this scenario, it's even less likely that it would cause a problem. But decreasing the likelihood of hash value collisions seems so wholesome and good, I think it's worth pursuing.
Linked PRs