uniphil / SuperHash

Hash Anything

Object Hashes are Always Unique #2

Open uniphil opened 11 years ago

uniphil commented 11 years ago

Python can, by default, hash instances of any class that subclasses object. Each instance gets an identity-based hash that stays fixed for the session and differs from that of every other live instance, which conflicts with the behavior of superhash as prescribed by its current unit tests.
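To make the default behavior concrete, here is a minimal sketch. The `Params` class is a hypothetical example, not part of superhash, and the "different hash" observation relies on CPython deriving the default hash from object identity.

```python
class Params:
    """Plain class with no __eq__ or __hash__ overrides (hypothetical example)."""

    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta


a = Params(1, 2)
b = Params(1, 2)

# The two instances carry the same data...
same_content = vars(a) == vars(b)

# ...but the default hash is identity-based, so they hash differently
# while both objects are alive (CPython behavior).
same_hash = hash(a) == hash(b)

# Mutating an attribute does not change the default hash.
hash_before = hash(a)
a.alpha = 99
hash_after = hash(a)
```

Here `same_content` is true while `same_hash` is false, and `hash_before` equals `hash_after`.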

This is a design problem as much as anything. Should two instances of a class that look completely identical hash the same way?

SHOULD NOT (current behavior, and why the unit tests are failing): Identical hashes imply a single identity. For example, any hashable object can be used as a dictionary key, and if two different instances shared a hash and also compared equal, either could access the dictionary value. They're identical, so maybe this has a use, but it does not seem like expected behavior. Python's default hashing of objects gives each object its own hash, regardless of its properties (and, indeed, the hash does not change when those properties change).
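The dictionary-key concern can be sketched as follows. Both classes are hypothetical illustrations. Note that a dict lookup checks the hash first and then `==`, so a shared hash alone is not enough for two instances to act as the same key.

```python
class IdentityKey:
    """Uses Python's default identity-based __eq__ and __hash__."""

    def __init__(self, value):
        self.value = value


class ContentKey:
    """Defines equality and hashing from content (hypothetical example)."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, ContentKey) and self.value == other.value

    def __hash__(self):
        return hash(("ContentKey", self.value))


# With identity semantics, a look-alike instance cannot reach the stored value.
identity_table = {IdentityKey("x"): "stored"}
identity_hit = IdentityKey("x") in identity_table

# With content semantics, any equal instance reaches the same entry.
content_table = {ContentKey("x"): "stored"}
content_hit = content_table.get(ContentKey("x"))
```

With identity semantics the lookup misses; with content semantics a second, equal instance retrieves `"stored"`.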

SHOULD: I wrote this library initially because I wanted a simple way to implement file-based caching. I could save some expensive computations for a set of input parameters in a file named after the hash of a [object/dict/list/whatever] of the input parameters, and then for every new set of parameters I could first check to see if an identical set had already been computed or not by looking at the hash.
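A rough sketch of that caching idea, under stated assumptions: `content_key` and `cached_compute` are hypothetical names, and the key here is built with `hashlib` and `json` rather than with superhash itself, so it only handles JSON-serializable parameters.

```python
import hashlib
import json
import os
import tempfile


def content_key(params):
    """Hypothetical stand-in for a content hash of the input parameters."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_compute(params, compute, cache_dir):
    """Return a cached result if one exists for these params, else compute and store it."""
    path = os.path.join(cache_dir, content_key(params) + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute(params)
    with open(path, "w") as f:
        json.dump(result, f)
    return result


calls = []


def expensive(params):
    """Placeholder for an expensive computation; records each real invocation."""
    calls.append(params)
    return sum(params["values"]) * params["scale"]


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_compute({"values": [1, 2, 3], "scale": 2}, expensive, cache_dir)
    # Same content, different key order: served from the cache file.
    second = cached_compute({"scale": 2, "values": [1, 2, 3]}, expensive, cache_dir)
```

The second call never reaches `expensive`, which is only possible because the key depends on the parameters' content and not on the identity of the dict that holds them.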

Should a hash be consistent with instance identity, or with data content?

Most likely I'll implement both modes and make the SHOULD NOT version the default, to be consistent with Python's built-in behavior.

jonathaneunice commented 11 years ago

I have been struggling with this as well. superhash() does not obey x == y => hash(x) == hash(y), which is the putative goal in Python. The notion of equality is slippery, however, and never perfectly implementable. Like hash(), superhash() comes close, but it obeys slightly different rules for equality than pure ==. For my application, "same contents as" is the right metric, and happily I only need to know about instance equality, rather than tackling the harder, more metaphysical questions such as class equality.
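As an illustration of how a content hash can follow slightly different equality rules than ==, here is a small sketch. `content_digest` is a hypothetical function invented for this example; it is not a description of how superhash() actually works.

```python
import hashlib

# Built-in behavior: values that compare equal hash equally, even across types.
builtin_invariant_holds = (1 == 1.0) and (hash(1) == hash(1.0))


def content_digest(obj):
    """Hypothetical content hash that tags the value with its type name."""
    tagged = f"{type(obj).__name__}:{obj!r}"
    return hashlib.sha256(tagged.encode("utf-8")).hexdigest()


# A type-aware content hash treats 1 and 1.0 as different contents,
# so x == y no longer implies equal digests.
digests_differ = content_digest(1) != content_digest(1.0)

# Identical content of the same type still produces the same digest.
digests_match = content_digest([1, 2]) == content_digest([1, 2])
```

Whether 1 and 1.0 should count as the "same contents" is exactly the kind of rule a content-based design has to pick for itself.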

So I vote: CONTENT, not identity.

uniphil commented 11 years ago

Right. Yeah, now I'm leaning toward content... it seems like the most likely case where superhash() is useful, and writing that down gives a general design rule to follow, which is nice. The current implementation really focuses on content as well.