Open matthewvanhoutte opened 2 years ago
@matthewvanhoutte This must be a bug. Also DeepHash doesn't use __hash__
. I wonder if I should change it to use __hash__
when present.
@seperman I am always skeptical of raising an issue - usually I have made some incorrect assumption or logical error. Adding __hash__
could be an interesting option. Would this affect the condition of deterministic hashing should the __hash__
not be deterministic in nature? Not sure if I'll be able to help but do let me know if I can.
Also for reference when raising issue I was running : Python 3.6.5 deepdiff 5.7.0
I check against v5.8.0 and seem to have the same results.
Yes if the hash is not deterministic, then the final hash won’t be deterministic either.
Sep Dehpour
On Apr 22, 2022, at 1:55 AM, matthewvanhoutte @.***> wrote:
@seperman I am always skeptical of raising an issue - usually I have made some incorrect assumption or logical error. Adding hash could be an interesting option. Would this affect the condition of deterministic hashing should the hash not be deterministic in nature? Not sure if I'll be able to help but do let me know if I can.
Also for reference when raising issue I was running : Python 3.6.5 deepdiff 5.7.0
I check against v5.8.0 and seem to have the same results.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.
@seperman Think I found the issue! Does not look like a bug but the workings of python. Only took an 8 hour flight to figure out that __hash__
is being called without our knowledge by deepdiff when storing the results in the hashes dictionary (a theory - happy to be disproven). Some further explanation into what I have found.
TL;DR
When calling DeepHash._hash
, depending on the ordering of events, the store of previously hashed items DeepHash.hashes
may be altered through python calling an objects __hash__
when writing to the dictionary of hashes. Python is happy with storing objects without __hash__
method but is also happy with calling __hash__
when present to place the object in the dictionary. DeepHash._hash
ends up replacing existing values if they are nested. This happens when the program navigates to base case hash values are added along the way to DeepHash.hashes
but as the program navigates back up the call stack some values can be overwritten due to the behaviour of writing to dictionaries in python. All things considered I have learned something new and feel that I should fly more often! Maybe there is room for some feature development to accomodate classes that have defined a custom __hash__
as you said.
Further Explanation
Using the example provided and using 'obj1' to explain. As the hash values and counts are added to DeepHash.hashes
python calls on the __hash__
method to add the values. Here is the ordering of events as I followed through the many breakpoints!
Break down of object into dictionary and hashing of items at this point we are a few levels down in the stack. When hashing both Name.sn
and Name.ssn
with a value of Tony
as str:Tony
we get 3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16
. All makes sense to me at this point.
These values are then added each time DeepHash._hash
is called to the DeepHash.hashes
provided the value does not exist already. Cool, so I then found myself crawling forward until all attributes had been hashed and added.
Back we pop up the stack to deal with the entire object hash value. At this point we pop out off line 488 result, counts = self._prep_obj(obj=obj, parent=parent, parents_ids=parents_ids)
. Cool then prepare_string_for_hashing
is called and I ended up with the following value str:objName:{6956bf805412542bac5bdc388b428460a908af3b77b652c78a38774c58e64da7:bbd393a60007e5f9621b8fde442dbcf493227ef7ced9708aa743b46a88e1b49e;9a3715cfe305529c2d0c7f62049f7f3546470ee6ba30591910a0abbe596895b1:3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16;fdad7d5e2c33a1d91c2bd0d98802a39be8308c67421c7bceee7918dc5f6b8b33:3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16}
which equates to the following str:objName:{'str:fn':'str:NONE';'str:ssn':'str:Tony';'sn':'Tony'}
. Makes sense. Now the moment of thinking wow, I really need to be more careful in life!
When hashing the prepped string I get 56748d6aff5a78c80ce9a65d5d1e6919755e27ff0e30d55d2b1bf2b15ed60353
. All good, and then we add this value to DeepHash.hashes
, and suddenly the value of DeepHash.hashes['Tony']
changes to this value. At first confusion but then sudden realisation that python is calling __hash__
at this point and overwriting the value.
So all in all - if a class has defined __hash__
and it is not unique then the value can be overwritten by the object hash as programme navigates up the call stack. Now why does this flipflop between the two values I presume has to do with ordering of the values in the iterable. If obj1
occurs before obj2
then we overwrite the hashes['Tony']
with the obj1
hash and use it subsequently when hashing obj2
. If obj2
occurs before obj1
then we end up storing the corret hash for str:Tony
and it does not get overwriten as obj2.__hash__
is unique in our example.
Example of Python behaviour
class WithHash:
def __init__(self,
fn: str = None,
sn: str = None,
ssn: str = None):
self.fn = fn
self.sn = sn
self.ssn = ssn
def __hash__(self):
"""
Enable hashing of class for set usage.
Returns:
hash (str): The hash value of the object.
"""
return hash(self.ssn)
def __eq__(self, other):
"""
Determine equivalence of objects.
Args:
other (Alias): The other object to compare - should compare
against Alias type object
Returns:
equivalence (bool): True if object match based on tuple values of
attributes, otherwise False.
"""
return hash(self) == hash(other)
# Examples with __hash__ defined.
print('Examples using a class which has __hash__ defined:')
hash1 = dict()
hash2 = dict()
obj1 = WithHash(sn='Tony', ssn='Tony')
obj2 = 'Tony'
hash1[obj1] = 'Object'
print('hash1', hash1)
hash1[obj2] = 'String'
print('hash1', hash1)
hash2[obj2] = 'String'
print('hash2', hash2)
hash2[obj1] = 'Object'
print('hash2', hash2)
class NoHash:
def __init__(self,
fn: str = None,
sn: str = None,
ssn: str = None):
self.fn = fn
self.sn = sn
self.ssn = ssn
# Examples without __hash__ defined.
print('', 'Examples using a class without __hash__ defined:', sep='\n')
nohash1 = dict()
nohash2 = dict()
obj1 = NoHash(sn='Tony', ssn='Tony')
obj2 = 'Tony'
nohash1[obj1] = 'Object'
print('nohash1', nohash1)
nohash1[obj2] = 'String'
print('nohash1', nohash1)
nohash2[obj2] = 'String'
print('nohash2', nohash2)
nohash2[obj1] = 'Object'
print('nohash2', nohash2)
Output from example
Examples using a class which has __hash__ defined:
hash1 {<__main__.WithHash object at 0x7f139e5cd8b0>: 'Object'}
hash1 {<__main__.WithHash object at 0x7f139e5cd8b0>: 'String'}
hash2 {'Tony': 'String'}
hash2 {'Tony': 'Object'}
Examples using a class without __hash__ defined:
nohash1 {<__main__.NoHash object at 0x7f139e4b57f0>: 'Object'}
nohash1 {<__main__.NoHash object at 0x7f139e4b57f0>: 'Object', 'Tony': 'String'}
nohash2 {'Tony': 'String'}
nohash2 {'Tony': 'String', <__main__.NoHash object at 0x7f139e4b57f0>: 'Object'}
I hope that some of the rambels made sense, let me know what you think. Also thank you for open source package!
@matthewvanhoutte This is brilliant! Thanks for diving into this. Makes me wonder how the hashes
dictionary in DeepHash should handle hash collisions.
One way we could handle the hash collisions is to store the objects by their id in the hashes
dictionary. In fact we do that for objects that are not hashable but we could extend that to all objects.
Despite the goal of DeepHash to "hash objects based on their contents", we may just need to run the __hash__
function when it exists.
What do you think?
@seperman I think this both are great. Would we need to consider storage for point 1 as the same content could be stored multiple times? To be honest I think it would be negligible anyways. It seems like the safest way to avoid hash collisions.
With regards to point 2 an extra parameter could deal with the need to run __hash__
if present and if the user does not want to use __hash__
then the DeepDiff object hash method is called to generate the hash value. This used in conjunction with point 1 should cover the hash collision issue, right?
@matthewvanhoutte Custom objects come with the __hash__
function but that doesn't mean the hash here provides any value for DeepHash:
>>> class A:
... pass
...
>>> a=A()
>>> hash(a)
8774872643467
So I think we may have to make the user pass an explicit list of objects that their __hash__
needs to be called because of some custom implementation of the __hash__
function. Is that really useful? I have difficulty imagining people passing a list of their custom objects that need to be considered for this purpose.
@seperman Totally forgot that by default custom objects get __hash__
methods!! If I recall correctly, they do no return the same hash value when instantiated and like you say are not particularly useful DeepHash.
>>> class A:
... pass
...
>>> a = A()
>>> hash(a)
8778368913086
>>> b = A()
>>> hash(b)
8778368879496
I also have difficultly in imagining people passing a list of their custom objects to call or skip their custom hash method when hashing using DeepHash. It would make more sense for them to use their own method seeing as they have defined it. Should one be using DeepHash it would make more sense to me to ignore all __hash__
methods as is done already and potentially for those store based on ID like you said before to bypass the hash collision. I looked into this and found there would be some changes to the tests. I think it could cause compatibility issues for those that may retrieving content from the DeepHash result.
I do wonder if there are other cases that may result in hash collisions which we are unaware of. I suppose as an alternative we could add a warning / exception to catch the hash collision before it is overwritten. This way the user is highlighted to the fact that there is a hash collision and can deal with it in a manner that best suites them?
I also found my self in recursion depth heaven when thinking let me be smart and use DeepHash as my __hash__
method clearly forgetting that DeepHash calls __hash__
of the object when storing it in the hashes dict and I also defined my equality operator using the hash method which then resulted in further recursion. Happy to close this issue if you think there is no real advantages.
Is your feature request related to a problem? Please describe. I am using DeepHash to hash an object with multiple nested objects which I can then compare against at a later date. The nested objects are part of sets - the objects have their own hashing function that is used by python to ensure no duplicate items are added. It seems that when using DeepHash on the same objects in a set I get two different values and seem to flipflop between the two. Any ideas spring to mind? I know that python's internal hashing is not stable and they recommend using
hashlib
but I have seen thatdeepdiff
useshashlib
so a bit at a loss. The main requirement is deterministic hashing as these items serve as lookup references at a later date. An example of something I am trying to do without nesting shown below though the example reproduces the same behaviour.