seperman / deepdiff

DeepDiff: Deep Difference and search of any Python object/data. DeepHash: Hash of any object based on its contents. Delta: Use deltas to reconstruct objects by adding deltas together.
http://zepworks.com
Other
2k stars 219 forks source link

Dealing with objects in sets. #305

Open matthewvanhoutte opened 2 years ago

matthewvanhoutte commented 2 years ago

Is your feature request related to a problem? Please describe. I am using DeepHash to hash an object with multiple nested objects which I can then compare against at a later date. The nested objects are part of sets - the objects have their own hashing function that is used by python to ensure no duplicate items are added. It seems that when using DeepHash on the same objects in a set I get two different values and seem to flipflop between the two. Any ideas spring to mind? I know that python's internal hashing is not stable and they recommend using hashlib but I have seen that deepdiff uses hashlib so a bit at a loss. The main requirement is deterministic hashing as these items serve as lookup references at a later date. An example of something I am trying to do without nesting shown below though the example reproduces the same behaviour.

class Name:

    def __init__(self,
                 fn: str = None,
                 sn: str = None,
                 ssn: str = None):

        self.fn = fn
        self.sn = sn
        self.ssn = ssn

    def __hash__(self):
        """
        Enable hashing of class for set usage.

        Returns:
            hash (str): The hash value of the object.
        """
        return hash(self.ssn)

    def __eq__(self, other):
        """
        Determine equivalence of objects.

        Args:
            other (Alias): The other object to compare - should compare
                against Alias type object

        Returns:
            equivalence (bool): True if object match based on tuple values of
                attributes, otherwise False.
        """
        return hash(self) == hash(other)

from deepdiff.deephash import DeepHash

obj1 = Name(sn='Tony',
            ssn='Tony')
obj2 = Name(fn='Tiny',
            sn='Tony',
            ssn='Tiny Tony')
obj3 = Name(sn='Tony',
            ssn='Tony')

# I use sets to drop unwanted duplicated using a hash method from class.
# This was prior to knowledge of DeepHash. Result of this should be a set with
# obj1 and obj2 of which the ordering may differ depending on hash generated.
# As we know python hash differs from process to process but remains consistent
# within the same process. I need this to remain consistent across processes.

objects = {obj1, obj2, obj3}

# When running DeepHash on 'objects' it differs from process to process.
# I get the following two hashes:
# f261026e3e51aac71fb74323b324b0313d19246031510ae8d79749bf87247050
# 93a6ceff87e2b04675ec657a78e65843d1a9477bdd37128fe83f1b4b4fbcab26
print(DeepHash(objects)[objects])

# From what I gathered in some cases the hash calculation is either done on the
# whole object plus the string i.e str:obj:xxxxxxxxx:yyyyyyyyy:zzzzzzzzz... and
# in the other case simply what I would expect str:Tony. Not sure how it all
# works with how DeepHash and if this is desired behaviour.
seperman commented 2 years ago

@matthewvanhoutte This must be a bug. Also DeepHash doesn't use __hash__. I wonder if I should change it to use __hash__ when present.

matthewvanhoutte commented 2 years ago

@seperman I am always skeptical of raising an issue - usually I have made some incorrect assumption or logical error. Adding __hash__ could be an interesting option. Would this affect the condition of deterministic hashing should the __hash__ not be deterministic in nature? Not sure if I'll be able to help but do let me know if I can.

Also for reference when raising issue I was running : Python 3.6.5 deepdiff 5.7.0

I check against v5.8.0 and seem to have the same results.

seperman commented 2 years ago

Yes if the hash is not deterministic, then the final hash won’t be deterministic either.

Sep Dehpour

On Apr 22, 2022, at 1:55 AM, matthewvanhoutte @.***> wrote:

 @seperman I am always skeptical of raising an issue - usually I have made some incorrect assumption or logical error. Adding hash could be an interesting option. Would this affect the condition of deterministic hashing should the hash not be deterministic in nature? Not sure if I'll be able to help but do let me know if I can.

Also for reference when raising issue I was running : Python 3.6.5 deepdiff 5.7.0

I check against v5.8.0 and seem to have the same results.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

matthewvanhoutte commented 2 years ago

@seperman Think I found the issue! Does not look like a bug but the workings of python. Only took an 8 hour flight to figure out that __hash__ is being called without our knowledge by deepdiff when storing the results in the hashes dictionary (a theory - happy to be disproven). Some further explanation into what I have found.

TL;DR When calling DeepHash._hash, depending on the ordering of events, the store of previously hashed items DeepHash.hashes may be altered through python calling an objects __hash__ when writing to the dictionary of hashes. Python is happy with storing objects without __hash__ method but is also happy with calling __hash__ when present to place the object in the dictionary. DeepHash._hash ends up replacing existing values if they are nested. This happens when the program navigates to base case hash values are added along the way to DeepHash.hashes but as the program navigates back up the call stack some values can be overwritten due to the behaviour of writing to dictionaries in python. All things considered I have learned something new and feel that I should fly more often! Maybe there is room for some feature development to accomodate classes that have defined a custom __hash__ as you said.

Further Explanation

Using the example provided and using 'obj1' to explain. As the hash values and counts are added to DeepHash.hashes python calls on the __hash__ method to add the values. Here is the ordering of events as I followed through the many breakpoints!

  1. Break down of object into dictionary and hashing of items at this point we are a few levels down in the stack. When hashing both Name.sn and Name.ssn with a value of Tony as str:Tony we get 3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16. All makes sense to me at this point.

  2. These values are then added each time DeepHash._hash is called to the DeepHash.hashes provided the value does not exist already. Cool, so I then found myself crawling forward until all attributes had been hashed and added.

  3. Back we pop up the stack to deal with the entire object hash value. At this point we pop out off line 488 result, counts = self._prep_obj(obj=obj, parent=parent, parents_ids=parents_ids). Cool then prepare_string_for_hashing is called and I ended up with the following value str:objName:{6956bf805412542bac5bdc388b428460a908af3b77b652c78a38774c58e64da7:bbd393a60007e5f9621b8fde442dbcf493227ef7ced9708aa743b46a88e1b49e;9a3715cfe305529c2d0c7f62049f7f3546470ee6ba30591910a0abbe596895b1:3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16;fdad7d5e2c33a1d91c2bd0d98802a39be8308c67421c7bceee7918dc5f6b8b33:3dcdf73f17a49a2fd18ed426a64354d5ed317715a2443fb5dc4adae49c128d16} which equates to the following str:objName:{'str:fn':'str:NONE';'str:ssn':'str:Tony';'sn':'Tony'}. Makes sense. Now the moment of thinking wow, I really need to be more careful in life!

  4. When hashing the prepped string I get 56748d6aff5a78c80ce9a65d5d1e6919755e27ff0e30d55d2b1bf2b15ed60353. All good, and then we add this value to DeepHash.hashes, and suddenly the value of DeepHash.hashes['Tony'] changes to this value. At first confusion but then sudden realisation that python is calling __hash__ at this point and overwriting the value.

So all in all - if a class has defined __hash__ and it is not unique then the value can be overwritten by the object hash as programme navigates up the call stack. Now why does this flipflop between the two values I presume has to do with ordering of the values in the iterable. If obj1 occurs before obj2 then we overwrite the hashes['Tony'] with the obj1 hash and use it subsequently when hashing obj2. If obj2 occurs before obj1 then we end up storing the corret hash for str:Tony and it does not get overwriten as obj2.__hash__ is unique in our example.

Example of Python behaviour

class WithHash:

    def __init__(self,
                 fn: str = None,
                 sn: str = None,
                 ssn: str = None):

        self.fn = fn
        self.sn = sn
        self.ssn = ssn

    def __hash__(self):
        """
        Enable hashing of class for set usage.

        Returns:
            hash (str): The hash value of the object.
        """
        return hash(self.ssn)

    def __eq__(self, other):
        """
        Determine equivalence of objects.

        Args:
            other (Alias): The other object to compare - should compare
                against Alias type object

        Returns:
            equivalence (bool): True if object match based on tuple values of
                attributes, otherwise False.
        """
        return hash(self) == hash(other)

# Examples with __hash__ defined.
print('Examples using a class which has __hash__ defined:')
hash1 = dict()
hash2 = dict()
obj1 = WithHash(sn='Tony', ssn='Tony')
obj2 = 'Tony'

hash1[obj1] = 'Object'
print('hash1', hash1)
hash1[obj2] = 'String'
print('hash1', hash1)

hash2[obj2] = 'String'
print('hash2', hash2)
hash2[obj1] = 'Object'
print('hash2', hash2)

class NoHash:

    def __init__(self,
                 fn: str = None,
                 sn: str = None,
                 ssn: str = None):

        self.fn = fn
        self.sn = sn
        self.ssn = ssn

# Examples without __hash__ defined.
print('', 'Examples using a class without __hash__ defined:', sep='\n')
nohash1 = dict()
nohash2 = dict()
obj1 = NoHash(sn='Tony', ssn='Tony')
obj2 = 'Tony'

nohash1[obj1] = 'Object'
print('nohash1', nohash1)
nohash1[obj2] = 'String'
print('nohash1', nohash1)

nohash2[obj2] = 'String'
print('nohash2', nohash2)
nohash2[obj1] = 'Object'
print('nohash2', nohash2)

Output from example

Examples using a class which has __hash__ defined:
hash1 {<__main__.WithHash object at 0x7f139e5cd8b0>: 'Object'}
hash1 {<__main__.WithHash object at 0x7f139e5cd8b0>: 'String'}
hash2 {'Tony': 'String'}
hash2 {'Tony': 'Object'}

Examples using a class without __hash__ defined:
nohash1 {<__main__.NoHash object at 0x7f139e4b57f0>: 'Object'}
nohash1 {<__main__.NoHash object at 0x7f139e4b57f0>: 'Object', 'Tony': 'String'}
nohash2 {'Tony': 'String'}
nohash2 {'Tony': 'String', <__main__.NoHash object at 0x7f139e4b57f0>: 'Object'}

I hope that some of the rambels made sense, let me know what you think. Also thank you for open source package!

seperman commented 2 years ago

@matthewvanhoutte This is brilliant! Thanks for diving into this. Makes me wonder how the hashes dictionary in DeepHash should handle hash collisions.

  1. One way we could handle the hash collisions is to store the objects by their id in the hashes dictionary. In fact we do that for objects that are not hashable but we could extend that to all objects.

  2. Despite the goal of DeepHash to "hash objects based on their contents", we may just need to run the __hash__ function when it exists.

What do you think?

matthewvanhoutte commented 2 years ago

@seperman I think this both are great. Would we need to consider storage for point 1 as the same content could be stored multiple times? To be honest I think it would be negligible anyways. It seems like the safest way to avoid hash collisions.

With regards to point 2 an extra parameter could deal with the need to run __hash__ if present and if the user does not want to use __hash__ then the DeepDiff object hash method is called to generate the hash value. This used in conjunction with point 1 should cover the hash collision issue, right?

seperman commented 2 years ago

@matthewvanhoutte Custom objects come with the __hash__ function but that doesn't mean the hash here provides any value for DeepHash:

>>> class A:
...     pass
...
>>> a=A()
>>> hash(a)
8774872643467

So I think we may have to make the user pass an explicit list of objects that their __hash__ needs to be called because of some custom implementation of the __hash__ function. Is that really useful? I have difficulty imagining people passing a list of their custom objects that need to be considered for this purpose.

matthewvanhoutte commented 2 years ago

@seperman Totally forgot that by default custom objects get __hash__ methods!! If I recall correctly, they do no return the same hash value when instantiated and like you say are not particularly useful DeepHash.

>>> class A:
...     pass
... 
>>> a = A()
>>> hash(a)
8778368913086
>>> b = A()
>>> hash(b)
8778368879496

I also have difficultly in imagining people passing a list of their custom objects to call or skip their custom hash method when hashing using DeepHash. It would make more sense for them to use their own method seeing as they have defined it. Should one be using DeepHash it would make more sense to me to ignore all __hash__ methods as is done already and potentially for those store based on ID like you said before to bypass the hash collision. I looked into this and found there would be some changes to the tests. I think it could cause compatibility issues for those that may retrieving content from the DeepHash result.

I do wonder if there are other cases that may result in hash collisions which we are unaware of. I suppose as an alternative we could add a warning / exception to catch the hash collision before it is overwritten. This way the user is highlighted to the fact that there is a hash collision and can deal with it in a manner that best suites them?

I also found my self in recursion depth heaven when thinking let me be smart and use DeepHash as my __hash__ method clearly forgetting that DeepHash calls __hash__ of the object when storing it in the hashes dict and I also defined my equality operator using the hash method which then resulted in further recursion. Happy to close this issue if you think there is no real advantages.