omry / omegaconf

Flexible Python configuration system. The last one you will ever need.
BSD 3-Clause "New" or "Revised" License
1.94k stars 105 forks source link

Unstable hash for different data input order #1049

Closed Pedrexus closed 1 year ago

Pedrexus commented 1 year ago

Describe the bug The hash of a DictConfig is not stable. hash(cfg) changes although python dict comparison returns true.

The problem for me is that when reading from a yaml file, different runs of the same script produce different hashes.

To Reproduce

from omegaconf import DictConfig

d1 = DictConfig({"a": 1, "b": 2})
d2 = DictConfig({"b": 2, "a": 1})

d1 == d2, hash(d1) == hash(d2)
# (True, False)

It seems, the hash magic method is defined in a way that is not stable for different dictionary input orders:

def __hash__(self) -> int:
    return hash(str(self))

Expected behavior I think hash should be the same if dict data is the same, regardless of input order.

Not sure actually if this behavior is expected by the project authors, but it was unexpected to me.

Additional context The solution I'm using now is:

from omegaconf import OmegaConf
import hashlib
import json

d1 = DictConfig({"a": 1, "b": 2})
d1_hash = hashlib.sha1(json.dumps(OmegaConf.to_container(d1), sort_keys=True).encode()).hexdigest()
Jasha10 commented 1 year ago

I think hash should be the same if dict data is the same, regardless of input order.

To push back on this, The order of keys in a python dict is part of the dict's api. For example, calling list({"a": 1, "b": 2}.items()) gives a different result from calling list({"b": 2, "a": 1}.items()). For this reason, I think it makes sense for the hash of objects to be different if the order of keys is different.

Pedrexus commented 1 year ago

I understand. Thanks for the reply. It's fine to leave it as a is, but I'd just point out that this only seems to be stable for python>=3.7, so in older python versions you could actually see the same object have different hashes upon a read event or such.

Jasha10 commented 1 year ago

OmegaConf is soon dropping support for python3.6 anyway. (https://github.com/omry/omegaconf/issues/791).

omry commented 1 year ago

It seems correct that hash of equal DictConfigs is the same if the objects are considered equal by ==. This is very low pri because honestly, I am not sure what reason you have to hash those in the first place :). I would not object to a pull request fixing it (maybe by using a variant of str that is sorting the keys from hash, or alternatively just make a proper faster implementation of hash).