nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Centralize object hashing and provide a mechanism for types to register a hash #626

Open effigies opened 1 year ago

effigies commented 1 year ago

Right now we have hashing split up in a few places:

https://github.com/nipype/pydra/blob/b5fe4c0eb7f937e70db15bc087d86fe90f401ff3/pydra/engine/helpers.py#L677-L708

https://github.com/nipype/pydra/blob/b5fe4c0eb7f937e70db15bc087d86fe90f401ff3/pydra/engine/helpers.py#L672-L674

https://github.com/nipype/pydra/blob/b5fe4c0eb7f937e70db15bc087d86fe90f401ff3/pydra/engine/helpers_file.py#L70-L168

An alternative approach could be to use functools.singledispatch:

import functools
from hashlib import sha256

@functools.singledispatch
def hash_obj(obj: object) -> bytes:
    # Fallback for generic objects with __dict__: hash the class and each attribute recursively
    dict_rep = ":".join(f"{key}:{hash_obj(val).hex()}" for key, val in obj.__dict__.items())
    return sha256(f"{obj.__class__}:{dict_rep}".encode()).digest()

This defines a cryptographic hash for a generic object that applies recursively. We would need some bottom types that don't have __dict__:

@hash_obj.register
def _(obj: int) -> bytes: ...

@hash_obj.register
def _(obj: str) -> bytes: ...

@hash_obj.register
def _(obj: dict) -> bytes: ...
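
As a rough illustration of how those base cases might be filled in, here is a minimal sketch. It assumes every registration returns a raw SHA-256 digest and prefixes a type tag so that `1` and `"1"` do not collide. The tag format and the order-independent dict handling are assumptions made for this example, not pydra's existing behaviour.

```python
import functools
from hashlib import sha256


@functools.singledispatch
def hash_obj(obj: object) -> bytes:
    # Fallback: hash the class name plus each attribute, recursively.
    parts = [f"{key}:{hash_obj(val).hex()}" for key, val in vars(obj).items()]
    return sha256(f"{type(obj).__qualname__}:{':'.join(parts)}".encode()).digest()


@hash_obj.register
def _(obj: int) -> bytes:
    # The "int:" tag keeps 1 distinct from the string "1".
    return sha256(f"int:{obj}".encode()).digest()


@hash_obj.register
def _(obj: str) -> bytes:
    return sha256(b"str:" + obj.encode()).digest()


@hash_obj.register
def _(obj: dict) -> bytes:
    # Hash each key/value pair, then sort the digests so that
    # insertion order does not change the result.
    items = sorted(hash_obj(k) + hash_obj(v) for k, v in obj.items())
    return sha256(b"dict:" + b"".join(items)).digest()
```

With these in place, `hash_obj({"a": 1, "b": 2})` and `hash_obj({"b": 2, "a": 1})` give the same digest, and any object whose attributes are all registered types falls through to the generic `__dict__` handler.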

Each type would then be able to declare how much is needed to uniquely identify it across instances. We could also register set() and frozenset(), whose iteration order is not guaranteed to be stable across runs, so that these known-problematic builtin types hash consistently. And then provide a means for a downstream tool to register a type with our hasher, such as:

@pydra.utils.hash_obj.register
def _(obj: MyType) -> bytes:
    ...

or

pydra.utils.register_hash(MyType, myhashfun)
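
To make the set()/frozenset() handling and the functional registration form concrete, here is a self-contained sketch. `register_hash` is the hypothetical helper named above, written as a thin wrapper over singledispatch's `register(cls, func)` form. `MyType` and `myhashfun` are placeholders, and the tagging scheme is an assumption for this example.

```python
import functools
from hashlib import sha256


@functools.singledispatch
def hash_obj(obj: object) -> bytes:
    # This sketch has no generic fallback: unregistered types fail loudly.
    raise TypeError(f"no hash registered for {type(obj).__name__}")


@hash_obj.register
def _(obj: str) -> bytes:
    return sha256(b"str:" + obj.encode()).digest()


@hash_obj.register(set)
@hash_obj.register(frozenset)
def _hash_set(obj) -> bytes:
    # Sort the element digests so iteration order cannot affect the result.
    tag = f"{type(obj).__name__}:".encode()
    return sha256(tag + b"".join(sorted(hash_obj(el) for el in obj))).digest()


def register_hash(cls, func):
    """Hypothetical helper: functional form of hash_obj.register."""
    return hash_obj.register(cls, func)


class MyType:
    def __init__(self, name: str):
        self.name = name


def myhashfun(obj: MyType) -> bytes:
    # The downstream type decides which fields identify it.
    return sha256(b"MyType:" + hash_obj(obj.name)).digest()


register_hash(MyType, myhashfun)
```

Sorting digests rather than the elements themselves avoids requiring that the elements be mutually comparable.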

effigies commented 1 year ago

Btw, I found a Stack Exchange question on exactly this issue: https://softwareengineering.stackexchange.com/questions/405243/how-to-perform-consistent-hashing-on-any-python-object-that-works-with-hash

tclose commented 1 year ago

This looks good to me