python-cachier / cachier

Persistent, stale-free, local and cross-machine caching for Python functions.
MIT License
542 stars 60 forks source link

update default args+kwargs hashing #183

Closed Borda closed 8 months ago

Borda commented 8 months ago

resolves #136

codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (7039391) 97.85% compared to head (5cf4c12) 97.84%.

Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/python-cachier/cachier/pull/183/graphs/tree.svg?width=650&height=150&src=pr&token=fhsTDs7HL9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier)](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier) ```diff @@ Coverage Diff @@ ## master #183 +/- ## ========================================== - Coverage 97.85% 97.84% -0.02% ========================================== Files 6 6 Lines 514 511 -3 Branches 96 95 -1 ========================================== - Hits 503 500 -3 Misses 10 10 Partials 1 1 ``` | [Files](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier) | Coverage Δ | | |---|---|---| | [cachier/core.py](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier#diff-Y2FjaGllci9jb3JlLnB5) | `100.00% <100.00%> (ø)` | | ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier). Last update [7039391...5cf4c12](https://app.codecov.io/gh/python-cachier/cachier/pull/183?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-cachier).
Borda commented 8 months ago

I may see what you mean, but not sure, so would it be better to move it to the default function so no effect for users who already have their own hashing? also, the hash may reflect the initial data type; what do you think?

shaypal5 commented 8 months ago

Regarding your first suggestion: No, that's not enough.

Regarding your second on: Ok, that interesting. If you append each tuple-from-list with a custom marker object, it will always have a different hash.

I mean that real tuples get hashed as: (1, 'b').

But the a list of [1, 'b'] should be converted so:

class ListTupleMarker:
    pass

LIST_TUPLE_MARKER = ListTupleMarker()

# internally this is (1, 'b', LIST_TUPLE_MARKER)
list_as_tuple = (*as_list, LIST_TUPLE_MARKER)
Borda commented 8 months ago

The method that uses pickling to serialize the arguments will indeed preserve argument types because pickling captures the data structure along with its type information. When you pickle a Python object, the resulting byte stream fully encodes the object's structure and the types of its contents.

Therefore, when you serialize args and kwargs using pickle.dumps(), each serialized form represents the combination of values and their types. When you subsequently generate a hash from this serialized form, the hash is implicitly sensitive not only to the values but also to their types.

Here's an example demonstrating this:

import pickle
import hashlib

# Integer and string with same "value"
arg_int = 1
arg_str = '1'

# Serialize each argument
serialized_int = pickle.dumps(arg_int)
serialized_str = pickle.dumps(arg_str)

# They produce different byte streams
assert serialized_int != serialized_str

# So their hashes will also be different
hash_int = hashlib.sha256(serialized_int).hexdigest()
hash_str = hashlib.sha256(serialized_str).hexdigest()

assert hash_int != hash_str

print(f'Hash for integer 1: {hash_int}')
print(f'Hash for string "1": {hash_str}')
Borda commented 8 months ago

@shaypal5, mind check it now? :flags: