uqfoundation / dill

serialize all of Python
http://dill.rtfd.io

function pickles are only quasi-deterministic, does this matter? #19

Open stepleton-xx opened 10 years ago

stepleton-xx commented 10 years ago

Pickles of functions (and pickles of things that contain functions, like classes) are not quite deterministic: they depend on the iteration order of _reverse_typemap inside dill.py. Depending on the order, either the symbol "LambdaType" or "FunctionType" will be used to represent functions. Either will work as far as unpickling goes, but having different representations of the same value can cause trouble with e.g. caching.

While most invocations of Python 2.x yield the same iteration order for _reverse_typemap, use of the -R flag (recommended for user-facing services; cf. http://www.ocert.org/advisories/ocert-2011-003.html) randomizes this order.

Note that the functionality of -R is on by default for versions >= 3.3: http://docs.python.org/3/whatsnew/3.3.html

mmckerns commented 10 years ago

@stepleton: In my most recent commit, I did notice that the ordering of _typemap is different in 3.x [where x = 3, and possibly 2] versus older versions, including the entire 2.x line. However, in some limited testing, I didn't experience any failures to pick up LambdaType versus FunctionType, for example; I didn't stress test it, though, to make sure it absolutely wasn't a problem. _reverse_typemap, as is, should include both LambdaType and FunctionType (and the other three cases where there are two identical types). _typemap has the types themselves as the keys, so there will be four fewer items in _typemap than in _reverse_typemap. What does that mean? Does it cover the issue?
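
A quick way to check this in a live session (dill's _reverse_typemap is a private attribute, and its exact location and contents vary by dill and Python version, so treat this as a sketch):

>>> import dill, types
>>> types.LambdaType is types.FunctionType        # same underlying type in CPython
True
>>> rmap = dill._dill._reverse_typemap            # older dill versions keep it in dill.dill
>>> rmap['LambdaType'] is rmap['FunctionType']
True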

@register can only register the type, so it's equivalent to @register(FunctionType) or @register(LambdaType), I believe. The pickled object shouldn't know what type it is, so as long as both "LambdaType" and "FunctionType" map (via _reverse_typemap) to the same type, the lookup using the object as the key should uncover one of the two names (LambdaType or FunctionType), depending on the possibly random order in which the dict was populated.

That was my thinking in declaring it "not a problem".

Do you have any test cases for dill actually failing here? Not whether _typemap is deterministic (because, as you say, it is only quasi-deterministic in 3.3+), but a case where dill fails to correctly dump/load a type that it should know. Something like: save a "lambda" to a (session) file, then try loading it 1000 times -- if it fails once or more, then there's a problem. Is this something like the case you had in mind?

asmeurer commented 10 years ago

You can make the hash randomization deterministic for testing purposes by setting the PYTHONHASHSEED environment variable to a positive integer.
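
For instance (a minimal sketch, not from this thread), pinning the seed for child interpreters makes string hashes, and hence any hash-order-dependent iteration, repeatable across runs:

import os, subprocess, sys

cmd = [sys.executable, "-c", "print(hash('FunctionType'))"]
env = dict(os.environ, PYTHONHASHSEED="0")   # "0" disables randomization; any fixed integer also works
for _ in range(3):
    # prints the same value three times; without the env var, 3.3+ typically prints different values
    print(subprocess.run(cmd, env=env, capture_output=True, text=True).stdout.strip())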

davips commented 2 years ago

Any news regarding providing determinism for dill? I am using it only for hashing, so is there any workaround?

anivegesana commented 2 years ago

Are you using Python 3.7+? Because dictionaries now preserve insertion order, the type dictionary should be deterministic for any particular version of Python after 3.6, and this issue no longer applies. Since dill 0.3.6 will no longer support Python 2.7, this issue should no longer be relevant, if I am not mistaken.

davips commented 2 years ago

Thanks for the follow-up. It still gives different hashes in python 3.8.10, despite being deterministic if we start a new execution:

In [3]: import dill as d; from hashlib import md5; f = lambda x: x + 1; dum = d.dumps(f); md5(dum).hexdigest()
Out[3]: 'bbe58ab30aec9b1feb80cd3e4b2b7024'

In [4]: import dill as d; from hashlib import md5; f = lambda x: x + 1; dum = d.dumps(f); md5(dum).hexdigest()
Out[4]: '3fc384d6edf117061f781113a60fd053'
mmckerns commented 2 years ago

I'm not seeing the same behavior in 3.8.13.

Python 3.8.13 (default, May 10 2022, 11:26:38) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> dill.settings['recurse'] = True
>>> from hashlib import md5
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'65e4cf11903c322b610b68ddb7c71283'
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'65e4cf11903c322b610b68ddb7c71283'
>>> 
>>> dill.settings['recurse'] = False
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
davips commented 2 years ago

The difference in our outputs is not due to the Python version, but due to how different consoles handle the context (it seems). If I use python instead of ipython it works, but if I put everything inside the same file and run it as a script, it doesn't work anymore. I've seen somewhere in the past something about a global dict that stores the module's functions, and that changes the content of what is pickled. AFAICR, the problem is the same if you have a file with other functions defined, not necessarily a repetition. E.g.:

File A: def f(); def h()

File B: def g(); def h()

Python 3.8.10 (default, Mar 15 2022, 12:22:08) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill; from hashlib import md5; dill.settings['recurse'] = True; f = lambda x:x+1; dum = dill.dumps(f); dill.settings['recurse'] = False; f = lambda x:x+1; dum = dill.dumps(f); md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>> import dill; from hashlib import md5; dill.settings['recurse'] = True; f = lambda x:x+1; dum = dill.dumps(f); dill.settings['recurse'] = False; f = lambda x:x+1; dum = dill.dumps(f); md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>> 
mmckerns commented 2 years ago

I see the behavior in IPython

Python 3.8.13 (default, May 10 2022, 11:26:38) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import dill

In [2]: from hashlib import md5

In [3]: f = lambda x:x+1

In [4]: dum = dill.dumps(f)

In [5]: md5(dum).hexdigest()
Out[5]: '2a7b752183b27084ae5b3806d6b6fae3'

In [6]: f = lambda x:x+1

In [7]: dum = dill.dumps(f)

In [8]: md5(dum).hexdigest()
Out[8]: 'f5b16c000549792da22973e74cf4dd3a'
mmckerns commented 2 years ago

Looks like IPython is doing something to modify the function, or one of its dependencies (like globals), so I agree, it's probably a contextual issue.

anivegesana commented 2 years ago

A couple of other Python libraries (like huggingface datasets) depend on persistent hashing of dill pickles, so it is something that would affect at least one popular Python library in production.

What huggingface does is sort the names in the globals dictionary, and that mostly works for their purposes. I am planning to rewrite globals handling in #466, so I am trying to make it easy to add custom processing of global dictionaries outside of dill. That would, at a minimum, benefit the folks over at huggingface.

As huggingface noticed, another big thing holding back the determinism of dill pickles is the line numbers, filename, and other environmental information stored in code objects. We could add a setting to dill to turn these features off and sort the globals by name.
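
A rough sketch of those two normalizations (a hypothetical helper, not dill's or huggingface's actual code; assumes Python 3.8+ for code.replace): strip location information from the code object and pin a sorted, minimal globals dict before pickling.

from types import FunctionType
import dill

def normalized_copy(func):
    # drop file/line information that varies between environments
    code = func.__code__.replace(co_filename="<stable>", co_firstlineno=1)
    # keep only the globals the function actually references, in sorted name order
    names = sorted(n for n in code.co_names if n in func.__globals__)
    new_globals = {n: func.__globals__[n] for n in names}
    return FunctionType(code, new_globals, func.__name__, func.__defaults__, func.__closure__)

f = lambda x: x + 1
dill.dumps(normalized_copy(f))   # more stable across files and sessions, though still not guaranteed deterministic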

anivegesana commented 2 years ago

@davips For a temporary fix if you are still using dill 0.3.4 (they haven't updated the code for dill 0.3.5 yet), you can just snag the code from huggingface datasets (either import their pickler or just copy the code out of their repo.) I will be creating a more sustainable solution that is built into dill.

https://github.com/huggingface/datasets/blob/95193ae61e92aa537d0c65d37a1fd9d2393aae89/src/datasets/utils/py_utils.py#L416-L702

!pip install datasets
from hashlib import md5
import io

import dill
from datasets.utils.py_utils import Pickler

def dumps(obj, **kwds):
    file = io.BytesIO()
    Pickler(file, **kwds).dump(obj)
    return file.getvalue()
>>> md5(dill.dumps(lambda: 'q')).hexdigest()
d6cd2da806a4384c10bbea6a2c496b3c
>>> md5(dill.dumps(lambda: 'q')).hexdigest()
1d5a198b1ffbe2c3e3118c0ff6a6eb3b
>>> md5(dumps(lambda: 'q')).hexdigest()
211cbff14077367c0c870d82a56f94bb
>>> md5(dumps(lambda: 'q')).hexdigest()
211cbff14077367c0c870d82a56f94bb
davips commented 2 years ago

Thanks very much, @anivegesana! It seems to work fine this way and should enable resuming some interesting projects here... 🥇

anivegesana commented 2 years ago

As far as I am aware, sets (and frozensets) are nondeterministic but are also unsortable in some cases:

>>> sorted({object(), object()})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'object' and 'object'

The best that can be done is to either continue to produce nondeterministic pickles only when the objects in the set are incomparable or to produce nondeterministic pickles for sets no matter what.

davips commented 2 years ago

Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.

mmckerns commented 2 years ago

you can sort anything if you give a sort key that provides a rule for how to sort.

>>> sorted({object(), object()}, key=lambda x:id(x))
[<object object at 0x10ca54560>, <object object at 0x10ca545d0>]
davips commented 2 years ago

Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.

Unfortunately, the stability of set ordering is restricted to a few cases (besides not being official), and that doesn't include bytes. https://stackoverflow.com/a/45589593/9681577 https://stackoverflow.com/a/58330556/9681577

The order will be lost after puting the serialized items inside a set again, so it would need a custom class for sets that implements a __reduce__ using a list.

I can't get into dill internals now, but if it is a recursive function, it seems easy:

def dump(obj):
    if isinstance(obj, set):
        # sort the already-pickled items (bytes objects) to get a stable order
        pickled_items = sorted(dump(item) for item in obj)
        return dump(CustomSet(pickled_items))

    # ... handle other types ...
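
The CustomSet above is left undefined; a hypothetical, self-contained variant of the same idea (my sketch, folding the per-item pickling into the class itself) could sort the members by their own pickled bytes inside __reduce__, so the pickle stream no longer depends on set iteration order:

import dill

class CustomSet(set):
    def __reduce__(self):
        # emit the members as a list sorted by their pickled bytes; assumes every member is picklable
        return (CustomSet, (sorted(self, key=dill.dumps),))
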
anivegesana commented 2 years ago

I didn't take a deep look into it, but I would be against using this to solve the problem since it would have O(n^2) time and space complexity for nestings that look like this:

class A:
    def __init__(self, a=None):
        self.a = a

def b(o, N):
    for i in range(N):
        o = {A(o)}
    return o

Even worse, this causes infinite recursion:

a = A(5)
s = {a}
a.a = s

Even for less pathological cases, I would imagine that that solution could be very slow. I didn't have time yet to think about the average case, but I have a hunch that it is just as bad.

davips commented 2 years ago

Do you say O(n^2) because the pickled items will be repickled? I'm not a specialist in pickle, but I remember that at least protocol 5 does something clever with bytes objects (or numpy arrays, as I recall).

Anyway, there are other solutions. For instance, handling sorting case by case is a solution, but some fallback is needed for the cases not addressed (e.g. user classes?). I also don't like a costly fallback, but if that's the only option at the moment to ensure determinism... the user could choose whether to enforce determinism "at all costs" or just for the easier cases. Something like the modes/flags below (just some thoughts here, don't take them literally):

davips commented 2 years ago

About the recursion, is dill.settings['recurse']=True necessary for determinism? Sorry if this is a stupid question.

anivegesana commented 2 years ago

Yeah. In that example, in order to pickle b(4), you would need to pickle b(3) once for sorting and once again for the real pickle, b(2) once in calculating a hash for b(3) in sorting for b(4), once again in calculating b(3) itself, and once again in the final pickle, etc. The reason that we cannot use dynamic programming/caching to do one pass is because the memo (the cache that keeps track of all objects so far) changes after pickling each object and the order in the memo would change based on the order in which objects are pickled, so the memo would become invalid. On top of that, there is no way to rewind the pickler, so we would have to create copies of the pickler that have different memos, but dump their output to a BytesIO. It is a bit of a mess and I am not sure if it would be able to save medium to large objects consisting of sets and dictionaries in a reasonable amount of time.

I also found that there is a difference between hashability (needed to put an object in a dictionary or set) and comparability (needed to sort it), so this same issue applies to dictionaries in general as well:

>>> k1 = object()
>>> k2 = object()
>>> {k1: 7, k2: 4}
{<object object at 0x104d8ce40>: 7, <object object at 0x104d8ce30>: 4}

To avoid complexity, I think we should attempt to sort dictionaries, sets, and frozensets and fall back to not changing them if an incomparable object is present. We can emit a warning if this happens so it is not completely silent, which would be equivalent to a mode that lets you switch between "normal" and "permissive" in your design. If someone wants "restrictive", they could use warnings filters to turn the warning into an error. If you have any other ideas for "full determinism," I'll implement it if it is feasible. (I didn't notice that for "full determinism" you wanted 100% infallible determinism instead of more costly operations that result in more stability. Scroll down to a later comment to see why this is mathematically infeasible for general Python objects.)
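
A minimal sketch of that fallback behavior (hypothetical, not dill's actual implementation): try to sort, and warn but keep the native order when the elements are incomparable.

import warnings

def stable_order(elements):
    try:
        return sorted(elements)                  # "permissive": sort when possible
    except TypeError:
        warnings.warn("incomparable elements; pickle may be nondeterministic")
        return list(elements)                    # fall back to the native (nondeterministic) order

# a user wanting "restrictive" behavior could escalate the warning to an error:
# warnings.filterwarnings("error", message="incomparable elements.*")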

Determinism would be a separate feature from the recursive-function setting. You would be able to use determinism to create deterministic pickles of sets and dicts.

The feature that you are thinking of in pickle5 is being able to use out-of-band buffers. Unfortunately, I believe that if we were to use those in dill, it would prevent users of the library from using them, because there can only be one hook for out-of-band data in pickle5.

anivegesana commented 2 years ago

you can sort anything if you give a sort key that provides a rule for how to sort.

>>> sorted({object(), object()}, key=lambda x:id(x))
[<object object at 0x10ca54560>, <object object at 0x10ca545d0>]

Yes, but the id refers to a unique identifier that doesn't change during the lifetime of the interpreter. Unpickling will create new objects, which will have different ids. In CPython, the id is just the memory location. Unfortunately, all of this makes it impossible to use id alone to create deterministic pickles.

anivegesana commented 2 years ago

Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.

Fixing the pickle in post might be a reasonable approach to solve this issue. Unfortunately, the objects in the sets could be (and likely are) built from other objects in the pickle file. This makes it difficult because we would need to determine which objects depend on which other objects, build a graph, and perform some operations on the graph to create a deterministic ordering. This is at least as hard as the graph isomorphism problem (which is in NP and not known to be solvable in polynomial time), because graph isomorphism reduces to it: if we were able to create a truly 100% deterministic pickle, we could test whether two pickles are byte-equal and thereby get a 100% deterministic equality operator that checks that all of the children, grandchildren, etc. of the objects are also the same, i.e. isomorphism.

Nonetheless, I am sure research papers on this topic exist that come up with very reasonable heuristics for solving this problem in polynomial time for most use cases, but it will require a significant amount of time investment to implement. At minimum, it would create pickles that are as good as "permissive/restrictive" mode, so it could be a later addition.

mmckerns commented 2 years ago

I'm not suggesting using id. The point was that you can sort object instances, given that you pick a key that allows them to be ordered. FYI, you may also want to check out one of my other codes, klepto, which can use dill for generating hashes; lots of alternate encoding schemes are possible. If nothing else, it could give you more test cases.

anivegesana commented 2 years ago

Cool. In a vein of thought that connects this to graph isomorphism, I wonder if there is a way to use Weisfeiler-Lehman as the comparison operator between Python objects. It would be interesting, might be a good way to combine all of the thoughts we have here, and might be efficient with a little bit of additional caching of results. It would be a rather significant undertaking, but we should try it in the future.

anivegesana commented 2 years ago

Alright. That explanation was a little bit mathy, so here is a brief description of what I considered doing.

It would have been really cool to implement the method @davips suggested (which I converted into a graph problem so that we could canonize the graph in breadth-first order instead of depth-first order, so that early termination results in a partial canonization of the pickle, i.e. it is closer to a deterministic pickle after each step of the graph algorithm, whereas the recursive method wouldn't make progress until arbitrary lookahead), but it would either have significant performance overhead (for DF) or would require a rather confusing implementation to convert the pickle into a graph (for BF), so I feel like this problem is best solved by either an external library or a later extension to the dill library.

I feel that assuming all elements are comparable is reasonable and covers most use cases of this feature, and that the general (incomparable) case would be much rarer. Most Python data structures are homogeneous, and the much more complex case isn't used directly by either pulumi or huggingface. Although end users could use both libraries with data structures that have incomparable members, neither library implements fixes to the problem at large, so it may not even be a feature that would be frequently used.

anivegesana commented 2 years ago

@davips @AaronFriel @albertvillanova @lhoestq @mmckerns

For now, this is what I have chosen for the behavior of "deterministic" mode:

Let me know if any of these design decisions need to be changed or if this is good general behavior that approximates determinism well.

davips commented 2 years ago

Sounds wise. Will it be possible to know if the generated dump is nondeterministic (via warning or in any way)?

Another question, possibly stupid: is it possible for the user to provide a custom pickling function for dill just like CustomJSONEncoder? From what I've seen here, it seems impossible for some reason.

anivegesana commented 2 years ago

Unfortunately, it is not always possible to know if the generated dump is nondeterministic. Classes that other people define could be nondeterministic themselves and it would be impossible for dill to tell if that was the case. I am planning on putting warnings for set, frozenset, and collections.Counter, so those cases should be covered. It just might not catch everything in general.

The StackOverflow post says that it is possible and gives a way to do it.

import dill

# following the pattern from the StackOverflow answer linked above
def recreate_A(a):
    return A(a)

@dill.register(A)
def save_A(pickler, obj):
    pickler.save_reduce(recreate_A, (obj.a,), obj=obj)
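
A quick usage sketch of the registration above (assuming the class A from the earlier example; the values are hypothetical):

a = A(42)
data = dill.dumps(a)          # now routed through save_A
restored = dill.loads(data)   # rebuilt via recreate_A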

The "no" was for modifying dill._objects, which probably won't change how dill pickles objects.

lhoestq commented 2 years ago

Let me know if any of these design decisions need to be changed or if this is good general behavior that approximates determinism well.

It sounds good to me :) I agree it would be nice to have a more robust way than overriding save_code or save_function to ignore the line numbers. The solution for the globals dictionary sounds good as well!

anivegesana commented 2 years ago

I agree it would be nice to have a more robust way than overriding save_code or save_function to ignore the line numbers. The solution for the globals dictionary sounds good as well!

I am currently planning on adding a hook (patch_code_object or something similarly named) that will allow for the modification of code objects before they are actually pickled. Unfortunately, that would require huggingface to modify patch_code_object whenever Python makes a change to code objects that affects line numbers. Thankfully, that is relatively rare, but it does make this interface a little inconvenient. I do not have any better alternatives because others might want to retain line numbers. I am open to suggestions.

anivegesana commented 2 years ago

@leogama Since the set and frozenset changes would conflict with #485, I will leave them out. I can add them in later, after we get a working cPickle implementation, to prevent issues with compatibility, potentially dropping deterministic pickling of sets if it turns out to be infeasible to do in cPickle.

leogama commented 2 years ago

For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:

class BasePickler:  # metaclass? ABC?
    # implements most methods from the current 'dill.Pickler'
    # 'dumps', 'loads', 'copy' and 'dump_session' can be implemented as class methods
    ...

class Pickler(BasePickler, pickle._Pickler):
    # keep the pyPickler as the "standard" for backward compatibility with modifications of '_Pickler.dispatch' by users
    # can be used for a 'dill.hash' function using the "deterministic" mode (including sets)
    ...

class cPickler(BasePickler, pickle.Pickler):
    # use this for the helper functions 'dill.dump', 'dill.dumps' and 'dill.copy', also 'dill.dump_session':
    # 'dill.dump' is implemented as usual, as it is not used for hashing (I think) and clashes with the method
    # 'dill.dumps', etc.:  dumps = cPickler.dumps
    ...

With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only. If they want to calculate the hash and use the pickle of the same object, they should use Pickler.dumps(obj, deterministic=True) and calculate the hash by themselves.

leogama commented 2 years ago

Unrelated questions about hashing @anivegesana, if you have seen the code of packages that use dill for it:

  • Which pickle protocol do they use for hashing? Anything above 3 just adds framing and offloading capabilities and increases the pickle size significantly for small objects...
  • Do they use memoization or optimization?
  • Also fix_imports should be off

AaronFriel commented 2 years ago

@leogama chiming in here; here's how Pulumi uses it when we serialize functions (a Python-specific capability):

https://github.com/pulumi/pulumi/blob/bcd4ee9909d396b04ad918fff7e140bd854eeb28/sdk/python/lib/pulumi/dynamic/dynamic.py#L212-L247

We're not too concerned with the size of the blob, but we use the default pickle protocol, which is 4 in the version of Python we're using.
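
For reference (standard-library behavior, shown only for context), the interpreter's default protocol can be checked directly:

>>> import pickle
>>> pickle.DEFAULT_PROTOCOL   # 4 on the Python 3.8/3.9 interpreters discussed in this thread
4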

davips commented 2 years ago

@leogama Thinking from hash-only perspective is a good idea for some cases. Recently, I used decompyle3 + pickle_protocol5 to hash functions and that is just impractical to keep working across new python versions.

anivegesana commented 2 years ago

@leogama

For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:

Unfortunately, that might not be a good idea because there are many, many different types of hashing algorithms, optimized for many different things (speed, mathematical properties, security, etc.), so we should not force people to use the same hash function. I don't know how this would fix the set problem, but it would allow us to fix the code problem since we could select what we want to hash. This sort of idea might be more practical for an optimizer rather than a pickler.

With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only.

huggingface uses the hash and the pickle, which would require double computation. Also, it is probably good to require only minimal changes to downstream libraries; they shouldn't have to rewrite their hashing/caching logic for us.

  • Which pickle protocol do they use for hashing? Anything above 3 just adds framing and offloading capabilities and increases the pickle size significantly for small objects...

I do not think that size has much to do with this entire process, since hashes are fixed length, but I think davips perhaps answered the second part of this question.

  • Do they use memoization or optimization?

There are multiple senses of the word memoization. huggingface caches (memoizes) entire machine learning datasets using PyArrow and a "deterministic" fork of dill. I am assuming that is what you mean, since memoization in the Python pickler has to be on to do anything useful (almost every data structure is recursive); it is mostly a historical note in the development of pickle.

None of the libraries use optimization, but we could make an optimizer that canonizes pickles, because it only requires adding a single line of code (between the output of the pickler and the input of their hasher) to get working, so I don't think it is a big deal. The only problem is it would have to be done with a graph library to make it fast enough to be reasonable, and dill should not have any dependencies (except an optional dependency on numpy, which will likely go away soon anyway).

  • Also fix_imports should be off

fix_imports changes Python 2 pickles that do not contain code into Python 3 pickles. I don't know what you mean by "it should be left off", since it is in the Python unpickler, not ours. And it affects the unpickler, and we use the pickler for hashing.

The incompatibility is that there is no way to override the set reducer in cPickle. Unlike dicts, it is not possible to somehow use a persistent id even with careful design; there is an unlimited number of sets that can be created that can't be indexed by name.

leogama commented 2 years ago

For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:

Unfortunately, that might not be a good idea because there are many many different types of hashing algorithms, optimized for many different things, like speed, mathematical properties, security, etc. so we should not force people to use the same hash function.

Of course. I was thinking of having a hash_func parameter for a function with signature f(data: bytes, **kwargs) -> Any, with no default (or with a fast and dumb default like sha1).

I don't know how this would fix the set problem, but it would allow us to fix the code problem since we could select what we want to hash. This sort of idea might be more practical for a optimizer rather than a pickler.

This dill.hash function could do something like this:

import hashlib
from io import BytesIO

def hash(obj, hash_func=hashlib.sha1, protocol=None, byref=None, recurse=None, **kwargs):
    # process dill parameters/settings
    data = BytesIO()
    pickler = Pickler(data, protocol, byref, recurse, deterministic=True)  # pyPickler
    pickler.dispatch = pickler.dispatch.copy()
    pickler.dispatch[set] = pickler.dispatch[frozenset] = save_deterministic_set
    pickler.dump(obj)
    return hash_func(data.getvalue(), **kwargs)

It would be called like:

>>> hash = dill.hash(obj, hashlib.blake2b, salt=b'spam', byref=True).digest()

With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only.

huggingface uses the hash and the pickle, which would require double computation. Also, it is probably good to have minimal changes to the libraries downstream and they shouldn't have to rewrite their hashing/caching logic for us.

Won't they have to make minimal changes anyway, because "deterministic" mode will be off by default, as it implies extra computation too? They can add dumps = dill.Pickler.dumps or even dill.dumps = dill.Pickler.dumps at initialization.

And yes, I was talking about the Pickler memo and pickletools.optimize... But they don't seem to be of much relevance here; the same goes for protocol.

  • Also fix_imports should be off

fix_imports changes Python 2 pickles that do not contain code into Python 3 pickles. I don't know what you mean by should it be left off since it is in the Python uppickler, not ours. And it effects the unpickler and we use the pickler for hashing.

fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).

The incompatibility is that there is no way to override the set reducer in cPickle. Unlike dicts, it is not possible to somehow use persistent id with careful design, there are unlimited sets that can be created that can't be indexed by name.

I don't get the distinction between dictionaries and sets. You can make a dictionary deterministic by ordering its (immutable) keys, but there's no natural way of defining order for sets with recursive structures. That's it? It's like the problem of serializing the graph of type inheritance in MRO? Works in "most" cases, but not all.

anivegesana commented 2 years ago
>>> hash = dill.hash(obj, hashlib.blake2b, salt=b'spam', byref=True).digest()

Streaming is an essential property for some hashing protocols. In the case of huggingface datasets, you cannot fit a 1TB dataset in RAM to do the computation you described. Here is how huggingface fixed the issue: https://github.com/huggingface/datasets/blob/6cf0abaed1c2911e5bf23e48e4908929b43d85ae/src/datasets/fingerprint.py
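
A minimal sketch of what streaming hashing can look like (my illustration, not huggingface's actual code): feed the pickle stream into the hash as it is written, so the full pickle never has to be held in memory.

import hashlib
import dill

class HashWriter:
    # file-like sink that updates a hash instead of storing the pickled bytes
    def __init__(self, hasher):
        self.hasher = hasher
    def write(self, data):
        self.hasher.update(data)
        return len(data)

def streaming_hash(obj):
    hasher = hashlib.md5()
    dill.Pickler(HashWriter(hasher)).dump(obj)
    return hasher.hexdigest()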

It is best to leave that functionality in huggingface and just keep the same interface for pickling instead of adding a hashing library to dill.

Also, pickling and hashing a 1TB dataset will take a long time (maybe two hours), and we do not want to do part of that process twice, since that will make things take longer. Although the pickle likely only contains metadata (which would be much smaller), it would still take a considerable amount of time to recompute things that have already been computed.

fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).

Hashes are only stable when pickled with the same protocol and the same version of dill. huggingface invalidates caches from other versions of datasets, so this seems like common practice.

I don't get the distinction between dictionaries and sets. You can make a dictionary deterministic by ordering its (immutable) keys, but there's no natural way of defining order for sets with recursive structures. That's it? It's like the problem of serializing the graph of type inheritance in MRO? Works in "most" cases, but not all.

The distinction is that dictionaries are ordered in insertion order (the order in which keys were added to the dictionary), while sets are ordered by the last few bits of each element's hash, with the number of bits depending on the size of the set. Until Python 3.7, they shared the same implementation (the set one), but dicts changed to insertion order. Learn more from this talk: https://www.youtube.com/watch?v=C4Kc8xzcA68
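
A quick illustration of the difference (ordinary Python, nothing dill-specific):

>>> list({"b": 1, "a": 2})   # dict: insertion order, stable across runs
['b', 'a']
>>> list({"b", "a"})         # set: order derived from the element hashes
['b', 'a']                   # or ['a', 'b'], depending on the run (with str hash randomization)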

Explanation of why set canonization is an NP problem: https://github.com/uqfoundation/dill/issues/19#issuecomment-1135169415

An intuition of why it might be harder than it first seems: https://github.com/uqfoundation/dill/issues/19#issuecomment-1135161141

fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).

Not addressing it in my PR since Python 2 is deprecated.