stepleton-xx opened this issue 10 years ago
@stepleton: On my most recent commit, I did notice that the ordering of the _typemap is different in 3.x [where x = 3, and possibly 2] versus older versions, including the entire 2.x line. However, in some limited testing, I didn't experience any failures to pick up LambdaType versus FunctionType, for example. I didn't, however, stress test it to make sure it absolutely wasn't a problem. _reverse_typemap, as is, should include both LambdaType and FunctionType (and the other three cases where there are two identical types). _typemap has the types themselves as the keys, so there will be 4 fewer items in _typemap than in _reverse_typemap. What does that mean? Does it cover the issue?
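For illustration (a quick check outside the thread), the duplicated names are easy to confirm from the standard types module:

import types

# LambdaType and FunctionType are literally the same type object, which is why
# _typemap (keyed by the type) ends up smaller than _reverse_typemap (keyed by name).
print(types.LambdaType is types.FunctionType)                 # True
print(types.BuiltinMethodType is types.BuiltinFunctionType)   # True (another such pair)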
@register can only register the type, so it's equivalent to @register(FunctionType) or @register(LambdaType), I believe. The pickled object shouldn't know what type it is, so as long as either LambdaType or FunctionType (reverse-type)maps to the same type... the lookup using the object as the key should uncover one of the two (LambdaType or FunctionType), depending on the possibly random order the dict was populated with.
That was my thinking in declaring it "not a problem".
Do you have any test cases for dill actually failing here? Not that the _typemap is deterministic -- because, as you say, it is quasi-deterministic in 3.3+ -- but a case where dill fails to correctly dump/load a type that it should know. Something like, save a "lambda" to a (session) file, then try loading it 1000 times -- if it fails once or more, then there's a problem. Is this something like the case you had in mind?
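A minimal sketch of that stress test (the file name is only illustrative):

import dill

f = lambda x: x + 1
with open("lambda.pkl", "wb") as fh:
    dill.dump(f, fh)

# Reload many times; any failure or wrong result would indicate a real problem.
for _ in range(1000):
    with open("lambda.pkl", "rb") as fh:
        g = dill.load(fh)
    assert g(1) == 2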
You can make the hash randomization deterministic for testing purposes by setting the PYTHONHASHSEED environment variable to a fixed integer before starting the interpreter.
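Since the variable must be set before the interpreter starts, one way to demonstrate the effect (a hedged sketch, not from the thread) is via a child process:

import os, subprocess, sys

env = dict(os.environ, PYTHONHASHSEED="0")   # fixed seed -> reproducible str hashes
out = subprocess.run(
    [sys.executable, "-c", "print(hash('spam'))"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())   # identical across invocations; varies if the seed is unset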
Any news regarding providing determinism for dill? I am using it only for hashing, so is there any workaround?
Are you using Python 3.7+? Because dictionaries now preserve insertion order, for any particular version of Python after 3.6 the type dictionary should be deterministic, and this issue no longer applies. Since dill 0.3.6 will no longer support Python 2.7, this issue should no longer be relevant, if I am not mistaken.
Thanks for the follow-up. It still gives different hashes in python 3.8.10, despite being deterministic if we start a new execution:
In [3]: import dill as d; from hashlib import md5; f = lambda x: x + 1; dum = d.dumps(f); md5(dum).hexdigest()
Out[3]: 'bbe58ab30aec9b1feb80cd3e4b2b7024'
In [4]: import dill as d; from hashlib import md5; f = lambda x: x + 1; dum = d.dumps(f); md5(dum).hexdigest()
Out[4]: '3fc384d6edf117061f781113a60fd053'
I'm not seeing the same behavior in 3.8.13.
Python 3.8.13 (default, May 10 2022, 11:26:38)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> dill.settings['recurse'] = True
>>> from hashlib import md5
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'65e4cf11903c322b610b68ddb7c71283'
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'65e4cf11903c322b610b68ddb7c71283'
>>>
>>> dill.settings['recurse'] = False
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>> f = lambda x:x+1
>>> dum = dill.dumps(f)
>>> md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
The difference in our outputs is not due to the Python version, but due to how different consoles handle the context (it seems). If I use python instead of ipython it works, but if I put everything inside the same file and run it as a script, it doesn't work anymore. I've seen something in the past about a global dict that stores the module's functions, and that changes the content of what is pickled. As far as I can recall, the problem is the same if you have a file with other functions defined, not necessarily a repetition. E.g.:
File A: def f(); def h()
File B: def g(); def h()
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill; from hashlib import md5; dill.settings['recurse'] = True; f = lambda x:x+1; dum = dill.dumps(f); dill.settings['recurse'] = False; f = lambda x:x+1; dum = dill.dumps(f); md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>> import dill; from hashlib import md5; dill.settings['recurse'] = True; f = lambda x:x+1; dum = dill.dumps(f); dill.settings['recurse'] = False; f = lambda x:x+1; dum = dill.dumps(f); md5(dum).hexdigest()
'a882bb1cb887cee113682e6a827455c0'
>>>
I see the behavior in IPython
Python 3.8.13 (default, May 10 2022, 11:26:38)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.2.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import dill
In [2]: from hashlib import md5
In [3]: f = lambda x:x+1
In [4]: dum = dill.dumps(f)
In [5]: md5(dum).hexdigest()
Out[5]: '2a7b752183b27084ae5b3806d6b6fae3'
In [6]: f = lambda x:x+1
In [7]: dum = dill.dumps(f)
In [8]: md5(dum).hexdigest()
Out[8]: 'f5b16c000549792da22973e74cf4dd3a'
Looks like IPython is doing something to modify the function, or one of its dependencies (like globals), so I agree, it's probably a contextual issue.
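One way to investigate the context-dependence described above (a sketch, not from the thread) is to inspect which globals dill would need to capture for a function, using dill.detect.globalvars:

import dill.detect

y = 10
f = lambda x: x + y

# The referenced globals (and hence the bytes that end up in the pickle) can
# differ between a plain script, the REPL, and IPython.
print(dill.detect.globalvars(f))   # e.g. {'y': 10}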
A couple of other Python libraries (like huggingface datasets) depend on persistent hashing of dill pickles, so it is something that would affect at least one popular Python library in production.
What huggingface does is sort the names of the globals dictionary and it mostly works for their purposes. I am planning to rewrite globals in #466, so I am trying to make it easy to add custom processing to global dictionaries outside of dill. And it would, at minimum, benefit the guys over at huggingface.
As huggingface noticed, another big thing holding back the determinism of dill pickles is the line numbers, file name, and other environmental information stored in the code objects. We could add a setting to dill to turn these features off and sort the globals by name.
@davips For a temporary fix if you are still using dill 0.3.4 (they haven't updated the code for dill 0.3.5 yet), you can just snag the code from huggingface datasets (either import their pickler or just copy the code out of their repo.) I will be creating a more sustainable solution that is built into dill.
!pip install datasets

from hashlib import md5
import io
import dill
from datasets.utils.py_utils import Pickler

def dumps(obj, **kwds):
    # Use huggingface datasets' Pickler (which patches dill for more stable hashing)
    # instead of dill.dumps directly.
    file = io.BytesIO()
    Pickler(file, **kwds).dump(obj)
    return file.getvalue()
>>> md5(dill.dumps(lambda: 'q')).hexdigest()
d6cd2da806a4384c10bbea6a2c496b3c
>>> md5(dill.dumps(lambda: 'q')).hexdigest()
1d5a198b1ffbe2c3e3118c0ff6a6eb3b
>>> md5(dumps(lambda: 'q')).hexdigest()
211cbff14077367c0c870d82a56f94bb
>>> md5(dumps(lambda: 'q')).hexdigest()
211cbff14077367c0c870d82a56f94bb
Thanks very much @anivegesana ! It seems to work fine this way and should enable resuming some interesting projects here... 🥇
As far as I am aware, sets (and frozensets) are nondeterministic but are also unsortable in some cases:
>>> sorted({object(), object()})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'object' and 'object'
The best that can be done is to either continue to produce nondeterministic pickles only when the objects in the set are incomparable or to produce nondeterministic pickles for sets no matter what.
Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.
you can sort anything if you give a sort key that provides a rule for how to sort.
>>> sorted({object(), object()}, key=lambda x:id(x))
[<object object at 0x10ca54560>, <object object at 0x10ca545d0>]
Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.
Unfortunately, stability of sets is restricted to a few cases (besides not being official), and that doesn't include bytes.
https://stackoverflow.com/a/45589593/9681577
https://stackoverflow.com/a/58330556/9681577
The order will be lost after putting the serialized items inside a set again, so it would need a custom class for sets that implements a __reduce__ using a list.
I can't get into dill internals now, but if it is a recursive function, it seems easy:

def dump(obj):
    if isinstance(obj, set):
        # Pickle each element, sort the resulting bytes, and pickle a wrapper
        # that preserves that order.
        pickled_items = sorted(dump(item) for item in obj)   # list of bytes objects
        return dump(CustomSet(pickled_items))
    # ... handle other types ...
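For context, CustomSet in the sketch above is left undefined; a minimal version of the idea (the names are illustrative only, not dill's API) might look like:

import dill

class CustomSet(set):
    # Pickles as a sorted list of already-pickled items so the serialized
    # order is stable, and rebuilds the set on load.
    def __reduce__(self):
        items = sorted(dill.dumps(item) for item in self)
        return (_rebuild_customset, (items,))

def _rebuild_customset(pickled_items):
    return CustomSet(dill.loads(item) for item in pickled_items)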
I didn't take a deep look into it, but I would be against using this to solve the problem since it would have O(n^2) time and space complexity for nestings that look like this:

class A:
    def __init__(self, a=None):
        self.a = a

def b(o, N):
    # Wrap o in N nested single-element sets: {A({A({A(o)})})}...
    for i in range(N):
        o = {A(o)}
    return o
Even worse, this causes infinite recursion:

a = A(5)
s = {a}
a.a = s   # the set now (indirectly) contains itself
Even for less pathological cases, I would imagine that that solution could be very slow. I didn't have time yet to think about the average case, but I have a hunch that it is just as bad.
Do you say O(n^2) because the pickled items will be repickled? I'm no specialist in pickle, but I remember that at least protocol=5 does something clever with bytes objects (or numpy arrays, as I recall).
Anyway, there are other solutions. For instance, handling sorting case by case is a solution, but some fallback is needed for the cases not addressed (e.g. user classes?). I also don't like a costly fallback, but if that's the only option at the moment to ensure determinism... the user could choose whether to enforce determinism "at all costs" or just for the easier cases. Something like a set of modes/flags (just some thoughts here, don't take them literally).
About the recursion, is dill.settings['recurse']=True necessary for determinism? Sorry if this is a stupid question.
Yeah. In that example, in order to pickle b(4), you would need to pickle b(3) once for sorting and once again for the real pickle, b(2) once in calculating a hash for b(3) in sorting for b(4), once again in calculating b(3) itself, and once again in the final pickle, etc. The reason that we cannot use dynamic programming/caching to do one pass is because the memo (the cache that keeps track of all objects so far) changes after pickling each object and the order in the memo would change based on the order in which objects are pickled, so the memo would become invalid. On top of that, there is no way to rewind the pickler, so we would have to create copies of the pickler that have different memos, but dump their output to a BytesIO. It is a bit of a mess and I am not sure if it would be able to save medium to large objects consisting of sets and dictionaries in a reasonable amount of time.
I also found that there is a difference between hashability (needed to put an object in a dictionary or set) and comparability (needed to sort it), so this same issue applies to dictionaries in general as well:
>>> k1 = object()
>>> k2 = object()
>>> {k1: 7, k2: 4}
{<object object at 0x104d8ce40>: 7, <object object at 0x104d8ce30>: 4}
To avoid complexity, I think we should attempt to sort dictionaries, sets, and frozensets and fall back to not changing them if an incomparable object is present. We can emit a warning if this happens so it is not completely silent, which would be equivalent to a mode that lets you switch between "normal" and "permissive" in your design. If someone wants to use "restrictive", they could use warnings filters to turn it into an error. If you have any other ideas for "full determinism," I'll implement it if it is feasible. (I didn't notice that for "full determinism" you wanted 100% infallible determinism instead of more costly operations that result in more stability. Scroll down to a later comment to see why this is mathematically infeasible for general Python objects.)
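To make the warnings-filter idea concrete (a sketch only; the warning message and category are assumptions, not an actual dill API):

import warnings

# If dill emitted, say, a UserWarning whenever it falls back to a
# nondeterministic ordering, a "restrictive" user could escalate it:
warnings.filterwarnings(
    "error",
    message=".*nondeterministic.*",   # illustrative message pattern
    category=UserWarning,
)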
Determinism would be a separate feature from recursive functions. You would be able to use determinism to create deterministic pickles of sets and dicts either way.
The feature that you are thinking of in pickle5 is being able to use out-of-band buffers. Unfortunately, I believe that if we were to use those in dill, it would prevent users of the library from using them, because there can only be one hook for out-of-band data in pickle5.
you can sort anything if you give a sort key that provides a rule for how to sort.
>>> sorted({object(), object()}, key=lambda x:id(x))
[<object object at 0x10ca54560>, <object object at 0x10ca545d0>]
Yes, but the id is a unique identifier that is only stable within a single run of the interpreter. Unpickling will create new objects, which will have different ids. In CPython, the id is just the memory location. Unfortunately, all of this makes it impossible to use id alone to create deterministic pickles.
Maybe it doesn't help, depending on how dill works, but I'll leave it here as a possible future approach for sets: the items of the set will be sortable after being pickled.
Fixing the pickle in post might be a reasonable approach to solve this issue. Unfortunately, the objects in the sets could be (and likely are) built from other objects in the pickle file. This makes it kind of difficult because we would need to determine which objects depend on which other objects, build a graph, and perform some operations on the graph to create a deterministic ordering. This is at least as hard as NP because the graph isomorphism problem, which is in NP, reduces to it (if we were able to create a truly 100% deterministic pickle, we could test whether two pickles are equal in order to construct a 100% deterministic equality operator that checks that all of the children, grandchildren, etc. of the current object are also the same - i.e. isomorphism).
Nonetheless, I am sure research papers on this topic exist that come up with very reasonable heuristics for solving this problem in polynomial time for most use cases, but it will require a significant amount of time investment to implement. At minimum, it would create pickles that are as good as "permissive/restrictive" mode, so it could be a later addition.
I'm not suggesting to use id. The point was, you can sort object instances, given you pick a key that allows them to be ordered. FYI, you may also want to check out one of my other codes, klepto, which can use dill for generating hashes; lots of alternate encoding schemes are possible. If nothing else, it could give you more test cases.
Cool. In a vein of thought that connects this to graph isomorphism, I wonder if there is a way to use Weisfeiler-Lehman as the comparison operator between Python objects. It would be interesting and might be a good way to combine all of the thoughts we have here, and might be efficient with a little bit of additional caching of results. It would be a rather significant undertaking, but we should try it in the future.
Alright. That explanation was a little bit mathy, so here is a brief description of what I considered doing.
It would have been really cool to implement the method @davips suggested (which I converted into a graph problem so we could canonize the graph in breadth-first order instead of depth-first order, so that early termination results in a partial canonization of the pickle, i.e. it gets closer to a deterministic pickle after each step of the graph algorithm, whereas the recursive method wouldn't make progress without arbitrary lookahead), but it would either have significant performance overhead (for DF) or require a rather confusing implementation to convert the pickle into a graph (for BF), so I feel this problem is best solved by either an external library or a later extension to the dill library.
I feel that assuming all elements are comparable is a reasonable assumption that covers most use cases of this feature, and the added overhead of handling the general case would serve a much rarer occurrence. Most Python data structures are homogeneous, and the much more complex case isn't used directly by either pulumi or huggingface. Although end users could use both libraries with data structures with incomparable members, neither library implements fixes to the problem at large, so it may not even be a feature that would be frequently used.
@davips @AaronFriel @albertvillanova @lhoestq @mmckerns
For now, this is what I have chosen for the behavior of "deterministic" mode:
- set and frozenset will be sorted before being pickled.
- Subclasses of set and frozenset will not be affected (and will remain nondeterministic) because they can implement their own __reduce__ functions, which don't have to follow the conventions of set's pickling procedure.
- If the elements of a set are not comparable (like complex), they will be sorted by their hash instead. This will not create a natural order of elements that is easy to understand, but if the __hash__ function of the class doesn't depend on id, it will be deterministic.
- dict and subclasses will remain pickled in insertion order.
- code objects will not have their line numbers removed, and the file name and function name will not be modified. huggingface can keep overriding save_code like they already do, but issues like #488 would make that difficult to maintain, and it would be best to try to find a way to emit information that would help users selectively zero out/ignore line information in code objects when hashing. That way, information that debuggers need still remains in the pickle, but the hash remains consistent.

Let me know if any of these design decisions need to be changed or if this is good general behavior that approximates determinism well.
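A rough illustration of the ordering rule in the first and third bullets above (a sketch of the intent, not dill's actual code):

def _ordered_elements(s):
    # Prefer the natural sort order; if elements are incomparable (e.g. complex),
    # fall back to ordering by hash, which is deterministic as long as __hash__
    # doesn't depend on id().
    try:
        return sorted(s)
    except TypeError:
        return sorted(s, key=hash)

print(_ordered_elements({3, 1, 2}))   # [1, 2, 3]
print(_ordered_elements({1j, 2j}))    # ordered by hash, not by value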
Sounds wise. Will it be possible to know if the generated dump is nondeterministic (via warning or in any way)?
Another question, possibly stupid: is it possible for the user to provide a custom pickling function for dill just like CustomJSONEncoder? From what I've seen here, it seems impossible for some reason.
Unfortunately, it is not always possible to know if the generated dump is nondeterministic. Classes that other people define could be nondeterministic themselves, and it would be impossible for dill to tell if that was the case. I am planning on adding warnings for set, frozenset, and collections.Counter, so those cases should be covered. It just might not catch everything in general.
The StackOverflow post says that it is possible and gives a way to do it.
def recreate_A(a):
    return A(a)

@dill.register(A)
def save_A(pickler, obj):
    # Tell dill to pickle A instances as a call to recreate_A(obj.a)
    pickler.save_reduce(recreate_A, (obj.a,), obj=obj)
The "no" was for modifying dill._objects, which probably won't change how dill pickles objects.
Let me know if any of these design decisions need to be changed or if this is good general behavior that approximates determinism well.
It sounds good to me :) I agree it would be nice to have a more robust way than overriding save_code or save_function to ignore the line numbers. The solution for the globals dictionary sounds good as well!
I agree it would be nice to have a more robust way than overriding save_code or save_function to ignore the line numbers. The solution for the globals dictionary sounds good as well!
I am currently planning on adding a hook (patch_code_object or something similarly named) that will allow for the modification of code objects before they are actually pickled. Unfortunately, that would require huggingface to modify patch_code_object whenever Python makes a change to code objects that affects line numbers. Thankfully, that is relatively rare, but it does make this interface a little inconvenient. I do not have any better alternatives because others might want to retain line numbers. I am open to suggestions.
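For a sense of what such a hook could do (patch_code_object itself is only proposed above), here is a sketch using the standard CodeType.replace available since Python 3.8:

def strip_position_info(code):
    # Normalize fields that leak file/line information; a real hook would also
    # need to recurse into nested code objects inside co_consts and handle the
    # line table.
    return code.replace(co_filename="<stripped>", co_firstlineno=1)

f = lambda x: x + 1
print(strip_position_info(f.__code__).co_filename)   # "<stripped>"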
@leogama Since the set and frozenset changes would conflict with #485, I will leave them out. I can add them in later, after we get a working cPickle implementation, to prevent issues with compatibility, potentially dropping deterministic pickling of sets if it turns out to be infeasible to do in cPickle.
For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:
class BasePickler:  # metaclass? ABC?
    # implements most methods from current 'dill.Pickler'
    # 'dumps', 'loads', 'copy' and 'dump_session' can be implemented as class methods

class Pickler(BasePickler, pickle._Pickler):
    # keep pyPickler as the "standard" for backward compatibility with modifications of '_Pickler.dispatch' by users
    # can be used for a 'dill.hash' function using the "deterministic" mode (including sets)

class cPickler(BasePickler, pickle.Pickler):
    # use this for helper functions 'dill.dump', 'dill.dumps' and 'dill.copy', also 'dill.dump_session':
    # 'dill.dump' is implemented as usual, as it is not used for hashing (I think) and clashes with the method
    # 'dill.dumps', etc: dumps = cPickler.dumps
With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only. If they want to calculate the hash and use the pickle of the same object, they should use Pickler.dumps(obj, deterministic=True) and calculate the hash by themselves.
Unrelated questions about hashing, @anivegesana, if you have seen the code of packages that use dill for it:
- Which pickle protocol do they use for hashing? Anything above 3 just adds framing and offloading capabilities and increases the pickle size significantly for small objects...
- Do they use memoization or optimization?
- Also fix_imports should be off
Just curious...

@leogama chiming in here, here's how Pulumi uses it when we serialize functions - a Python-specific capability:
We're not too concerned with the size of the blob, but we use the default pickle protocol with value 4 in the version of Python we're using.
@leogama Thinking from a hash-only perspective is a good idea for some cases. Recently, I used decompyle3 + pickle_protocol5 to hash functions, and that is just impractical to keep working across new Python versions.
@leogama
For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:
Unfortunately, that might not be a good idea because there are many different types of hashing algorithms, optimized for many different things, like speed, mathematical properties, security, etc., so we should not force people to use the same hash function. I don't know how this would fix the set problem, but it would allow us to fix the code problem, since we could select what we want to hash. This sort of idea might be more practical for an optimizer than for a pickler.
With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only.
huggingface uses the hash and the pickle, which would require double computation. Also, it is probably good to have minimal changes to the libraries downstream and they shouldn't have to rewrite their hashing/caching logic for us.
- Which pickle protocol do they use for hashing? Anything above 3 just adds framing and offloading capabilities and increases the pickle size significantly for small objects...
I do not think that size has to do with this entire process since hashes are fixed length, but I think davips answered the second part of this question perhaps.
- Do they use memoization or optimization?
There are multiple senses of the word memoization. huggingface caches (memoizes) entire machine learning datasets using PyArrow and a "deterministic" fork of dill. I am assuming that is what you mean, since memoization in the Python pickler has to be on to do anything useful (almost every data structure is recursive); it is a historical note in the development of pickle.
None of the libraries use optimization, but we could make an optimizer that canonizes pickles, because it only requires adding a single line of code (between the output of the pickler and the input of their hasher) to get working, so I don't think it is a big deal. The only problem is that it would have to be done with a graph library to make it fast enough to be reasonable, and dill should not have any dependencies (except an optional dependency on numpy, which will likely go away soon anyway).
- Also fix_imports should be off
fix_imports changes Python 2 pickles that do not contain code into Python 3 pickles. I don't know what you mean by it should be left off, since it is in the Python unpickler, not ours. And it affects the unpickler, and we use the pickler for hashing.
The incompatibility is that there is no way to override the set reducer in cPickle. Unlike dicts, it is not possible to somehow use persistent id with careful design; there are unlimited sets that can be created that can't be indexed by name.
For exclusive hashing purposes, if/when dill can work with cPickle, it could have a dill.hash(obj) function that forces the use of "pyPickle" and maybe also have a cache mechanism. The Pickler class hierarchy could be something like:
Unfortunately, that might not be a good idea because there are many different types of hashing algorithms, optimized for many different things, like speed, mathematical properties, security, etc., so we should not force people to use the same hash function.
Of course. I was thinking of having a hash_func parameter for a function with signature f(data: bytes, **kwargs) -> Any, with no default (or with a fast and dumb default like sha1).
I don't know how this would fix the set problem, but it would allow us to fix the code problem, since we could select what we want to hash. This sort of idea might be more practical for an optimizer than for a pickler.
This dill.hash function could do something like this:
def hash(obj, hash_func=hashlib.sha1, protocol=None, byref=None, recurse=None, **kwargs):
    # process dill parameters/settings
    data = BytesIO()
    pickler = Pickler(data, protocol, byref, recurse, deterministic=True)  # pyPickler
    pickler.dispatch = pickler.dispatch.copy()
    pickler.dispatch[set] = pickler.dispatch[frozenset] = save_deterministic_set
    pickler.dump(obj)
    return hash_func(data.getvalue(), **kwargs)
It would be called like:
>>> hash = dill.hash(obj, hashlib.blake2b, salt=b'spam', byref=True).digest()
With this setting, users should be instructed to use dill.hash for hashing only and dill.dumps for pickling only.
huggingface uses the hash and the pickle, which would require double computation. Also, it is probably good to have minimal changes to the libraries downstream and they shouldn't have to rewrite their hashing/caching logic for us.
Won't they have to make minimal changes anyway, because "deterministic" will be off by default, as it implies extra computation too? They can add dumps = dill.Pickler.dumps or even dill.dumps = dill.Pickler.dumps at initialization.
And yes, I was talking about the Pickler memo and pickletools.optimize... But they don't seem to be of much relevance here, idem for protocol.
- Also fix_imports should be off
fix_imports changes Python 2 pickles that do not contain code into Python 3 pickles. I don't know what you mean by it should be left off, since it is in the Python unpickler, not ours. And it affects the unpickler, and we use the pickler for hashing.
fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).
The incompatibility is that there is no way to override the set reducer in cPickle. Unlike dicts, it is not possible to somehow use persistent id with careful design; there are unlimited sets that can be created that can't be indexed by name.
I don't get the distinction between dictionaries and sets. You can make a dictionary deterministic by ordering its (immutable) keys, but there's no natural way of defining order for sets with recursive structures. That's it? It's like the problem of serializing the graph of type inheritance in MRO? Works in "most" cases, but not all.
>>> hash = dill.hash(obj, hashlib.blake2b, salt=b'spam', byref=True).digest()
Streaming is an essential property for some hashing protocols. In the case of huggingface datasets, you cannot fit a 1TB dataset in RAM to do the computation you described. Here is how huggingface fixed the issue: https://github.com/huggingface/datasets/blob/6cf0abaed1c2911e5bf23e48e4908929b43d85ae/src/datasets/fingerprint.py
It is best to leave that functionality in huggingface and just keep the same interface for pickling instead of adding a hashing library to dill.
Also, pickling and hashing a 1TB dataset will take a long time (maybe two hours) and we do not want to do part of that process twice, since that will make things take longer; although the pickle likely only contains metadata (which would be much smaller), it would still take a considerable amount of time to recompute things that are already computed.
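To illustrate the streaming point (a sketch only, independent of how huggingface datasets actually implements it): the pickler can write directly into a hash object, so the full pickle never needs to be held in memory:

import hashlib
import dill

class HashWriter:
    # Minimal file-like sink: everything the pickler writes is fed straight
    # into the hash object.
    def __init__(self, hasher):
        self.hasher = hasher
    def write(self, data):
        self.hasher.update(data)
        return len(data)

def streaming_hash(obj, algorithm="md5"):
    hasher = hashlib.new(algorithm)
    dill.Pickler(HashWriter(hasher)).dump(obj)
    return hasher.hexdigest()

print(streaming_hash(lambda x: x + 1))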
fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).
Hashes are only stable when pickled with the same protocol and the same version of dill. huggingface invalidates caches from other versions of datasets, so this seems like common practice.
I don't get the distinction between dictionaries and sets. You can make a dictionary deterministic by ordering its (immutable) keys, but there's no natural way of defining order for sets with recursive structures. That's it? It's like the problem of serializing the graph of type inheritance in MRO? Works in "most" cases, but not all.
The distinction is that dictionaries are ordered in insertion order (the order that keys were added to the dictionary), while sets are ordered by the last few bits of the elements' hashes, with the number of bits depending on the size of the set. Until Python 3.7, they shared the same implementation (the set one), but dicts changed to insertion order. Learn more from this talk: https://www.youtube.com/watch?v=C4Kc8xzcA68
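A quick illustration of the difference (the output comments are indicative; the set order in particular depends on hash randomization):

# Dicts preserve insertion order (guaranteed since Python 3.7):
d = {}
d["b"] = 1
d["a"] = 2
print(list(d))   # ['b', 'a'] -- the order the keys were inserted

# Set iteration order follows the elements' hashes and the table size, not
# insertion order, and str hashes are randomized per process unless
# PYTHONHASHSEED is fixed:
s = {"b", "a"}
print(list(s))   # order may vary between runs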
Explanation of why set canonization is an NP problem: https://github.com/uqfoundation/dill/issues/19#issuecomment-1135169415
An intuition of why it might be harder than it first seems: https://github.com/uqfoundation/dill/issues/19#issuecomment-1135161141
fix_imports changes Python 3 names into Python 2 compatible names, in the Pickler, and then unpickling reverts the transformation. For hashing, it is just overhead (like using framing) and can even be deprecated in the future, changing the hashes of very simple objects. But dill also modifies the global saving of built-in types by using _load_type, so this is a broader question for pickled value stability (not determinism).
Not addressing it in my PR since Python 2 is deprecated.
Pickles of functions (and pickles of things that contain functions, like classes) are not quite deterministic---they depend on iteration order of the _reverse_typemap inside dill.py. Depending on the order, either the symbol "LambdaType" or "FunctionType" will be used to represent functions. Either will work as far as unpickling goes, but having different representations of the same value can cause trouble with e.g. caching.
While most invocations of Python 2.x yield the same iteration order for _reverse_typemap, use of the -R flag (recommended for user-facing services; c.f. http://www.ocert.org/advisories/ocert-2011-003.html) randomizes this order.
Note that the functionality of -R is on by default for versions >= 3.3: http://docs.python.org/3/whatsnew/3.3.html