mmckerns opened 9 years ago
Well, the extreme case of this is session pickling a whole set of modules, shipping them to a machine without the modules and a different OS, and expecting it all to be there. We could do it, but pickles would be large, so it would have to be another option.
And if said module contains unpicklables, we are in trouble.
Yeah, I don't know if you read my reply to the question on SO. I'm not sure if it's a good idea or not…
Are we not pickling the full module for user modules anyway? I think we include module contents in the pickle.
Currently, yes, unless diff is enabled. Not all of the classes are pickled, though, and functions are pickled by reference, aren't they?
Yes, standardly defined functions inside a module are pickled by reference -- as you can see with the F2 tag below (note the last three and quad_factory).
>>> import dill
>>> dill.detect.trace(True)
>>> import test_mixins
>>> z = dill.dumps(test_mixins)
M1: <module 'test_mixins' from 'test_mixins.pyc'>
F2: <function _import_module at 0x110212e60>
D2: <dict object at 0x11024bb40>
M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.4.dev0-py2.7.egg/dill/__init__.pyc'>
F1: <function func at 0x110277848>
F2: <function _create_function at 0x1102125f0>
Co: <code object func at 0x1102749b0, file "test_mixins.py", line 25>
F2: <function _unmarshal at 0x110212488>
D2: <dict object at 0x11028a910>
B2: <built-in function sum>
Ce: <cell at 0x110276558: int object at 0x7fc1e3c0c3f0>
F2: <function _create_cell at 0x1102128c0>
Ce: <cell at 0x110276590: int object at 0x7fc1e3c0c3c0>
Ce: <cell at 0x1102765c8: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x110276670: function object at 0x1102777d0>
F1: <function double_add at 0x1102777d0>
Co: <code object double_add at 0x110274b30, file "test_mixins.py", line 36>
D2: <dict object at 0x11028ab40>
D2: <dict object at 0x110234050>
Ce: <cell at 0x110276638: list object at 0x110237440>
D2: <dict object at 0x110232d70>
F1: <function invert at 0x1102776e0>
Co: <code object invert at 0x110274830, file "test_mixins.py", line 22>
D2: <dict object at 0x11028ad70>
D2: <dict object at 0x11023e6e0>
F2: <function quad_factory at 0x110277758>
F1: <function dec at 0x1102778c0>
Co: <code object dec at 0x110274c30, file "test_mixins.py", line 44>
D2: <dict object at 0x11028b168>
Ce: <cell at 0x110276750: int object at 0x7fc1e3c0c3d8>
Ce: <cell at 0x110276788: int object at 0x7fc1e3c0c3d8>
Ce: <cell at 0x1102767c0: int object at 0x7fc1e3c0c3f0>
D2: <dict object at 0x110244d70>
F1: <function func at 0x1102779b0>
Co: <code object func at 0x110274bb0, file "test_mixins.py", line 45>
D2: <dict object at 0x11028b5c8>
Ce: <cell at 0x110276600: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x1102766a8: int object at 0x7fc1e3c0c390>
Ce: <cell at 0x1102766e0: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x110276718: function object at 0x110277938>
F1: <function quadish at 0x110277938>
Co: <code object quadish at 0x110274d30, file "test_mixins.py", line 51>
D2: <dict object at 0x11028b7f8>
D2: <dict object at 0x11023e280>
D2: <dict object at 0x11024bd70>
F1: <function inner at 0x110277b18>
Co: <code object inner at 0x110274db0, file "test_mixins.py", line 58>
D2: <dict object at 0x11028bc58>
Ce: <cell at 0x1102767f8: function object at 0x110277aa0>
F1: <function quadruple at 0x110277aa0>
Co: <code object quadruple at 0x110274eb0, file "test_mixins.py", line 63>
D2: <dict object at 0x11028be88>
D2: <dict object at 0x11023ea28>
D2: <dict object at 0x1102864b0>
F2: <function wtf at 0x1102775f0>
F2: <function doubler at 0x110277a28>
F2: <function quad at 0x110277668>
>>>
Currently, it's only possible to disable pickling by reference for classes. It would be fairly easy to extend this to modules, functions, and other objects… I believe, as dill.source.getsource is pretty good at getting source code for objects defined in modules (it mainly leverages inspect in these cases).
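As a small illustration of what getsource can recover (a minimal sketch, using a pure-Python stdlib function as the target rather than an example from this thread):

import json
import dill.source

# dill.source.getsource returns the source of objects defined in modules,
# largely by leveraging the standard inspect module under the hood.
print(dill.source.getsource(json.dumps))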
After some deliberation, I think I categorize this as a feature request as opposed to a bug. I think dill has the right (default) behavior currently, but it should be possible to ship a module without requiring it to be in the PYTHONPATH of the target machine. If there are unpicklable items therein, then so be it… it will fail. It could be handled quite simply by porting the source of the module. It opens up a huge can of worms… and will likely fail if there are dependencies that are also not shipped. However, how different is this from the options that exist in dill for shipping a file associated with a python file object?
Well, currently we have a sort of HANDLE_FMODE for modules, and this feature wants a FILE_FMODE for modules. We could almost add a dill_mode option, with REF, PARTIAL and COMPLETE as values, and write 3 versions of every function.
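For reference, a minimal sketch of the existing file-handle modes that this proposal would mirror for modules (the file name data.txt is just an illustration):

import dill

# FILE_FMODE carries the file's contents inside the pickle so a copy can be
# recreated on the target; HANDLE_FMODE pickles essentially just the handle,
# so the named file must already exist on the target machine.
with open('data.txt', 'w') as f:
    f.write('hello')

handle = open('data.txt', 'r')
by_content = dill.dumps(handle, fmode=dill.FILE_FMODE)    # contents travel with the pickle
by_handle = dill.dumps(handle, fmode=dill.HANDLE_FMODE)   # target must already have the file
handle.close()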
Yes, I agree… that's, in reality, where things seem to be headed. That could take some significant work.
@sanbales: I originally built dill so it could support mystic (see: https://pypi.python.org/pypi/mystic)… where mystic is similar to what you are developing, and leverages dill for massively-parallel and distributed computing. In short, yes, there's interest -- and no, there's been no new progress toward it. If you'd like to pitch in, submit a PR.
@mmckerns: Thank you for the pointer to mystic and the fast reply. I'll try to see if I can make some progress. If the progress merits it, I'll submit a PR. Cheers!
I've hit this (functions pickling by reference) as a minor annoyance in some tests. I'm also shipping off pickled functions to computing clusters, and have some pytest test files where functions are defined. Effectively, I'm launching a separate (non-forked) process, and the function is then unpickled there.
However, pytest appears to load the test file without directly importing it or adding its path to the PYTHONPATH. So when the function is pickled, it references what appears to be a valid local module but actually isn't.
Amusingly, this is only a problem with functions in the tests - dill appears to do the right thing in most other cases. I've worked around it just by adding the test directory to the remote instances launched as part of the test, but a "force reference=false" option would definitely be helpful for functions, as long as the module self-reference could be dropped.
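For comparison, a minimal sketch of the switch that already exists for classes (the Config class here is purely illustrative), which is roughly what is being asked for functions:

import dill

# Classes can already be pickled by value instead of by reference;
# the request above is an analogous switch for functions.
class Config:
    retries = 3

payload = dill.dumps(Config, byref=False)  # the class definition travels in the pickle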
Hello, I'm having what seems to be a very similar issue. If I run my code in my debugger, dill.dump() and dill.load() work perfectly fine, as expected. When I run my code externally, using an application that uses my python code (the same code I ran in the debugger), dill.dump() throws an error saying:
Can't pickle <class 'simulator_utilities.SimulationParameters'>: it's not found as simulator_utilities.SimulationParameters
I decided to check the __name__ attribute and the object's __module__ attribute, both when running my code in the debugger and in the application, and I get the exact same values; all four say 'simulator_utilities'.
I'm at a loss, is there a known way to force this to work? I tried adding the directory containing simulator_utilities.py to my environment variables, but that didn't help.
def save_sim_params(self):
    # Save the simulation parameters in the local variable
    try:
        self.export_sim_param_frame_data()
        root = Tk()
        # self.__name__ = 'simulator_utilities'
        save_path = filedialog.askopenfilename(parent=root)  # Ask user which file to save to
        print("__name__")
        print(__name__)
        print(self.sim_params.__module__)
        dill.dump(self.sim_params, open(save_path, 'wb'), byref=False)
        root.destroy()
    except:
        AerialOptimizer.PrintException()
        print("Try saving again")
        root.destroy()
This code is part of a class called GUI() that's in the simulator_utilities.py file, and the class SimulationParameters is also in the simulator_utilities.py file.
EDIT: So I tried a hacky solution right after posting this, and I really don't like this solution as it makes my code not portable to other computers, but I guess it's okay for now. I just appended the directory that has simulator_utilities.py. Messy, but oh well.
import sys
sys.path.append(r'D:\Documents\DeepToot\RLBOT\simulator_bot')  # raw string so the backslashes aren't treated as escapes
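A somewhat more portable variant of the same workaround (a sketch, assuming this code lives in the same directory as simulator_utilities.py) avoids hard-coding the absolute Windows path:

import sys
from pathlib import Path

# Add the directory containing this file (and simulator_utilities.py) to sys.path,
# instead of hard-coding an absolute path.
sys.path.append(str(Path(__file__).resolve().parent))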
May I propose a portable solution for this: let the user be responsible (at pickling time) for specifying which paths are to be included at unpickling time. For more flexibility, the user can pass a function that takes care of this path manipulation. The function could involve the collapseuser / expanduser methods of Path for portability.
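A minimal sketch of the kind of hook being proposed (portable_paths and path_hook are hypothetical names; only expanduser is in pathlib, collapseuser is not):

import sys
from pathlib import Path

# At pickling time, the user records paths with '~' so they expand correctly
# on the target machine; the hook then fixes sys.path at unpickling time.
portable_paths = ['~/projects/foo', '~/projects/bar']

def path_hook():
    for p in portable_paths:
        expanded = str(Path(p).expanduser())
        if expanded not in sys.path:
            sys.path.append(expanded)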
@thisismygitrepo: Do you want the user, during the dump, to specify the paths on both the source and target resources? If so, I'm not sure it's a good idea. It has the potential to make the pickles very resource specific.
@leogama: FYI wrt #525, #527, etc.
@mmckerns I have an object that I want to unpickle, but I can't achieve that without first changing directory to the repo it came from, or equivalently adding that directory to PYTHONPATH before unpickling.
Thus, I, the user, am already responsible for doing this. Why can't we just formalize this contract between the user and the library at dump time?
Lastly, I figured out a workaround; it could be too much for an average user, though.
Your "package workaround" and stated rules about extending the PYTHONPATH for "scripts" are exactly what I designed the interface to be. So, this is good (because installing it as a package essentially puts the module on the PYTHONPATH).
The design decision was that module dependencies need to be available somewhere on the PYTHONPATH. I think that's reasonable. I'm open to some setting that enables recursive module dependencies, or something similar, but it needs some discussion (and work if deemed worthwhile).
After the recent changes to dump_module() (previously dump_session()), I'm not certain of what dump(module) should do for user modules. dump_module() is meant to save a module's state, not the module itself, so the assumption that the module should be available at loading makes sense.
dump(module) for user modules saves most of their contents but still saves classes by reference. Therefore the module also needs to be importable at loading.
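To make the distinction concrete, a minimal sketch (assuming a recent dill with dump_module(), and a hypothetical user module foo):

import dill
import foo  # hypothetical user module

# dump_module() saves the module's state (its namespace); foo itself still
# needs to be importable when the session file is loaded.
dill.dump_module('foo_session.pkl', module=foo)

# dump(module) pickles most of the module's contents, but classes are still
# saved by reference, so foo must also be importable at load time.
with open('foo.pkl', 'wb') as f:
    dill.dump(foo, f)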
Regarding the idea of doing some operation prior to loading a pickle, it's perfectly possible to create composed pickle files with two or more pickle streams. It may be used to implement some kind of "load hook". Using that earlier example:
import os
import dill
import foo

foo_path = os.path.dirname(foo.__spec__.origin)

def load_hook():
    import sys
    if foo_path not in sys.path:
        sys.path.append(foo_path)

my_file = "./foo.pkl"
with open(my_file, 'wb') as file:
    dill.dump(load_hook, file, recurse=True)
    dill.dump(foo, file)
In a different session, in which foo is not in sys.path:
import dill

my_file = "./foo.pkl"
with open(my_file, 'rb') as file:
    load_hook = dill.load(file)
    load_hook()
    foo = dill.load(file)
print(foo)
I know I'm restating this a bit, but to be clear... My expected usage pattern has been that if a module is to be dumped and shipped to be loaded elsewhere, the module dependencies should either be already installed as packages, or the user can ship them with a service like pathos or ppft (both of which use dill.source), or install the packages dynamically with a service like pox. Shipping multiple scripts (i.e. uninstalled modules with dependencies that are not guaranteed to be on the PYTHONPATH), to me, starts to edge into the territory of package installers and the like. The load hook, as suggested by @leogama, is a very reasonable approach... and can be used to alter the PYTHONPATH, or install a module, or a number of other module support options.
I've just read the entire thread. Shipping whole modules is complicated because of dependencies and such. But there's already a partial solution for this in the Standard Library: zipped modules, which can hold entire packages in a single file. These can be imported directly with zipimport.
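A minimal sketch of that mechanism (foo.py and foo.zip are just placeholder names):

import sys
import zipfile

# Package a module into a ZIP archive...
with zipfile.ZipFile('foo.zip', 'w') as zf:
    zf.write('foo.py')

# ...then put the archive on sys.path; zipimport handles the import transparently.
sys.path.insert(0, 'foo.zip')
import foo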
There's even an interesting possibility of concatenating a pickle file (e.g. from dump_module()) with a ZIP archive containing the same module whose state is saved in the pickle. Pickle streams may have any data appended to them, while ZIP archives may have any data prepended to them (they are read from the end!).
About the "load_hook" idea, I'm thinking that the preload function's pickle stream should go after the object's stream, so that the file would still read as a normal pickle file with load().
Such a file would follow some format like: stream (1), the object's pickle; then stream (2), the preload function's pickle; then the stored size N of stream (2) at the end of the file.
To read the pickle executing the preload function before it, the reader must roughly: read the stored size N from the end of the file, seek back N bytes to load and call the preload function, then seek back to the start and load the object. The stored stream size could be that of stream (1) too. Maybe it's even better, because it leaves space to change the file format in the future.
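A minimal sketch of that layout follows (this is not dill's actual format or API; dump_with_preload, load_with_preload and the 8-byte trailing size field are illustrative assumptions):

import struct
import dill

def dump_with_preload(obj, preload_hook, path):
    with open(path, 'wb') as f:
        dill.dump(obj, f)                              # stream (1): the object
        start = f.tell()
        dill.dump(preload_hook, f, recurse=True)       # stream (2): the preload hook
        f.write(struct.pack('<Q', f.tell() - start))   # trailing size N of stream (2)

def load_with_preload(path):
    with open(path, 'rb') as f:
        f.seek(-8, 2)                                  # read N from the end of the file
        (n,) = struct.unpack('<Q', f.read(8))
        f.seek(-(8 + n), 2)                            # seek back to stream (2)
        preload_hook = dill.load(f)
        preload_hook()                                 # run the hook first
        f.seek(0)
        return dill.load(f)                            # a plain load() still works on stream (1)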
Thinking about this "preload hook", I think I'll move the file-related stuff (_open(), _PeekableReader and _TruncatableWriter) from the session submodule to _utils, and rename truncatable to seekable (seek and truncate capabilities seem to go together in the io module) before it is consolidated by the coming release.
And here is a draft of an API proposal for the "preload hook" mechanism:
import foo
import dill

# Create an object that needs some setup before unpickling.
obj = foo.SomeClass('example')

# Define the preload function.
def setup_foo():
    ...  # code: set up things in foo for loading obj

with open('obj.pkl', 'wb') as file:
    dill.dump(obj, file, preload_hook=setup_foo)
In a different session:
import dill

with open('obj.pkl', 'rb') as file:
    obj = dill.load(file, exec_preload=True)  # load and call setup_foo(), then load obj
Alternatively:
import dill

with open('obj.pkl', 'rb') as file:
    obj = dill.load(file)  # just load obj, as if it was a common pickle file
Some important design aspects are:
- The preload_hook function is pickled with recurse=True.
- The hook only runs if the user passes the exec_preload option, with no setting to bypass it.

Haha! Look at what the top-rated unanswered question on SO (https://stackoverflow.com/questions/44560416/pickle-a-dynamically-imported-class) is about.
Edit: also the second top-most: https://stackoverflow.com/questions/42613964/serializing-custom-modules-together-with-object-in-python
I think I'll move the... [snip]... before it is consolidated by the coming release.
@leogama: The reason we are still hung up is that we pulled an incomplete solution. When that's resolved or rolled back, we will release... we shouldn't be messing with anything else unless there's a good reason (e.g. the refactor is to correct a design flaw from leaking into a release).
@leogama: The reason we are still hung up is that we pulled an incomplete solution.
What are you referring to? #475 or #527?
I was referring to issues stemming from #507 and #526. (Nominally to be resolved with #527).
Question from SO: http://stackoverflow.com/questions/31884640/does-the-dill-python-module-handle-importing-modules-when-sys-path-differs
Can I use dill to serialize and then load that module in a different process that has a different sys.path which doesn't include that module? Right now I get import failures:
Here's an example. I run this script where the foo.py module's path is in my sys.path:
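The original script wasn't preserved in this thread; a minimal sketch of what it presumably does (foo is a simple module importable from sys.path):

import dill
import foo  # importable here because foo.py's directory is on sys.path

with open('foo.pkl', 'wb') as f:
    dill.dump(foo, f)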
Now, I run this script where I do not have foo.py's directory in my PYTHONPATH:
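Likewise, a sketch of the second script, run where foo.py's directory is not on the path:

import dill

with open('foo.pkl', 'rb') as f:
    foo = dill.load(f)  # fails with an ImportError: the module was pickled by reference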
It fails with this stack trace:
So, if I need to have the same Python path between the two processes, then what's the point of serializing a Python module?