uqfoundation / dill

serialize all of Python
http://dill.rtfd.io

serialized module fails if module is not in PYTHONPATH #123

Open mmckerns opened 9 years ago

mmckerns commented 9 years ago

Question from SO: http://stackoverflow.com/questions/31884640/does-the-dill-python-module-handle-importing-modules-when-sys-path-differs

Can I use dill to serialize and then load that module in a different process that has a different sys.path which doesn't include that module? Right now I get import failures:

Here's an example. I run this script where the foo.py module's path is in my sys.path:

    % cat dill_dump.py 
    import dill
    import foo
    myFile = "./foo.pkl"
    fh = open(myFile, 'wb')
    dill.dump(foo, fh)

Now, I run this script where I do not have foo.py's directory in my PYTHONPATH:

    % cat dill_load.py 
    import dill
    myFile = "./foo.pkl"
    fh = open(myFile, 'rb')
    foo = dill.load(fh)
    print foo

It fails with this stack trace:

    Traceback (most recent call last):
      File "dill_load.py", line 4, in <module>
        foo = dill.load(fh)
      File "/home/b/lib/python/dill-0.2.4-py2.6.egg/dill/dill.py", line 199, in load
        obj = pik.load()
      File "/rel/lang/python/2.6.4-8/lib/python2.6/pickle.py", line 858, in load
        dispatch[key](self)
      File "/rel/lang/python/2.6.4-8/lib/python2.6/pickle.py", line 1133, in load_reduce
        value = func(*args)
      File "/home/b/lib/python/dill-0.2.4-py2.6.egg/dill/dill.py", line 678, in _import_module
        return __import__(import_name)
    ImportError: No module named foo

So, if I need to have the same python path between the two processes, then what's the point of serializing a python module?

matsjoyce commented 9 years ago

Well, the extreme case of this is session pickling a whole set of modules, shipping them to a machine without the modules and a different OS, and expecting it all to be there. We could do it, but pickles would be large, so it would have to be another option.

matsjoyce commented 9 years ago

And if said module contains unpicklables, we are in trouble.

mmckerns commented 9 years ago

Yeah, I don't know if you read my reply to the question on SO. I'm not sure if it's a good idea or not…

mmckerns commented 9 years ago

Are we not pickling the full module for user modules anyway? I think we include module contents in the pickle.

matsjoyce commented 9 years ago

Currently, yes, unless diff is enabled. Not all of the classes are pickled, though, and functions are pickled by reference, aren't they?

mmckerns commented 9 years ago

Yes, standardly defined functions inside a module are pickled by reference -- as you can see with the F2 tag below (note the last three and quad_factory).

>>> import dill
>>> dill.detect.trace(True)
>>> import test_mixins
>>> z = dill.dumps(test_mixins)
M1: <module 'test_mixins' from 'test_mixins.pyc'>
F2: <function _import_module at 0x110212e60>
D2: <dict object at 0x11024bb40>
M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.4.dev0-py2.7.egg/dill/__init__.pyc'>
F1: <function func at 0x110277848>
F2: <function _create_function at 0x1102125f0>
Co: <code object func at 0x1102749b0, file "test_mixins.py", line 25>
F2: <function _unmarshal at 0x110212488>
D2: <dict object at 0x11028a910>
B2: <built-in function sum>
Ce: <cell at 0x110276558: int object at 0x7fc1e3c0c3f0>
F2: <function _create_cell at 0x1102128c0>
Ce: <cell at 0x110276590: int object at 0x7fc1e3c0c3c0>
Ce: <cell at 0x1102765c8: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x110276670: function object at 0x1102777d0>
F1: <function double_add at 0x1102777d0>
Co: <code object double_add at 0x110274b30, file "test_mixins.py", line 36>
D2: <dict object at 0x11028ab40>
D2: <dict object at 0x110234050>
Ce: <cell at 0x110276638: list object at 0x110237440>
D2: <dict object at 0x110232d70>
F1: <function invert at 0x1102776e0>
Co: <code object invert at 0x110274830, file "test_mixins.py", line 22>
D2: <dict object at 0x11028ad70>
D2: <dict object at 0x11023e6e0>
F2: <function quad_factory at 0x110277758>
F1: <function dec at 0x1102778c0>
Co: <code object dec at 0x110274c30, file "test_mixins.py", line 44>
D2: <dict object at 0x11028b168>
Ce: <cell at 0x110276750: int object at 0x7fc1e3c0c3d8>
Ce: <cell at 0x110276788: int object at 0x7fc1e3c0c3d8>
Ce: <cell at 0x1102767c0: int object at 0x7fc1e3c0c3f0>
D2: <dict object at 0x110244d70>
F1: <function func at 0x1102779b0>
Co: <code object func at 0x110274bb0, file "test_mixins.py", line 45>
D2: <dict object at 0x11028b5c8>
Ce: <cell at 0x110276600: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x1102766a8: int object at 0x7fc1e3c0c390>
Ce: <cell at 0x1102766e0: int object at 0x7fc1e3c0c3f0>
Ce: <cell at 0x110276718: function object at 0x110277938>
F1: <function quadish at 0x110277938>
Co: <code object quadish at 0x110274d30, file "test_mixins.py", line 51>
D2: <dict object at 0x11028b7f8>
D2: <dict object at 0x11023e280>
D2: <dict object at 0x11024bd70>
F1: <function inner at 0x110277b18>
Co: <code object inner at 0x110274db0, file "test_mixins.py", line 58>
D2: <dict object at 0x11028bc58>
Ce: <cell at 0x1102767f8: function object at 0x110277aa0>
F1: <function quadruple at 0x110277aa0>
Co: <code object quadruple at 0x110274eb0, file "test_mixins.py", line 63>
D2: <dict object at 0x11028be88>
D2: <dict object at 0x11023ea28>
D2: <dict object at 0x1102864b0>
F2: <function wtf at 0x1102775f0>
F2: <function doubler at 0x110277a28>
F2: <function quad at 0x110277668>
>>> 

mmckerns commented 9 years ago

Currently, it's only possible to disable pickling by reference for classes. It would be fairly easy to extend this to modules, functions, and other objects… I believe, as dill.source.getsource is pretty good at getting source code for objects defined in modules (it mainly leverages inspect in these cases)
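
To illustrate the "ship the source" idea, here is a minimal sketch that uses the standard library's inspect.getsource directly (the comment above notes that dill.source.getsource mainly leverages inspect in these cases). It is not dill's implementation. The module name srcdemo_mod and the temporary directory are made up for the example, and the approach as shown does not carry over globals or closures that the function depends on.

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Hypothetical module on the "sending" side, written to a temp directory.
src_dir = tempfile.mkdtemp()
Path(src_dir, "srcdemo_mod.py").write_text(
    "def triple(x):\n    return 3 * x\n"
)
sys.path.insert(0, src_dir)
importlib.invalidate_caches()
import srcdemo_mod

# Sender: capture the function's source text instead of a reference to it.
shipped_source = inspect.getsource(srcdemo_mod.triple)

# Receiver: srcdemo_mod is no longer importable, so rebuild from source.
sys.path.remove(src_dir)
del sys.modules["srcdemo_mod"]
namespace = {}
exec(shipped_source, namespace)
rebuilt_triple = namespace["triple"]

print(rebuilt_triple(7))
```

The receiving side never imports srcdemo_mod; it only needs the source string, which could travel inside a pickle like any other str.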

mmckerns commented 9 years ago

After some deliberation, I think I categorize this as a feature request as opposed to a bug. I think dill has the right (default) behavior currently, but it should be possible to ship a module without requiring it to be in the PYTHONPATH of the target machine. If there are unpicklable items therein, then so be it… it will fail. It could be handled quite simply in porting the source of the module. It opens up a huge can of worms… and will likely fail if there are dependencies that are also not shipped. However, how different is this than the options that exist in dill for shipping a file associated with a python file object?

matsjoyce commented 9 years ago

Well, currently we have a sort of HANDLE_FMODE for modules, and this feature wants a FILE_FMODE for modules. We could almost add a dill_mode option, with REF, PARTIAL and COMPLETE as values, and write 3 versions of every function.

mmckerns commented 9 years ago

Yes, I agree… that's, in reality, where things seem to be headed. That could take some significant work.

sanbales commented 7 years ago

I was wondering if there was still interest in adding this functionality to dill. It would be extremely valuable for some functionality we are trying to develop to ship instances of OpenMDAO's Problems to high performance computers.

mmckerns commented 7 years ago

@sanbales: I originally built dill so it could support mystic (see: https://pypi.python.org/pypi/mystic)… where mystic is similar to what you are developing, and leverages dill for massively-parallel and distributed computing. In short, yes, there's interest -- and no, there's been no new progress toward it. If you'd like to pitch in, submit a PR.

sanbales commented 7 years ago

@mmckerns: Thank you for the pointer to mystic and the fast reply. I'll try to see if I can make some progress. If the progress merits it, I'll submit a PR. Cheers!

ndevenish commented 6 years ago

I've hit this (functions pickling by reference) as a minor annoyance in some tests. I'm also shipping off pickled functions to computing clusters, and have some pytest test files where functions are defined. Effectively, I'm launching a separate (non-forked) process, which then attempts to unpickle the function.

However, pytest appears to load the test file without directly importing it or adding its path to the pythonpath. So when the function is pickled, it references what appears to be a valid local module but actually isn't.

Amusingly, this is only a problem with functions in the tests - dill appears to do the right thing in most other cases. I've worked around it just by adding the test directory to the remote instances launched as a part of the test, but definitely "Force reference=false" would be helpful for functions, as long as the module self-reference could be dropped.
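
The failure mode described here can be reproduced with the standard library alone. The following is a minimal sketch, using pickle rather than dill and a throwaway module named refdemo_mod (a hypothetical name): the payload contains only the module and function names, so loading fails once the module's directory is off sys.path.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical throwaway module written to a temporary directory.
mod_dir = tempfile.mkdtemp()
Path(mod_dir, "refdemo_mod.py").write_text("def double(x):\n    return 2 * x\n")

sys.path.insert(0, mod_dir)
importlib.invalidate_caches()
import refdemo_mod

# Pickle the function: only a reference (module name + function name) is stored.
payload = pickle.dumps(refdemo_mod.double)
print(len(payload), b"refdemo_mod" in payload)

# Simulate a process whose sys.path lacks the module's directory.
sys.path.remove(mod_dir)
del sys.modules["refdemo_mod"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
    load_error = exc

print(type(load_error).__name__)
```

Adding the directory back to sys.path before loading (the workaround mentioned above) makes the same payload load again.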

mgiglia92 commented 4 years ago

Hello, I'm having what seems to be a very similar issue. If I run my code in my debugger, dill.dump() and dill.load() work perfectly fine, as expected. When I run my code externally, using an application that uses my python code (the same code I ran in the debugger), dill.dump() throws an error saying:

Can't pickle <class 'simulator_utilities.SimulationParameters'>: it's not found as simulator_utilities.SimulationParameters

I decided to check the __name__ parameter and the object.__module__ parameter when I run my code in the debugger and in the application, and I get exactly the same values: all four say 'simulator_utilities'.

I'm at a loss, is there a known way to force this to work? I tried adding the directory containing simulator_utilities.py to my environment variables, but that didn't help.

def save_sim_params(self):
    # Save the simulation parameters in the local variable
    try:
        self.export_sim_param_frame_data()
        root = Tk()
        # self.__name__ = 'simulator_utilities'
        save_path = filedialog.askopenfilename(parent=root) # Ask user to file to save to
        print("__name__")
        print(__name__)
        print(self.sim_params.__module__)
        dill.dump(self.sim_params, open(save_path, 'wb'), byref=False)
        root.destroy()
    except:
        AerialOptimizer.PrintException()
        print("Try saving again")
        root.destroy()

This code is part of a class called GUI() that's in the simulator_utilities.py file, and the class SimulationParameters is also in the simulator_utilities.py file.

EDIT: So I tried a hacky solution right after posting this, and I really don't like it, as it makes my code not portable to other computers, but I guess it's okay for now. I just appended the directory that has simulator_utilities.py to sys.path. Messy, but oh well.

    import sys
    sys.path.append(r'D:\Documents\DeepToot\RLBOT\simulator_bot')

thisismygitrepo commented 2 years ago

May I propose a portable solution for this: let the user be responsible (at pickling time) for specifying which paths are to be included at unpickling time. For more flexibility, the user can pass a function that takes care of this path manipulation. The function could involve the collapseuser and expanduser methods of Path for portability.
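
A rough sketch of what such path handling might look like, under stated assumptions: pathlib.Path provides expanduser but no collapseuser, so collapse_user below is a hypothetical helper, as are portable_paths and apply_paths. None of these are part of dill.

```python
import sys
from pathlib import Path


def collapse_user(path):
    """Replace the home directory prefix with '~' (hypothetical helper)."""
    path = Path(path)
    try:
        return "~/" + path.relative_to(Path.home()).as_posix()
    except ValueError:
        # Not under the home directory: keep the path unchanged.
        return str(path)


def portable_paths(paths):
    """At pickling time: turn local paths into home-relative strings."""
    return [collapse_user(p) for p in paths]


def apply_paths(stored_paths):
    """At unpickling time: expand '~' and extend sys.path if needed."""
    for stored in stored_paths:
        expanded = str(Path(stored).expanduser())
        if expanded not in sys.path:
            sys.path.append(expanded)


# The directory name is an example only; it does not need to exist.
stored = portable_paths([Path.home() / "code" / "my_repo"])
apply_paths(stored)
print(stored)
```

The stored strings are independent of the user name on the dumping machine, which is the portability gain being proposed; paths outside the home directory remain machine-specific.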

mmckerns commented 2 years ago

@thisismygitrepo: Do you want the user, during the dump, to specify the paths on both the source and target resources? If so, I'm not sure it's a good idea. It has the potential to make the pickles very resource specific. @leogama: FYI wrt #525, #527, etc.

thisismygitrepo commented 2 years ago

@mmckerns I have an object that I want to unpickle, but I can't achieve that without first changing directory to the repo it came from, or equivalently adding that repo to PYTHONPATH before unpickling.

Thus, I, the user, am already responsible for doing this. Why can't we just formalize this contract between the user and the library at dump time?

Lastly, I figured out a workaround

Could be too much for an average user.

mmckerns commented 2 years ago

Your "package workaround" and stated rules about extending the PYTHONPATH for "scripts" are exactly what I designed the interface to be. So, this is good (because installing it as a package essentially puts the module on the PYTHONPATH).

The design decision was that module dependencies need to be available somewhere on the PYTHONPATH. I think that's reasonable. I'm open to some setting that enables recursive module dependencies, or something similar, but it needs some discussion (and work if deemed worthwhile).

leogama commented 2 years ago

After the recent changes to dump_module() (previously dump_session()), I'm not certain what dump(module) should do for user modules. dump_module() is meant to save a module's state, not the module itself, so the assumption that the module should be available at loading makes sense.

dump(module) for user modules saves most of their contents but still saves classes by reference. Therefore the module also needs to be importable at loading.

Regarding the idea of doing some operation prior to loading a pickle, it's perfectly possible to create composed pickle files with two or more pickle streams. It may be used to implement some kind of "load hook". Using that early example:

import os
import dill
import foo

foo_path = os.path.dirname(foo.__spec__.origin)
def load_hook():
    import sys
    if foo_path not in sys.path:
        sys.path.append(foo_path)

my_file = "./foo.pkl"
with open(my_file, 'wb') as file:
    dill.dump(load_hook, file, recurse=True)
    dill.dump(foo, file)

At a different session in which foo is not in sys.path:

import dill
my_file = "./foo.pkl"
with open(my_file, 'rb') as file:
    load_hook = dill.load(file)
    load_hook()
    foo = dill.load(file)
    print(foo)

mmckerns commented 2 years ago

I know I'm restating this a bit, but to be clear... My expected usage pattern has been that if a module is to be dumped and shipped to be loaded elsewhere, the module dependencies should either be already installed as packages, or the user can ship them with a service like pathos or ppft (both of which use dill.source), or install the packages dynamically with a service like pox. Shipping multiple scripts (i.e. uninstalled modules with dependencies that are not guaranteed to be on the PYTHONPATH), to me, starts to edge into the territory of package installers and the like. The load hook, as suggested by @leogama, is a very reasonable approach... and can be used to alter the PYTHONPATH, or install a module, or a number of other module support options.

leogama commented 2 years ago

I've just read the entire thread. Shipping whole modules is complicated because of dependencies and stuff. But there's already a partial solution for this in the standard library: zipped modules, which can hold entire packages in a single file. These can be imported directly with zipimport.
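
A minimal sketch of the zipped-module route, using only the standard library. The archive name bundle.zip and the module zipdemo_mod are made up for the example.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Build a ZIP archive holding a hypothetical one-file module.
zip_path = Path(tempfile.mkdtemp()) / "bundle.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("zipdemo_mod.py", "ANSWER = 42\n")

# Putting the archive itself on sys.path lets the import system
# (through zipimport) load modules from inside it.
sys.path.insert(0, str(zip_path))
importlib.invalidate_caches()
import zipdemo_mod

print(zipdemo_mod.ANSWER, zipdemo_mod.__file__)
```

A whole package directory can be written into the archive the same way, so a single file can carry a module together with its pure-Python dependencies.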

There's even an interesting possibility of concatenating a pickle file (e.g. from dump_module()) with a ZIP archive containing the same module whose state is saved in the pickle. Pickle streams may have any data appended to them, while ZIP archives may have any data prepended to them (they are read from the end!).
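
The concatenation can be tried out with the standard library. This sketch uses pickle in place of dump_module() and a plain dict as a stand-in for a module's state; the file and module names are hypothetical.

```python
import io
import pickle
import tempfile
import zipfile
from pathlib import Path

state = {"counter": 3}  # stand-in for a module's saved state

# Build the ZIP archive in memory.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("statedemo_mod.py", "counter = 0\n")

# Write the pickle stream first, then append the ZIP archive.
combined_path = Path(tempfile.mkdtemp()) / "state_and_module.pkl"
with open(combined_path, "wb") as file:
    pickle.dump(state, file)
    file.write(zip_buffer.getvalue())

# pickle.load stops at the end of the first pickle stream.
with open(combined_path, "rb") as file:
    loaded_state = pickle.load(file)

# zipfile locates the archive from the end of the file,
# so the prepended pickle bytes are skipped.
with zipfile.ZipFile(combined_path) as archive:
    names = archive.namelist()
    module_source = archive.read("statedemo_mod.py").decode()

print(loaded_state, names)
```

Both readers work on the same file without knowing about each other, which is what makes the combined format attractive.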

About the "load_hook" idea, I'm thinking that the preload function's pickle stream should go after the object's stream, so that it would still read as a normal pickle file by load().

Such a file would follow some format like:

  1. Object's pickle stream
  2. Preload function's pickle stream
  3. "Footer" section containing the size of stream (2) in bytes, in a fixed-size binary field.

To read the pickle executing the preload function before it, the reader must do roughly:

  1. Open the file
  2. Seek to the end (file size - integer field size)
  3. Read the size of stream (2) into N
  4. Seek back N bytes
  5. Load and call the preload function
  6. Seek to the beginning
  7. Finally, load the object

The stored stream size could be of stream (1) too. Maybe it's even better because it leaves space to change the file format in the future.
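
The layout and reading steps above can be sketched with the standard library. Everything here is an assumption made for illustration: the 8-byte little-endian footer, the function names dump_with_preload and load_with_preload, and the use of plain pickle, which limits the hook to importable callables (site.addsitedir wrapped in functools.partial stands in for a real preload function that dill could pickle by value).

```python
import functools
import os
import pickle
import site
import struct
import sys
import tempfile

FOOTER = struct.Struct("<Q")  # assumed footer: 8-byte little-endian size


def dump_with_preload(obj, preload, file):
    """Write stream (1), stream (2), then the size of stream (2)."""
    pickle.dump(obj, file)
    hook_bytes = pickle.dumps(preload)
    file.write(hook_bytes)
    file.write(FOOTER.pack(len(hook_bytes)))


def load_with_preload(file):
    """Call the preload function first, then load the object."""
    file.seek(-FOOTER.size, os.SEEK_END)                # seek to the footer
    (hook_size,) = FOOTER.unpack(file.read(FOOTER.size))  # read N
    file.seek(-(FOOTER.size + hook_size), os.SEEK_END)  # start of stream (2)
    pickle.load(file)()                                 # load and call the hook
    file.seek(0)                                        # back to the beginning
    return pickle.load(file)                            # finally, the object


extra_dir = tempfile.mkdtemp()
# Stand-in hook: a standard-library callable that plain pickle can store.
hook = functools.partial(site.addsitedir, extra_dir)

pkl_path = os.path.join(tempfile.mkdtemp(), "obj.pkl")
with open(pkl_path, "wb") as file:
    dump_with_preload({"value": 1}, hook, file)

with open(pkl_path, "rb") as file:
    plain = pickle.load(file)           # still reads as an ordinary pickle
    restored = load_with_preload(file)  # preload first, then the object

print(plain, restored, os.path.abspath(extra_dir) in sys.path)
```

Storing the size of stream (1) instead, as suggested above, would only change which offset the reader computes from the footer.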

leogama commented 2 years ago

Thinking about this "preload hook", I think I'll move the file-related stuff (_open(), _PeekableReader and _TruncatableWriter) from the session submodule to _utils, and rename truncatable to seekable (seek and truncate capabilities seem to go together in the io module) before it is consolidated by the coming release.

And here is a draft of an API proposal for the "preload hook" mechanism:

import foo
import dill

# Create an object that needs some setup before unpickling.
obj = foo.SomeClass('example')

# Define the preload function.
def setup_foo():
    ...  # code: set up things in foo for loading obj

with open('obj.pkl', 'wb') as file:
    dill.dump(obj, file, preload_hook=setup_foo)

In a different session:

import dill
with open('obj.pkl', 'rb') as file:
    obj = dill.load(file, exec_preload=True)  # load and call setup_foo(), then load obj

Alternatively:

import dill
with open('obj.pkl', 'rb') as file:
    obj = dill.load(file)  # just load obj, as if it was a common pickle file

Some important design aspects are:

leogama commented 2 years ago

Haha! Look at what the top-rated unanswered question in SO (https://stackoverflow.com/questions/44560416/pickle-a-dynamically-imported-class) is about.

Edit: also the second top-most: https://stackoverflow.com/questions/42613964/serializing-custom-modules-together-with-object-in-python

mmckerns commented 2 years ago

I think I'll move the... [snip]... before it is consolidated by the coming release.

@leogama: The reason we are still hung up is that we pulled an incomplete solution. When that's resolved or rolled back, we will release... we shouldn't be messing with anything else unless there's a good reason (e.g. the refactor is to correct a design flaw from leaking into a release).

leogama commented 2 years ago

@leogama: The reason we are still hung up is that we pulled an incomplete solution.

What are you referring to? #475 or #527?

mmckerns commented 2 years ago

I was referring to issues stemming from #507 and #526. (Nominally to be resolved with #527).