uqfoundation / dill

serialize all of Python
http://dill.rtfd.io

Discussion: cPickle #485

Open leogama opened 2 years ago

leogama commented 2 years ago

Hey, I've been working on a prototype of a "portable pickle" feature and started wondering whether it could eventually work with cPickle (or _pickle), the accelerator module written in C. I've also seen some comments in the code referencing the possibility that dill in general could use it.

However, after some investigation I noticed that, even if the save_<type> functions are updated to use state setters (Python 3.8+) and don't rely on internal Pickler methods like save and save_reduce, one big problem remains: cPickle will always save types and functions as globals, because there is no internal dispatch table that we can modify as with the pure-Python implementation.
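To make the dispatch-table point concrete, here is a minimal sketch (not dill's actual code) of how the pure-Python pickler lets a subclass swap the saver for FunctionType; the C accelerator (_pickle) offers no equivalent hook, which is the problem described above. `StubPickler` and the "stub:" convention are made up for illustration.

```python
import io
import pickle
import types

class StubPickler(pickle._Pickler):  # note: the pure-Python class, not the C one
    dispatch = pickle._Pickler.dispatch.copy()

    def save_function(self, obj):
        # Replace the function with a stand-in string instead of
        # emitting a GLOBAL reference.
        self.save("stub:" + obj.__qualname__)

    dispatch[types.FunctionType] = save_function

def f():
    pass

buf = io.BytesIO()
StubPickler(buf).dump(f)
print(pickle.loads(buf.getvalue()))  # stub:f
```

The same `dispatch[...] = ...` assignment against `pickle.Pickler` (the C class) is simply ignored, since _pickle dispatches internally in C.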

I thought of creating a C extension module to play with cPickle internals, but all the C functions are private. The only remaining alternative I see is to have a complete clone of Pickler code with some minor tweaks to allow overriding these types' saving.

Have you also wondered about these issues? Do you think it is feasible to have a C extension compatible with multiple Python 3 versions? (I have no idea how building and distributing these works.)

anivegesana commented 2 years ago

It is doable. You ship a precompiled binary for every OS/Python version pair that you want to support. There aren't too many combinations: just 20, plus one noarch fallback. The reason I didn't pursue it is that it would require some cleverness to build the C extension around the Python package in a way that reduces code duplication but still yields fast performance.

This causes an issue for people who want to use pickle5 in older versions of Python. The pickle5 package contains a C clone of pickle like the one you described, but it would not be as easy to get it working with a hypothetical C dill as with the current Python implementation.

If you would like to pursue this, I would say make a clone of cPickle that has all of its functions exposed, and we can extend from it when the C extension is present.

The other option is a complete rewrite of dill to use dispatch_table instead of dispatch, like cloudpickle did; but considering that people expect the dill internal API to be relatively stable (huggingface/datasets#4379), this is not a good idea, and rebuilding our own cPickle is the way to go. We could also optimize dill's functions while leaving the Python code in place, using Cython to gain a little bit more.
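For reference, a minimal sketch of the dispatch_table route cloudpickle took: it is honored by the C-accelerated Pickler, but it only keys on exact types, so it cannot change how dicts or save-as-global objects are handled. `Point` and `reduce_point` are made up for illustration.

```python
import io
import pickle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def reduce_point(p):
    # Standard reduction tuple: (callable, args)
    return (Point, (p.x, p.y))

class TablePickler(pickle.Pickler):  # C-accelerated base class
    dispatch_table = {Point: reduce_point}

buf = io.BytesIO()
TablePickler(buf).dump(Point(1, 2))
p = pickle.loads(buf.getvalue())
print((p.x, p.y))  # (1, 2)
```

This is the public, supported hook; dill's `dispatch` mechanism predates it and exposes internals that users now depend on, which is the stability concern above.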

This only needs to be done for CPython. PyPy's pickle module is precompiled but runs within Python. @mmckerns Do you have stats on how many IronPython or Jython users dill has? I do not even know if there is support for them or if they are still maintained.

mmckerns commented 2 years ago

Do you have stats on how many IronPython or Jython users dill has? I do not even know if there is support for them or if they are still maintained.

As far as I know, they are both supported. The most detailed stats I track are here: https://pypistats.org/packages/dill, which doesn't give details on Jython/IronPython.

(EDIT: I should also mention these stats, which track total downloads by release version of dill. While a bit tangential to your question, they are worthwhile for seeing how many people are pinning to older release versions.)

anivegesana commented 2 years ago

I just noticed that they have no stable Python 3 builds, so they are probably implicitly unsupported now that Python 2 is no longer supported. The only binaries that need to be built are for CPython, which makes this problem much easier.

leogama commented 2 years ago

Just saw the cloudpickle source. I was wrong: the default saving of functions and classes can be changed with the reducer_override method of Pickler subclasses from Python 3.8 on. I don't know why I thought it was only for user classes, since those already have __reduce__, etc. But, crucially, the saving of dictionaries cannot be changed (though maybe dill could use persistent_id for __main__.__dict__ and other edge cases).
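A minimal sketch of the reducer_override hook (Python 3.8+): unlike dispatch_table, it is consulted for functions and classes too, so a subclass of the C-accelerated Pickler can bypass save-as-global. The string reduction here is just a placeholder, not a real serialization strategy. (Note the documented caveat: for performance, reducer_override is not called for exact instances of int, str, dict, set, etc., which is why dict saving still can't be changed this way.)

```python
import io
import pickle
import types

class OverridePickler(pickle.Pickler):
    def reducer_override(self, obj):
        if isinstance(obj, types.FunctionType):
            # Placeholder reduction: rebuild as a descriptive string.
            return (str, ("<function %s>" % obj.__qualname__,))
        return NotImplemented  # everything else: default behavior

def g():
    pass

buf = io.BytesIO()
OverridePickler(buf).dump(g)
print(pickle.loads(buf.getvalue()))  # <function g>
```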

anivegesana commented 2 years ago

Well, reducer_override can be helpful if you need something that works on all subclasses of a certain type (like type in the dill codebase), but it definitely isn't general enough. It is possible to turn the postprocessing mechanism into reductions. But as you said, dictionaries (and sets and frozensets, after I finish #19) will prevent us from taking cloudpickle's approach. Using persistent_id would prevent the end user from using that feature (unless you can come up with something clever), so that isn't too helpful.

mmckerns commented 2 years ago

I remember looking at the cloudpickle code for _pickle (here), from when they managed to get some changes into Python on a PR with no associated PEP (sometimes this happens). To do the same, I think it'd be a full rewrite. I deferred solving this problem until later -- and it turns out maybe the time is now. The speed difference between pickle and _pickle is pretty significant.
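For anyone who wants to see the gap on their own machine, a rough benchmark sketch (the payload and iteration counts are arbitrary; numbers vary widely by machine and object graph):

```python
import io
import pickle
import timeit

data = {i: list(range(10)) for i in range(1000)}

def py_dumps(obj):
    # Force the pure-Python implementation via pickle._Pickler.
    buf = io.BytesIO()
    pickle._Pickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return buf.getvalue()

t_py = timeit.timeit(lambda: py_dumps(data), number=20)
t_c = timeit.timeit(lambda: pickle.dumps(data, pickle.HIGHEST_PROTOCOL),
                    number=20)
print("pure Python: %.3fs, C accelerator: %.3fs" % (t_py, t_c))
```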

leogama commented 2 years ago

But as you said, dictionaries (and sets and frozensets after I finish #19) will prevent doing cloudpickle's approach.

Oh, you are mentioning the "deterministic" feature, right? Would this be an optional feature? Because since Python 3.7, the insertion order for dictionaries and sets is preserved, so it should also be preserved in pickling. If sorting the keys or items of dicts/sets is only needed for standalone objects, it could be implemented as a special case in dill.dump(s) instead of modifying the saver.
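A sketch of that "special case in dump" idea (a hypothetical helper, not a dill API): re-key a top-level dict in sorted order before pickling, so the output bytes don't depend on insertion history. Nested dicts would still pickle in insertion order, which is exactly the limitation of handling only standalone objects.

```python
import pickle

def deterministic_dumps(obj):
    # Hypothetical: sort only a top-level dict before pickling.
    if isinstance(obj, dict):
        obj = {key: obj[key] for key in sorted(obj)}
    return pickle.dumps(obj)

a = {"b": 2, "a": 1}
b = {"a": 1, "b": 2}
assert deterministic_dumps(a) == deterministic_dumps(b)
assert pickle.dumps(a) != pickle.dumps(b)  # plain dumps is order-sensitive
```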

Using persistent_id will prevent the end user from using that feature (unless you can come up with something clever,) so that isn't too helpful.

This would be just for __main__.__dict__ and the like.

I'm considering that if it is possible to use standard _pickle, even at the cost of not implementing some minor new features, it would be the best option for maintenance. And even if someday the need arises to implement a _pickle.Pickler clone, it will be better to have most logic in Python and little to no changes to base code, just extra functions.

leogama commented 2 years ago

@mmckerns, there seems to be a lot going on in dill's development plans. Personally, I was hooked by the challenges posed (I was in need of a new puzzle and love learning more about Python 😉) and I'm committed to working on these if you feel my contributions are valuable. I've sent some PRs, mostly drafts, after the session bugfixes, but there's no urgency about them.

However, what do you think of opening a board in GitHub "Projects", a wiki page, or even an issue with a to-do list to track all these parallel efforts? I'm beginning to get lost among so many issues and PRs (some open, others closed...). There is currently:

mmckerns commented 2 years ago

I've managed large software projects, and found that the only/best way to track issues/effort is to have a tool that is integrated with the tickets/PRs. So, a project view in GitHub could make sense. As a lightweight step, I've found that adding a milestone to the ticket/PR can make it easy to track what is intended for the next pending release.

There is always a lot of development to be done -- just not always the time to do it. :) I'd really like to get dill.source working in Python/Jupyter again, better cPickle integration, and deterministic pickles. I've updated the packaging, and that's mostly done aside from some mild annoyances. Next on my agenda is to drop 2.7 support. (DONE)

anivegesana commented 2 years ago

Oh, you are mentioning the "deterministic" feature, right? Would this be an optional feature? Because since Python 3.7, the insertion order for dictionaries and sets is preserved, so it should also be preserved in pickling. If sorting the keys or items of dicts/sets is only needed for standalone objects, it could be implemented as a special case in dill.dump(s) instead of modifying the saver.

Yes, it is an optional feature, but very popular packages like pulumi and huggingface datasets need it, so we can't just ignore it. The insertion-order guarantee for dictionaries only applies to user-created dictionaries. Global variable dictionaries are still non-deterministic, so the feature is still needed. Insertion order preservation is not guaranteed for sets, never was, and isn't true in CPython. It would be cool, and probably possible, to just expose the _pickle save functions, if you want to make a PR to CPython.

This would be just for __main__.__dict__ and the like.

I'm considering that if it is possible to use standard _pickle, even at the cost of not implementing some minor new features, it would be the best option for maintenance. And even if someday the need arises to implement a _pickle.Pickler clone, it will be better to have most logic in Python and little to no changes to base code, just extra functions.

persistent_id is used here:

https://github.com/python/cpython/blob/3.7/Lib/pickle.py#L489-L492

Only one self.persistent_id can exist for any Pickler, and if we override it, users would need to make sure to extend our self.persistent_id. That would break packages like torch (https://github.com/pytorch/pytorch/blob/8449ac770c29a52257a3f2842c1dfd06d2c2a497/torch/serialization.py#L538-L583), which has its own persistent_id mechanism that might collide with our definition, because the key can be arbitrary.

When optimizing, remember the most important rule: safety first. Do not mix adding new code (adding persistent_id) with optimizing (offloading to _pickle). Do not make any modification that might, in any circumstance, change the behavior of the code when optimizing. Break them up into separate PRs so it is easy to reason about each optimization and show that nothing changed; otherwise it will inexplicably break people's code even though all of the tests pass. Even adding a new feature that shouldn't change how anybody uses dill can cause very strange bugs that are difficult to trace (#483). I would recommend making a separate module for the _pickle implementation, like cloudpickle did, to avoid those sorts of issues. If anyone wants to use the _pickle module, they can opt in.

leogama commented 2 years ago

Some thoughts about persistent_id:

anivegesana commented 2 years ago

A solution like that would be hairy and awkward. The hash idea avoids collisions in the sense that it is possible to know whether an object belongs to dill or not, but it doesn't avoid collisions in the way I described for torch: it would be impossible to distinguish whether a persistent id belonged to dill or was a tensor that was just missing from the model and should result in an error. So they would need to add an if statement to their code to delegate to dill's persistent_id if dill is installed and the id belongs to dill, and everyone using persistent_id would have to do the same. Although dill is a popular package, I don't think we can beat the entire Python community into using our custom conventions for pickling without scaring most people off.

mmckerns commented 2 years ago

I don't think we can beat the entire Python community into using our custom conventions for pickling without scaring most people off.

You can't do that. It is possible, however, to provide something super useful and easy to integrate, so that developers choose to use it to get super nice additional functionality. I'd add an if statement to my hashing code if it meant I could get a 10x speed-up on serialization...

leogama commented 2 years ago

Since I'm not a professional developer, but rather an academic that happens to develop library code occasionally, I'll guide myself 99% by your experience.

I'm just throwing stuff at the wall here before dedicating any time to prototyping solutions. (By the way, the "portable mode" prototype is working! I'll create a draft PR by tomorrow.)


Another wild idea I had:

Is it possible to override the __getattribute__ (or __getattr__) method of dill (or _dill) so that it can retrieve some arbitrary object, like a module's dictionary, when called via find_class? This wouldn't clash with other packages.

anivegesana commented 2 years ago

Why not use _getattribute?

https://github.com/python/cpython/blob/9369942054fe3fe389f4f4ff808d33c5d7945052/Lib/pickle.py#L322

Of course, we should add a copy of it to dill in case CPython gets rid of it, since it is private.

It would be less efficient, but we could create an optimizer that smooths over it. In either case, I don't think there is an easy way to force _pickle.Pickler to emit the GLOBAL instruction without somehow dispatching to Python code or cleaning it up in optimization.
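For the loading side, a minimal sketch of the find_class hook being discussed: every GLOBAL opcode resolves through it, so lookups can be redirected there without touching any module __getattr__. The `OLD_NAME` mapping and the hand-built pickle bytes are made up for illustration.

```python
import collections
import io
import pickle

class RedirectUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) == ("__main__", "OLD_NAME"):
            # Redirect a (hypothetical) stale global reference.
            return collections.OrderedDict
        return super().find_class(module, name)

# A protocol-0 pickle: GLOBAL "__main__ OLD_NAME" followed by STOP.
data = b"c__main__\nOLD_NAME\n."
obj = RedirectUnpickler(io.BytesIO(data)).load()
print(obj is collections.OrderedDict)  # True
```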