Separate extensions for Picklers

eode commented 2 years ago

I'd like to add support for pickle extensions as a separate concept. So, for example, "foo.pkl.bz2" is gracefully handled, and "baz.json" is gracefully handled. I would be changing the default pickle_method arguments from "pickle" to "infer" -- but the behavior in the majority of circumstances would be the same.

But overall, this issue is a little more touchy than just adding basic JSON support, because it could cause API breakage. I suppose the core of what I'm asking is:

What level of Public-facing API change are you willing to accept?

I imagine changing the default pickler_method to 'infer' is fine.
I imagine adding new functions is fine
I imagine the API of dump, dumps, load, and loads should remain substantially the same.

However, what about repurposing or renaming other functions? Example:

get_registered_extensions should be repurposed or renamed, because its function is no longer fully clear if there are also pickler extensions
- for example, get_registered_extensions() could:
  - leave it as-is, but that would make the name somewhat deceptive
  - return all registered extensions, for both picklers and compressers, but that would require a different data format (because picklers will have more overlap in extension handling, so something like {"pkl": ["pickle", "dill", "cloudpickle"], "bz": ["bz2"]}
  - be renamed to get_registered_compresser_extensions
  - ..etc

What I would like to do with it is:

If a function's output changes, rename it to explicitly break compatibility (rather than implicitly by having different output)
If a function's name is no longer reflective of its purpose, or implies more than it does, rename it.
All functions that do effectively identical things for compressers and picklers would be named or renamed according to a common theme, like get_default_compresser_name_mapping and get_default_pickler_name_mapping.
- It might be easier to let the public API for registries be handled through module reference, like compressers.get_default_extension_map() and picklers.get_default_extension_map().

Anyways, those are my thoughts currently. In the mean time, I'm trying to have as little impact on the API as possible, to err on the side of caution.

eode commented 2 years ago

Also -- for any given extension, would you rather the first-registered PicklerIO class be the handler, or the last-registered one?

Last-registered means that if the user registers a class for some extension, it becomes the default for that extension. First-registered means that whatever our choice of default is, it stays the default unless the user deregisters all picklers and then registers their own.

lucianopaz commented 2 years ago

Thanks @eode. The feature you are proposing seems very nice.

I'd like to add support for pickle extensions as a separate concept. So, for example, "foo.pkl.bz2" is gracefully handled, and "baz.json" is gracefully handled. I would be changing the default pickle_method arguments from "pickle" to "infer" -- but the behavior in the majority of circumstances would be the same.

What should we do for "foo.bz2"? At the moment, it assumes that pickle is used, but with what you propose, this would fail to infer the serialization method, and raise an error, right?

But overall, this issue is a little more touchy than just adding basic JSON support, because it could cause API breakage. I suppose the core of what I'm asking is:

What level of Public-facing API change are you willing to accept?
* I imagine changing the default `pickler_method` to 'infer' is fine.

I wouldn't change it yet. I would set it to some value that then issues a DeprecationWarning. That way, we can let users know that the argument's default will change in a future release.

* I imagine adding new functions is fine

Yes, this is fine

* I imagine the API of `dump`, `dumps`, `load`, and `loads` should remain substantially the same.

I agree

However, what about repurposing or renaming other functions? Example:
* `get_registered_extensions` should be repurposed or renamed, because its function is no longer fully clear if there are also pickler extensions

  * for example, `get_registered_extensions()` could:

    * leave it as-is, but that would make the name somewhat deceptive
    * return _all_ registered extensions, for both picklers and compressers, but that would require a different data format (because picklers will have more overlap in extension handling, so something like {"pkl": ["pickle", "dill", "cloudpickle"], "bz": ["bz2"]}
    * be renamed to `get_registered_compresser_extensions`
    * ..etc
What I would like to do with it is: If a function's output changes, rename it to explicitly break compatibility (rather than implicitly by having different output) If a function's name is no longer reflective of its purpose, or implies more than it does, rename it. All functions that do effectively identical things for compressers and picklers would be named or renamed according to a common theme, like get_default_compresser_name_mapping and get_default_pickler_name_mapping. It might be easier to let the public API for registries be handled through module reference, like compressers.get_default_extension_map() and picklers.get_default_extension_map().

I like the idea of keeping the names at the module level for compressers and picklers (by the way, since we've added json support, eventually we probably should also rename picklers to serializers?). I agree that we should also change the name of get_registered_extensions at the package level. We should probably change this line to

get_registered_extensions as get_registered_compression_extensions

And then add a function that keeps the same name at the package level, saying that importing get_registered_extensions from compress_pickle is deprecated, and it should be changed to from compress_pickle.compressers import get_registered_extensions

Anyways, those are my thoughts currently. In the mean time, I'm trying to have as little impact on the API as possible, to err on the side of caution.

In all, I really like your suggestions.

To also answer your other comment, I prefer that the first registered PicklerIO gets precedence. We could maybe add a function called set_default_extension_handler or something like that, to help users change the pickler without having to unregister everything.

eode commented 2 years ago

Thanks @eode. The feature you are proposing seems very nice.

Good to hear.

What should we do for "foo.bz2"? At the moment, it assumes that pickle is used, but with what you propose, this would fail to infer the serialization method, and raise an error, right?
I imagine changing the default pickler_method to 'infer' is fine.
I wouldn't change it yet. I would set it to some value that then issues a DeprecationWarning. That way, we can let users know that the argument's default will change in a future release.

So, both of these might be addressed by making "infer" the default argument, and "pickler" the default serializer. Thus, when no serializer is specified/found by extension, pickler would still be used as is currently done. The only difference would be that if someone were using "mydata.json.bz2" as a pickle data file, then that would instead be interpreted as JSON (and fail to load).

Although on the surface that might seem like an implausible scenario, the number of times I've seen someone's data files that ended up being called "blahblah.xls.csv" or something like that is in fact, nonzero, and greater than one.

IMO, the chance is slight enough and the fix easy enough that it's worth the small potential risk of breakage. However, it's your baby, and I definitely defer to you on this. As it is, I'll assume deprecation warning unless the above changes your mind.

I like the idea of keeping the names at the module level for compressers and picklers (by the way, since we've added json support, eventually we probably should also rename picklers to serializers?)

I agree -- and perhaps leave functions in picklers that match the original functions/methods, but would create deprecation warnings. That way, if someone imports the picklers module, it'll still work.

That said, the project is called "compress_pickle", and JSON is kindof just a bonus at this point. :-)

get_registered_extensions as get_registered_compression_extensions
And then add a function that keeps the same name at the package level, saying that importing get_registered_extensions from compress_pickle is deprecated, and it should be changed to from compress_pickle.compressers import get_registered_extensions

Definitely.

Anyways, those are my thoughts currently. In the mean time, I'm trying to have as little impact on the API as possible, to err on the side of caution.

In all, I really like your suggestions.

Thanks! Good to hear.

To also answer your other comment, I prefer that the first registered PicklerIO gets precedence. We could maybe add a function called set_default_extension_handler or something like that, to help users change the pickler without having to unregister everything.

Good idea.

Alright, I'll be poking around at this off and on, and I'll get a PR up and running at some point so you can track changes.

eode commented 2 years ago

Ok. So, looking into this pretty deeply and doing some coding, I'm not sure I'll be able to pull this off without some pretty deep intrusion into the existing codebase. Just a forewarning that it's not likely to be a small PR, and it's somewhat hard to break up into chunks.

eode commented 2 years ago

Are you still interested in a PR for this even if it has a pretty big impact on the codebase?

lucianopaz commented 2 years ago

Yes, definitely @eode

lucianopaz / compress_pickle

Separate extensions for Picklers #43