m3dev / gokart

Gokart solves reproducibility, task dependencies, constraints of good code, and ease of use for Machine Learning Pipeline.
https://gokart.readthedocs.io/en/latest/
MIT License
318 stars 57 forks

Use `dill` instead of `pickle` for processing `.pkl` files #354

Closed maronuu closed 6 months ago

maronuu commented 8 months ago

Introduce the dill library as the serializer for all .pkl files, replacing pickle.

gokart has its own file processors for various file formats. For .pkl files, it has used the standard pickle library. However, pickle cannot handle a class or function whose metadata is determined dynamically at initialization.

For example, the following class replaces its own run method at initialization with a version wrapped by plus1. pickle cannot serialize an instance of such a class. We therefore introduce dill, which is built on pickle and can serialize a wider range of objects.

import functools
from typing import Callable


def plus1(func: Callable[[], int]) -> Callable[[], int]:
    @functools.wraps(func)
    def wrapped() -> int:
        ret = func()
        return ret + 1

    return wrapped

class A:
    run: Callable[[], int]

    def __init__(self) -> None:
        self.run = plus1(self.run)

    def run(self) -> int:
        return 1
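The failure can be reproduced with the standard library alone. The following is a minimal sketch, not code from gokart: the helper `try_pickle` is a hypothetical name introduced for illustration, and it only reports whether `pickle.dumps` succeeds.

```python
import functools
import pickle
from typing import Callable


def plus1(func: Callable[[], int]) -> Callable[[], int]:
    @functools.wraps(func)
    def wrapped() -> int:
        return func() + 1

    return wrapped


class A:
    run: Callable[[], int]

    def __init__(self) -> None:
        # Shadow the class-level method with a wrapped closure on the instance.
        self.run = plus1(self.run)

    def run(self) -> int:
        return 1


def try_pickle(obj: object) -> bool:
    """Return True if the standard pickle module can serialize obj (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError):
        return False
    return True


a = A()
print(a.run())        # 2: the wrapper adds 1 to the original result
print(try_pickle(a))  # False: the wrapped function is not the object found at A.run
```

pickle serializes functions by reference, that is, by module and qualified name. Because `functools.wraps` copies the name `A.run` onto the closure, the lookup resolves to the original class method rather than the wrapped function stored on the instance, so pickle refuses to serialize it.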

cloudpickle is another candidate, but we adopt dill because it has a longer history and more users. Note that any object that pickle can serialize can also be serialized by dill (https://dill.readthedocs.io/en/latest/#basic-usage).

Compatibility

dill is a drop-in replacement for pickle. Existing code can be updated to allow complete pickling using:

As the dill documentation states, objects that pickle can serialize can also be serialized by dill. Additionally, we confirmed that objects dumped by pickle can be loaded via dill.load.

Regarding storage size, we confirmed that objects serialized by pickle and by dill have the same size.
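A check along these lines could look like the sketch below. It assumes dill is installed (it is a third-party package), and both the payload and the helper name `load_pickle_bytes_with_dill` are hypothetical, not part of gokart or of the checks actually run for this PR.

```python
import pickle
from typing import Any, Optional

try:
    import dill  # third-party: pip install dill
except ImportError:  # keep the sketch runnable without dill
    dill = None


def load_pickle_bytes_with_dill(obj: Any) -> Optional[Any]:
    """Dump obj with pickle and load it back with dill.

    Returns None when dill is not installed (hypothetical helper).
    """
    if dill is None:
        return None
    return dill.loads(pickle.dumps(obj))


payload = {"name": "example", "values": [1, 2, 3]}  # hypothetical data
restored = load_pickle_bytes_with_dill(payload)
if restored is not None:
    # True when the pickle-written bytes round-trip through dill.
    print(restored == payload)
    # Serialized sizes from each library, for comparison.
    print(len(pickle.dumps(payload)), len(dill.dumps(payload)))
```

This covers only a small built-in payload; the objects gokart tasks actually store would need the same check.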

kitagry commented 8 months ago

@maronuu Thank you for the suggestion!

I have some questions.

maronuu commented 8 months ago

@kitagry Thank you for the comment! I added some notes about the storage usage and compatibility in the PR description.

kitagry commented 6 months ago

Thank you!