joblib / joblib

Computing with Python functions.
http://joblib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

Consider offering Bloscpack as an alternative to ZFile #153

Open esc opened 10 years ago

esc commented 10 years ago

I have started doing some profiling again (it is that time of year) and I can see that ZFile can still be fairly slow compared to other solutions.

When I spoke to @GaelVaroquaux last year at EuroScipy, I proposed to refactor Joblib to allow using alternate, optional serializers for Numpy arrays. In particular I would like to support:

https://pypi.python.org/pypi/bloscpack/0.7.1

First of all, I would like to enquire whether there have been any refactorings of the Numpy serialization code. I have been peeking at the issue tracker and saw that you guys have been hacking on it somewhat, but I can't get a good big picture.

Secondly I would like to initiate a discussion on how to best support alternate serializers. Perhaps you already have some interface design ideas?

ogrisel commented 10 years ago

I think compressed read and write are still partially broken, as they trigger useless memory copies. I have not yet found the time to debug this, though.

ogrisel commented 10 years ago

As for the interface, I have not yet thought about how to make it pluggable. We probably need to evolve the serialization format to store metadata on the compression algorithm used to save the data (without introducing a mandatory dependency on bloscpack by default).
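
To make the metadata idea concrete, here is a minimal sketch of a container that records which codec was used ahead of the payload. This is not joblib's actual on-disk format: the `MAGIC` marker, the `CODECS` registry and the `write_blob` / `read_blob` helpers are all hypothetical names made up for illustration, and only the standard-library `zlib` is registered so that nothing depends on bloscpack.

```python
import io
import struct
import zlib

MAGIC = b"JLBX"  # hypothetical marker, NOT joblib's real file signature

# Hypothetical registry: codec name -> (compress, decompress).
# An optional codec would only be added here when its package is installed.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
}


def write_blob(fileobj, payload, codec="zlib"):
    """Write MAGIC, the codec name (length-prefixed), then the compressed payload."""
    compress, _ = CODECS[codec]
    name = codec.encode("ascii")
    fileobj.write(MAGIC)
    fileobj.write(struct.pack("<B", len(name)))  # one byte for the name length
    fileobj.write(name)
    fileobj.write(compress(payload))


def read_blob(fileobj):
    """Read the header, look up the codec it names, and decompress the rest."""
    if fileobj.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a recognised file")
    (name_len,) = struct.unpack("<B", fileobj.read(1))
    codec = fileobj.read(name_len).decode("ascii")
    if codec not in CODECS:
        raise ValueError(f"file was written with unavailable codec {codec!r}")
    _, decompress = CODECS[codec]
    return decompress(fileobj.read())


# Round trip through an in-memory file.
buffer = io.BytesIO()
write_blob(buffer, b"some array bytes" * 10)
buffer.seek(0)
restored = read_blob(buffer)
```

Because the codec name travels with the data, a reader without the optional codec can fail with a clear message instead of misreading the payload.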

esc commented 10 years ago

I think some initial notes are in:

https://github.com/joblib/joblib/blob/master/joblib/memory.py#L41

esc commented 10 years ago

It might be worth changing the signature of Memory like so:

joblib.Memory(cachedir, mmap_mode=None, compress=False, verbose=1, codec='ZFile')

ogrisel commented 10 years ago

And similarly for joblib.dump. Also, codec could be a Python object with a specific interface instead of a string, and if left to None, then a ZlibCodec instance would be created internally by joblib.
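
A rough sketch of what such a codec object could look like, assuming the interface is just a `name` attribute plus `compress` and `decompress` methods working on bytes. None of this exists in joblib: `ZlibCodec` is only the name floated above, and `dump_bytes` / `load_bytes` are hypothetical stand-ins for the real dump and load paths.

```python
import zlib


class ZlibCodec:
    """Hypothetical default codec wrapping the standard-library zlib."""

    name = "zlib"

    def __init__(self, level=3):
        self.level = level

    def compress(self, data):
        return zlib.compress(data, self.level)

    def decompress(self, data):
        return zlib.decompress(data)


def dump_bytes(payload, codec=None):
    """Compress payload, creating the default codec when none is given."""
    if codec is None:
        codec = ZlibCodec()
    return codec.name, codec.compress(payload)


def load_bytes(blob, codec=None):
    """Inverse of dump_bytes; the same codec must be supplied."""
    if codec is None:
        codec = ZlibCodec()
    return codec.decompress(blob)


codec_name, blob = dump_bytes(b"0123456789" * 100)
original = load_bytes(blob)
```

Any third-party object exposing the same three members could then be passed as `codec` without joblib importing that package itself.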

esc commented 10 years ago

For generalized pickling, I would also suggest to use:

https://github.com/hpk42/numpyson

But it still lacks compression, last time I checked:

http://nbviewer.ipython.org/urls/gist.githubusercontent.com/esc/144c802f50df0446294c/raw/c18c247419eabf9bc0aa9b4a2a767c320bf2a5b6/ipynb?create=1

esc commented 10 years ago

@ogrisel yeah, codec would either be a string or a Python object with a well-defined interface.

GaelVaroquaux commented 10 years ago

For generalized pickling, I would also suggest to use:

dill would be very interesting, and shouldn't be too much work.

In terms of changing the API, there is some reflection to be had on how to open the door to many variants (to guarantee forward evolution) without making the interface too complex.

esc commented 10 years ago

Also, I would recommend to keep the dependencies small and make all alternate serializers optional.
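
One way to keep them optional, sketched here under the assumption that a codec is selected by name: probe for the package and only advertise the codec when it can be imported. The helper names `available_codecs` and `require_codec` are hypothetical, and the sketch deliberately does not call into bloscpack itself.

```python
import importlib.util


def available_codecs():
    """List codec names usable in this environment."""
    codecs = ["zlib"]  # standard library, always present
    # find_spec returns None when the top-level package is not installed,
    # so nothing is imported unless it is actually there.
    if importlib.util.find_spec("bloscpack") is not None:
        codecs.append("bloscpack")
    return codecs


def require_codec(name):
    """Return name if usable, otherwise fail with an actionable message."""
    if name not in available_codecs():
        raise ImportError(
            f"codec {name!r} is not available; install the matching "
            "optional package to use it"
        )
    return name


default_codec = require_codec("zlib")
```

The base install then keeps working with zlib alone, and the error only surfaces for users who explicitly ask for a codec they have not installed.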

esc commented 10 years ago

One idea might be to encapsulate mmap_mode, compress and codec into that object; that would, however, break backwards compatibility.

esc commented 10 years ago

Just to make sure we are on the same page, uncompressed save works via NPY format, right? And you avoided large memory copies with:

https://github.com/numpy/numpy/pull/4077

esc commented 10 years ago

So after some late-night discussion with @GaelVaroquaux, the consensus seems to be to pass a boolean, an integer, or an object with a given interface to compress. That way we can keep existing code working and add new functionality without having to break the public API.
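
A sketch of how that dispatch on compress might look. The `resolve_compress` helper is hypothetical, and so are the assumptions that `True` maps to zlib level 3, that integers are zlib levels 0 to 9, and that the object interface is a pair of `compress` / `decompress` methods.

```python
DEFAULT_LEVEL = 3  # assumed level for compress=True


def resolve_compress(compress=False):
    """Normalise compress into a (kind, value) pair.

    Returns ("none", None), ("zlib", level) or ("object", codec).
    """
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(compress, bool):
        return ("zlib", DEFAULT_LEVEL) if compress else ("none", None)
    if isinstance(compress, int):
        if not 0 <= compress <= 9:
            raise ValueError("compression level must be between 0 and 9")
        return ("zlib", compress) if compress else ("none", None)
    # Anything else must look like a codec object.
    has_compress = callable(getattr(compress, "compress", None))
    has_decompress = callable(getattr(compress, "decompress", None))
    if has_compress and has_decompress:
        return ("object", compress)
    raise TypeError(
        "compress must be a bool, an int, or an object with "
        "compress() and decompress() methods"
    )


old_style = resolve_compress(True)  # existing calls keep working
explicit_level = resolve_compress(7)
```

Existing calls with `compress=True` or an integer level go through unchanged, while a codec object takes the new path.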