esc opened this issue 10 years ago
I think the compressed-read and write are still partially broken as they trigger useless memory copies. I have not yet found the time to debug though.
As for the interface, I have not yet thought about how to make it pluggable. We probably need to evolve the serialization format to store metadata on the compression algorithm used to save the data (without introducing a mandatory dependency on bloscpack by default).
I think some initial notes are in:
https://github.com/joblib/joblib/blob/master/joblib/memory.py#L41
It might be worth changing the signature of Memory like so:
joblib.Memory(cachedir, mmap_mode=None, compress=False, verbose=1, codec='ZFile')
And similarly for joblib.dump. Also, codec could be a Python object with a specific interface instead of a string, and if left to None, a ZlibCodec instance would be created internally by joblib.
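A minimal sketch of how such a codec argument could be dispatched internally. The names (`ZlibCodec`, `resolve_codec`, the `encode`/`decode` interface) are hypothetical illustrations of the proposal, not joblib's actual API:

```python
import zlib


class ZlibCodec:
    """Hypothetical default codec implementing the proposed interface."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


def resolve_codec(codec=None):
    """Accept either a codec name or a codec object, per the discussion.

    If left to None, a ZlibCodec instance is created internally.
    """
    if codec is None or codec == 'zlib':
        return ZlibCodec()
    # Duck-typed check for the proposed encode/decode interface
    if hasattr(codec, 'encode') and hasattr(codec, 'decode'):
        return codec
    raise ValueError(f"Unknown codec: {codec!r}")
```

A string keeps the common case simple, while the object form opens the door to alternate serializers such as bloscpack without a hard dependency.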
For generalized pickling, I would also suggest to use:
https://github.com/hpk42/numpyson
But it still lacked compression, last time I checked.
@ogrisel yeah, codec would either be a string or a Python object with a well-defined interface.
For generalized pickling, I would also suggest to use:
dill would be very interesting, and shouldn't be too much work.
In terms of changing the API, there is some reflection to be had on how to open the door to many variants (to guarantee forward evolution) without making the interface too complex.
Also, I would recommend to keep the dependencies small and make all alternate serializers optional.
One idea might be to encapsulate mmap_mode, compress and codec into that object; that would, however, break backwards compatibility.
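For concreteness, the bundling idea might look something like the following dataclass. The name `StoreOptions` and its fields are purely illustrative, assumptions drawn from the signature discussed above, not anything joblib defines:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class StoreOptions:
    """Hypothetical object bundling the storage-related parameters.

    Passing one such object instead of three separate keyword arguments
    would keep the Memory/dump signatures stable as new options appear,
    at the cost of breaking the current keyword-based API.
    """
    mmap_mode: Optional[str] = None          # e.g. 'r', 'r+', 'c', or None
    compress: Union[bool, int] = False       # flag or compression level
    codec: Union[str, object, None] = None   # name or codec object
```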
Just to make sure we are on the same page, uncompressed save works via NPY format, right? And you avoided large memory copies with:
So after some late-night discussion with @GaelVaroquaux, the consensus seems to be to hand either a boolean, an integer, or an object with a given interface to compress. That way we can keep existing code working and add new functionality without having to break the public API.
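A rough sketch of how such a polymorphic compress argument could be normalized internally, assuming integers are zlib-style levels in [0, 9]. The helper name and the returned tuple are hypothetical, for illustration only:

```python
def normalize_compress(compress):
    """Map the proposed compress argument onto (enabled, level, codec).

    Accepts False/True (backwards compatible), an int compression level,
    or an object exposing encode/decode. Illustrative sketch, not
    joblib's actual implementation.
    """
    # Check literal booleans first: isinstance(True, int) is True in Python
    if compress is False:
        return (False, 0, None)
    if compress is True:
        return (True, 3, 'zlib')  # assumed default level
    if isinstance(compress, int):
        if not 0 <= compress <= 9:
            raise ValueError("compression level must be in [0, 9]")
        return (compress > 0, compress, 'zlib')
    if hasattr(compress, 'encode') and hasattr(compress, 'decode'):
        return (True, None, compress)
    raise TypeError(f"unsupported compress argument: {compress!r}")
```

Checking the boolean identities before `isinstance(compress, int)` matters, since `bool` is a subclass of `int` and `True` would otherwise be treated as level 1.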
I have started doing some profiling again (it is that time of year) and I can see that ZFile can still be fairly slow compared to other solutions.
When I spoke to @GaelVaroquaux last year at EuroScipy, I proposed to refactor Joblib to allow using alternate, optional serializers for Numpy arrays. In particular I would like to support:
https://pypi.python.org/pypi/bloscpack/0.7.1
First of all, I would like to enquire whether there have been any refactorings of the Numpy serialization code. I have been peeking at the issue tracker and saw that you guys have been hacking on it somewhat, but I can't get a good big picture.
Secondly I would like to initiate a discussion on how to best support alternate serializers. Perhaps you already have some interface design ideas?