Open dimatura opened 9 years ago
Sounds like a great idea! @ihrke would you be willing to review/merge a PR?
One important issue is which "cons" would be acceptable. Right now I think using gzipped pickles is pretty painless, as it uses only the stdlib and accepts anything picklable. joblib is a close second: it can also store anything picklable, and it works much better for arrays. It does add a dependency, so I guess there could be some conditional import logic there. bloscpack is currently my favorite for arrays in terms of speed/storage, but it only works for arrays. Hickle (based on h5py) I wouldn't currently recommend, as it's a bit hackish, though the idea is nice.
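For reference, the gzipped-pickle option needs nothing beyond the stdlib. A minimal sketch follows; the helper names `save_gz_pickle` and `load_gz_pickle` are made up for illustration and are not ipycache functions.

```python
import gzip
import pickle


def save_gz_pickle(obj, path):
    """Write any picklable object to `path` as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gz_pickle(path):
    """Read back an object written by save_gz_pickle."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

Compression costs some CPU time on every save and load, which is part of the tradeoff against raw pickles.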
Nice idea. I can do the reviewing/merging.
How would we handle choosing the backend? We could either parse the provided filename, hand over an option to the cache-magic, or allow the user to set it globally for a notebook. Personally, I would prefer a combination of the last two options.
I agree that dependencies are an issue. Would be nice to keep the backends optional and fail with a graceful error in case of missing dependencies.
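Something along these lines could work for the optional backends. This is only a sketch: `require_backend` is a hypothetical helper, not existing ipycache code, and the error type and message are assumptions.

```python
import importlib


def require_backend(module_name):
    """Import an optional backend module, or fail with a readable error.

    `module_name` is the importable name of the backend (for example
    "joblib"); nothing is imported until the backend is requested.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"The '{module_name}' backend was requested but the module "
            "is not installed; install it or choose a different backend."
        ) from exc
```

Because the import happens lazily, users who stick to plain pickles never need the extra packages.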
@ihrke +1 for all of these ideas, + filename extension parsing as well and fallback to cell-wise/global option.
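A rough sketch of how that lookup might go. Everything here is an assumption for illustration: the extension table, the backend names, the `resolve_backend` helper, and the precedence (extension first, then the cell-wise option, then the global default).

```python
import os

# Hypothetical extension-to-backend table; these extensions are not an
# agreed convention, just placeholders for the sketch.
EXTENSION_BACKENDS = {
    ".pkl": "pickle",
    ".gz": "pickle+gzip",
    ".joblib": "joblib",
    ".blp": "bloscpack",
}

# Hypothetical notebook-wide default.
GLOBAL_DEFAULT_BACKEND = "pickle"


def resolve_backend(filename, cell_option=None,
                    global_default=GLOBAL_DEFAULT_BACKEND):
    """Pick a backend name for `filename` using the assumed precedence."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_BACKENDS:
        return EXTENSION_BACKENDS[ext]   # 1. filename extension
    if cell_option is not None:
        return cell_option               # 2. option passed to the cell magic
    return global_default                # 3. notebook-wide setting
```

Note that `os.path.splitext` only looks at the last suffix, so `data.pkl.gz` is matched on `.gz`.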
ok, so the hierarchy is:
LGTM
Yeah, that hierarchy looks good to me.
Just throwing this out there, but I have a package (calling it persist right now) that allows one to archive objects using hdf5 for arrays etc.
https://bitbucket.org/mforbes/persist
The idea is to convert objects to executable source code in an importable module, putting large arrays in hdf5 files etc. as needed. This has some significant advantages over pickles in that the persistent archives are less likely to go stale: even if the code changes, objects can be reloaded as long as the API is fixed, and if things do break, the archive can be edited by hand to fix them. It also allows one to archive things that can't be pickled (such as functions). As long as one can write source code to specify the object, it can be archived. One can define a custom representation by providing a single method get_persistent_rep().
I need to clean a few things up, but if this sounds useful, let me know and I will get it ready for release. It would be awesome to get this and issue #13 resolved so I can start using %%cache in a serious way.
Looks interesting. We could support it as an alternative backend (under the same constraints as the others, i.e., graceful fallback in case the module fails to import etc).
Let's wait for @dimatura's PR before extending it with the persist module (in case it's functional by then).
Do cloudpickle and dill fall into this same category for ipycache backends? Also, why not have the pickle protocol as one of the options to the cell magic?
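To illustrate the protocol idea with the stdlib only: `dump_with_protocol` is a hypothetical helper, and how the value would actually reach it from the cell magic is left open.

```python
import pickle


def dump_with_protocol(obj, protocol=None):
    """Serialize `obj` with a caller-chosen pickle protocol.

    `protocol=None` falls back to pickle.DEFAULT_PROTOCOL; pickle itself
    raises ValueError for a protocol number it does not support.
    """
    if protocol is None:
        protocol = pickle.DEFAULT_PROTOCOL
    return pickle.dumps(obj, protocol=protocol)
```

Lower protocols stay readable by older Python versions, while higher ones are usually more compact and faster for large objects.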
Hi! ipycache is great, but one issue I've run into is that raw pickles are slow and big, especially for large arrays. In the past I've tried a bunch of alternatives (pickle+gzip, hdf5, etc.), so I implemented a couple of these as alternative backends in ipycache here: https://github.com/dimatura/ipycache/tree/npyz. They all have tradeoffs, but I think something like this could be pretty useful overall. Any interest in a PR? I'd be willing to clean things up.