rossant / ipycache

Defines a %%cache cell magic in the IPython notebook to cache results of long-lasting computations in a persistent pickle file
BSD 3-Clause "New" or "Revised" License

Alternative caching backends #28

Open dimatura opened 9 years ago

dimatura commented 9 years ago

Hi! ipycache is great, but one issue I've run into is that raw pickles are slow and big, especially for large arrays. In the past I've tried a bunch of alternatives (pickle+gzip, hdf5, etc). So I implemented a couple of these as alternative backends in ipycache here: https://github.com/dimatura/ipycache/tree/npyz. They all have tradeoffs, but I think something like this could be pretty useful overall. Any interest in a PR? I'd be willing to clean things up.

rossant commented 9 years ago

Sounds like a great idea! @ihrke would you be willing to review/merge a PR?

dimatura commented 9 years ago

One important issue is what "cons" would be acceptable. Right now I think using gzipped pickles is pretty painless, as it uses only the stdlib and accepts anything picklable. joblib is a close second: it can also store anything picklable, but works much better for arrays. It does add a dependency, so I guess there could be some conditional import logic there. bloscpack is currently my favorite for arrays in terms of speed/storage, but it only works for arrays. Hickle (based on h5py) I wouldn't currently recommend, as it's a bit hackish, though the idea is nice.
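For reference, a gzipped-pickle backend needs only the standard library. The sketch below is a minimal illustration, not the code from the linked branch; the function names `save_gzip_pickle` and `load_gzip_pickle` are made up for this example.

```python
import gzip
import pickle


def save_gzip_pickle(obj, path, compresslevel=6):
    """Pickle `obj` and write it gzip-compressed to `path` (hypothetical helper)."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_gzip_pickle(path):
    """Read back an object written by save_gzip_pickle."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

A lower `compresslevel` trades file size for speed, which is the same tradeoff discussed above.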

ihrke commented 9 years ago

Nice idea. I can do the reviewing/merging. How would we handle choosing the backend? We could either parse the provided filename, hand over an option to the cache-magic or allow the user to set it globally for a notebook. Personally, I would prefer a combination of the last two options. I agree that dependencies are an issue. Would be nice to keep the backends optional and fail with a graceful error in case of missing dependencies.
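One way to keep backends optional is to import them lazily and turn an `ImportError` into a readable message. This is only a sketch of that idea; `BackendUnavailable` and `load_backend_module` are hypothetical names, not part of ipycache.

```python
import importlib


class BackendUnavailable(RuntimeError):
    """Raised when an optional cache backend cannot be imported (hypothetical)."""


def load_backend_module(name):
    """Import the module backing a cache backend, or fail with a clear error."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise BackendUnavailable(
            f"The cache backend {name!r} is not installed; "
            "install it or choose another backend."
        ) from exc
```

The import happens only when a backend is actually requested, so users who stick with plain pickles never need the extra packages.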

rossant commented 9 years ago

@ihrke +1 for all of these ideas, + filename extension parsing as well and fallback to cell-wise/global option.

ihrke commented 9 years ago

ok, so the hierarchy is:

  1. explicitly provided cell-wise option
  2. globally provided option
  3. filename parsing

Meaning that a cell-wise option beats everything else and filename-parsing is the last fallback?

rossant commented 9 years ago

LGTM

dimatura commented 9 years ago

Yeah, that hierarchy looks good to me.
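The agreed hierarchy could be sketched roughly as follows. The extension table, the backend names, and the function `resolve_backend` are all assumptions made for illustration; the thread does not fix any of them.

```python
import os

# Hypothetical mapping from filename extension to backend name.
EXTENSION_BACKENDS = {
    ".pkl": "pickle",
    ".gz": "gzip",
    ".blp": "bloscpack",
}


def resolve_backend(filename, cell_option=None, global_option=None,
                    default="pickle"):
    """Pick a backend: cell-wise option, then global option, then filename."""
    if cell_option is not None:
        return cell_option
    if global_option is not None:
        return global_option
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_BACKENDS.get(ext, default)
```

With this ordering, `resolve_backend("data.blp", global_option="gzip")` picks the global option, and the extension is consulted only when neither option is given.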

mforbes commented 9 years ago

Just throwing this out there, but I have a package (calling it persist right now) that allows one to archive objects using hdf5 for arrays etc.

https://bitbucket.org/mforbes/persist

The idea is to convert objects to executable source code in an importable module, putting large arrays in hdf5 files etc. as needed. This has some significant advantages over pickles in that the persistent archives are less likely to go stale (even if code changes, as long as the API is fixed, objects can be reloaded. Also, if things do break, the archive can be edited by hand to fix things). It also allows one to archive things that can't be pickled (such as functions). As long as one can write source code to specify the object, then it can be archived. One can define a custom representation by providing a single method get_persistent_rep().
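To make the "objects as executable source" idea concrete, here is a toy sketch. It does not use the persist package or its API; `archive_to_source` and `load_from_source` are hypothetical helpers, and the approach only works for values whose `repr` is valid Python (numbers, strings, lists, dicts and the like).

```python
def archive_to_source(objects):
    """Return Python source that recreates `objects` (a dict of name -> value)."""
    lines = [f"{name} = {value!r}" for name, value in objects.items()]
    return "\n".join(lines) + "\n"


def load_from_source(source):
    """Execute archived source and return the names it defines."""
    namespace = {}
    exec(source, namespace)
    # exec() injects __builtins__ into the namespace; drop it from the result.
    namespace.pop("__builtins__", None)
    return namespace
```

Because the archive is plain text, it can be inspected or edited by hand, which is the robustness argument made above. Only source from a trusted origin should be executed this way.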

I need to clean a few things up, but if this sounds useful, let me know and I will get it ready for release. It would be awesome to get this and issue #13 resolved so I can start using %%cache in a serious way.

ihrke commented 9 years ago

Looks interesting. We could support it as an alternative backend (under the same constraints as the others, i.e., graceful fallback in case the module fails to import etc). Let's wait for @dimatura 's PR before extending it with the persist module (in case it's functional by then).

den-run-ai commented 9 years ago

Do cloudpickle and dill fall into this same category for ipycache backends? Also, why not have the pickle protocol as one of the options to the cell magic?
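On the protocol question: the standard `pickle` module already takes a `protocol` argument, so exposing it would mostly be a matter of passing a value through. A minimal sketch, with `dump_with_protocol` as a hypothetical helper name:

```python
import pickle


def dump_with_protocol(obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Serialize `obj` to bytes with an explicit pickle protocol."""
    return pickle.dumps(obj, protocol=protocol)
```

`pickle.loads` detects the protocol automatically, so no option is needed on the loading side.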