richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Multi-Level storr? #86

Open AshesITR opened 6 years ago

AshesITR commented 6 years ago

Assume we have some data which lives on a slow network drive in a storr::storr_rds(), and we need to read it often, from multiple R sessions on the same computer.

In this scenario I thought about pulling the network-backed storr once and storing it on local disk:

library(storr)

s_remote <- storr_rds("/slow/network/share")
s_local <- storr_rds("/local/ssd")
s_local$import(s_remote) # copies everything over; slow, but only needed once
s_local$get(...)

Now each session has the default environment cache for speedup with multiple reads. Additionally, after calling import from one session, other sessions can read from a local SSD instead of a slow network share.

There are a few problems with this though, mainly staying in-sync with s_remote.
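To make the sync problem concrete, one option is to compare hashes key by key and see which local entries have drifted. This is only a minimal sketch: `stale_keys` is a hypothetical helper, not part of storr, and it assumes both stores use the default namespace and the same hash algorithm. It does not detect keys that were deleted on the remote side.

```r
library(storr)

# Hypothetical helper (not part of storr): return the keys whose local copy
# is missing or points at a different hash than the remote copy.
stale_keys <- function(remote, local) {
  keys <- remote$list()
  is_stale <- vapply(keys, function(k) {
    !local$exists(k) || local$get_hash(k) != remote$get_hash(k)
  }, logical(1))
  keys[is_stale]
}

# In-memory stores stand in for the network share and the local SSD.
remote <- storr_environment()
local <- storr_environment()
remote$set("a", 1)
remote$set("b", 2)
local$import(remote)

remote$set("a", 100)       # the remote changes after the import
stale_keys(remote, local)  # only "a" should be reported as stale
```

Even with such a check, every session still has to query the slow store once per key, which is part of why a built-in multi-level approach looks attractive.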

So what I thought could be useful would be some kind of multi-level caching, like

s <- storr_multi(master = storr_rds("/slow/network/share"), cache = storr_rds("/local/ssd"))

Now s$set would call both master$set and cache$set. s$get would retrieve the hash from master, check whether cache already has the object, and return it from there if found. Otherwise the object would be fetched with master$get and then written to cache for future readers.

This way re-reads of a file can be done from SSD once any R session has requested a particular key.
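The read-through behaviour described above can be sketched with existing storr methods. This is an illustration under assumptions, not a proposed implementation: `multi_set` and `multi_get` are hypothetical helpers, both stores are assumed to use the same hash algorithm and serialization, and temporary directories stand in for the network share and the SSD.

```r
library(storr)

# Hypothetical helpers, not part of storr.
multi_set <- function(master, cache, key, value) {
  master$set(key, value)  # master holds the key -> hash mapping and the object
  cache$set_value(value)  # cache stores the object by hash only, without a key
  invisible(NULL)
}

multi_get <- function(master, cache, key) {
  hash <- master$get_hash(key)     # master stays the source of truth for keys
  if (cache$exists_object(hash)) {
    return(cache$get_value(hash))  # fast path: object is already cached
  }
  value <- master$get_value(hash)  # slow path: read the object from master
  cache$set_value(value)           # populate the cache for future readers
  value
}

master <- storr_rds(tempfile("master_"))
cache <- storr_rds(tempfile("cache_"))

master$set("a", 1:10)          # written by some other process, cache is empty
multi_get(master, cache, "a")  # first read goes to master and fills the cache
multi_get(master, cache, "a")  # later reads are served from the cache
```

Note that every read still asks master for the hash, so the slow store is contacted once per get; only the (usually much larger) object transfer is avoided.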

I do notice that cache actually needs no keystore (just hash -> object), so maybe there is a better way for the same feature to become available?

If the idea is worth trying, I'd be happy to try and code a PR.

Another possible interface to this feature could be the use_cache parameter, which currently only enables or disables an environment cache. This could be expanded to include other caches. It must be possible to re-use a cache previously used by another R session, ideally even simultaneously.

richfitz commented 6 years ago

This was previously discussed in https://github.com/richfitz/storr/issues/11

There are a few ways of doing this and the trick is in the details. I would be very open to a PR that handles this gracefully.

I would encourage you to think about implementing this in the style of multistorr, which separates out key and data storage. This allows reuse of heaps of existing storr functionality without too much boilerplate.
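For readers unfamiliar with it, the key/data separation looks roughly like this. A small sketch assuming the `storr_multistorr(keys, data)` constructor available in current storr releases, with an environment driver for keys and an rds driver for data; the temporary path is a placeholder.

```r
library(storr)

path <- tempfile("data_")

# Keys are kept in memory, while the objects themselves go to disk as rds files.
st <- storr_multistorr(driver_environment(), driver_rds(path))
st$set("a", 1:10)
st$get("a")

# Opening the rds store directly shows the data but none of the keys.
rds <- storr_rds(path)
rds$list()         # no keys here, they live in the environment driver
rds$list_hashes()  # the stored object is visible by hash
```

A multi-level cache could follow the same pattern, with the slow store owning the keys and a fast store holding objects by hash.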

The other similar bit of code (in fact it might basically be enough) is driver_remote. This is partly completed work; the other half is in https://github.com/ben-gready/storr.remote/pull/2/files. However, there is some simple usage in the tests.

Apologies that neither of these features is well documented!

AshesITR commented 6 years ago

I've spun up some concept code (there is still a lot to do for documentation and testing). Do you mind checking out the code in my fork? I'd be interested in your feedback on design, naming etc. Maybe we need more features? Also, kudos for the good testing harness. I could easily verify that the basic stuff works :)

https://github.com/AshesITR/storr/commit/6c133f0e5767f6e8873cc280bc4c49cce87fa97d