richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr

Cascading caching backends? #11

Closed russellpierce closed 6 years ago

russellpierce commented 8 years ago

I'm looking to contribute to a project that might grow towards cascading caching layers: e.g. ask the local disk; if the local disk doesn't have it, look to a Redis cache; if the Redis cache doesn't have it, look to AWS S3; and so on. I also like the memoization features of https://github.com/HenrikBengtsson/R.cache and would like to see them (eventually) present here too. I see that you are working on https://ropensci.org/; is this package going to be a part of that and acquire an open source license?
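To make the lookup order concrete, here is a rough, untested sketch; it isn't tied to any particular package, it just assumes each layer object exposes exists/get/set methods (the layer objects themselves are hypothetical):

```r
## Rough sketch of a cascading read: check the fastest layer first and
## fall back to slower ones.  `layers` is a list of cache objects ordered
## fastest-to-slowest (e.g. local disk, Redis, S3), each assumed to
## expose exists()/get()/set() methods.
cascading_get <- function(layers, key) {
  for (i in seq_along(layers)) {
    if (layers[[i]]$exists(key)) {
      value <- layers[[i]]$get(key)
      ## back-fill the faster layers that missed
      for (j in seq_len(i - 1L)) {
        layers[[j]]$set(key, value)
      }
      return(value)
    }
  }
  stop("key '", key, "' not found in any cache layer")
}
```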

richfitz commented 8 years ago

This is done through the "external" driver at present. No real documentation to speak of, but there's an example in the tests, and we use it here to store things via github releases (caching to disk).

Memoisation is pretty easy to build on top of this; see for example here. Once things settle down (see below) I'll probably add a proper interface. I have held off though because it seems like fairly well trodden ground.
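Something along these lines is all it takes (untested sketch; it hashes the call arguments with digest to make a key, and uses an environment-backed storr only because that's the simplest thing to demo with):

```r
## Untested sketch: memoise a function on top of a storr object by
## hashing the call arguments into a key.
memoise_storr <- function(f, st) {
  force(f)
  function(...) {
    key <- digest::digest(list(...))
    if (st$exists(key)) {
      st$get(key)
    } else {
      value <- f(...)
      st$set(key, value)
      value
    }
  }
}

st <- storr::storr_environment()
slow_sqrt <- memoise_storr(function(x) { Sys.sleep(1); sqrt(x) }, st)
slow_sqrt(2)  # computed (slow)
slow_sqrt(2)  # fetched from the store (fast)
```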

I've not tested cascading this past a lookup/storage driver as doing a 3 step cascade would require write access to the intermediate storage.

The package has an open source licence already; it is BSD 2 clause licenced. This package will probably end up on ropensci eventually yes, as I use it in a number of other packages that are likely to end up there.

Be warned; I'm in the process of refactoring this package to get things off to CRAN, to make it easier to add things into it, and to separate off some of the github stuff. Timeline is the next ~2 weeks, I hope.

richfitz commented 8 years ago

OK, I've done a first pass at cleaning the package up. Documentation for building drivers is not really there because I think that the internals might get overhauled more completely (I've started on that offline to see where it goes).

What would be useful is to know:

The current implementation is complicated somewhat by my desire to have indexable serialisation (i.e., serialise a list and access it in pieces rather than just all at once). See here for details. But that could be made opt-in and then the interface simplifies.

The list of things that backends need to provide would still be quite large:

russellpierce commented 8 years ago

Short Version:

The only back ends I'm not noticing in your current docs are a) an AWS S3 cache and b) a remote machine's storage. For both of these, I think get/set/del/exists is a providable feature set in principle. I don't see any problem using a digest-derived hash for creating keys/files in either use case.

I think a cache on a remote machine's storage is probably trivial with what you already have. In addition, for reasons I describe ad nauseam below, I'm unable in the short to medium term to provide an S3 caching engine that could be supported on CRAN.

Provided a layer has the basics you describe (get/set/del/exists), a basic cascading cache should be possible using storr caches without needing to change anything integral to storr. Therefore, I don't think there is any compelling reason to embed the cascading cache inside an existing caching package. Instead, a cascading cache could simply suggest and wrap external projects. Most of the features I imagine can be handled via metadata at the point of insertion into and retrieval from the cache and therefore don't need additional engine features per se - but if they are handled by the engine, then it makes sense to expose them.

I probably shouldn't try to shoehorn this ill-defined thing into your well-oiled caching engine; I should just wrap what you already have and submit pulls here if I find an engine lacks a feature I need/want it to have. Given all of that, it is probably entirely reasonable to close this issue.

Long Rambling Version: My desire to have an AWS S3 backend is complicated by my unawareness of any CRAN package supporting AWS. Looking on github I can see that there has been some activity in that direction since the last time I looked. Unfortunately, I'm already pretty deeply buried in use-case-specific ad hoc stuff I've written leveraging rPython. However, I doubt it will ever be ready for CRAN. It stomps all over the python namespace as if it is the only thing living there. I could fix that... or ignore it if I switch over to rJython. However, there are very few reverse dependencies on those two packages, leaving me to think getting them approved on CRAN would be an uphill battle. The most obvious barrier I see to a clean user experience there is getting the user to install boto. In short, if I were to write a backend to S3 in the next year it would almost certainly be ugly and never pass muster with CRAN. That being said, I might be able to share it on github.

There are a ton of different ways to handle a remote machine's storage. The first solution that jumps out at me as being able to support indexable serialization would be connecting to a running single-slave-node SNOW cluster and having the result passed back from the SNOW slave node. There might be some advantage there in that the SNOW slave node is then responsible for uncompressing the cache file and returning the result, so that it is already serialized in RAM for passing back to the requesting master. The second, and probably more practical, solution is just mapping the remote machine's cache directory onto the localhost and using it as just another disk cache.
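For that second option there is almost nothing to write; something like the following should work (the mount point is made up):

```r
## If the remote machine's cache directory is mounted locally (sshfs,
## NFS, etc.), it can be used as just another disk-backed storr.
st_remote <- storr::storr_rds("/mnt/remote-cache")
st_remote$set("mykey", mtcars)
identical(st_remote$get("mykey"), mtcars)
```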

I'd already started writing a cascading cache in spaghetti functions before I stumbled across storr or really gave a hard look at R.cache. The features I have working right now in spaghetti functions are:

  1. On set, creating a limit on time to live for the cache (not allowing the result to be fetched if past a certain staleness threshold)
  2. On put, setting the staleness threshold on cache read
  3. Being choosy about which layers to check on get
  4. Being choosy about which layers to push to on set
  5. Backpropagating to faster caches

What I still see as missing features are the ability to:

The above feature set goes beyond your get/set/del/exists support and would require metadata regarding the item's time of storage, time until stale (specified on set), time of most recent access, and data size (as stored). However, it seems like a lot of extra functionality to ask from a cache, most of it in the realm of metadata.

I started poking at the metadata problem here and there. Some of those items can simply be put as attributes on the stored item, but that requires fetching the full item before you discover that it is stale. Thus some sort of legitimate metadata layer makes sense. For obvious reasons, e.g. key expiration is already handled, Redis looks great as a metadata layer. On the other hand, making a round trip to a Redis server to get metadata (or requiring that one be installed) is a bit on the heavy/slow side. rrlite seems like a potential answer, but it reads like the Windows support issue is an open question (https://github.com/seppo0010/rlite/issues/11). That leaves SQLite, but I'm a bit less excited about that for no reason in particular.
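A lighter-weight possibility I've been toying with is to keep a small metadata record next to each value; untested sketch below, and it assumes the store supports per-call namespaces the way storr does, so staleness can be checked without pulling the full item:

```r
## Untested sketch: keep a small metadata record in a separate storr
## namespace so staleness can be checked without fetching the value.
set_with_ttl <- function(st, key, value, ttl_secs) {
  st$set(key, value)
  st$set(key, list(stored_at = Sys.time(), ttl = ttl_secs),
         namespace = "meta")
  invisible(value)
}

get_if_fresh <- function(st, key) {
  meta <- st$get(key, namespace = "meta")
  age <- as.numeric(difftime(Sys.time(), meta$stored_at, units = "secs"))
  if (age > meta$ttl) {
    stop("cached value for '", key, "' is stale")
  }
  st$get(key)
}
```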

All of that leads back to the short version. I probably shouldn't try to shoehorn this ill-defined thing into your well-oiled caching engine; I should just wrap what you already have and submit pulls here if I find an engine lacks a feature I need/want it to have.

richfitz commented 8 years ago

TTL is definitely a nice-to-have (see #5; I'm open to ideas about implementing that). Storage space is much harder because there's a whole set of logic around that (it might be better to implement it at the driver level for cases where that's important).

For Amazon S3, there are a few options but none seem very good:

russellpierce commented 8 years ago

I agree, I don't see a way around doing storage space at a driver level.

Re S3: AWS.tools and https://github.com/armstrtw/Rawscli (the follow-on) both proceed via system calls to Amazon's command line tool(s), except Amazon keeps changing the interface on those pretty much on a whim. So a long-lived R solution probably should (IMO) depend on a supported Python SDK. In particular, boto3/botocore (relative to the hand-coded boto) looks ripe for the type of procedural generation stuff you've done with the Redis API. In addition, there is a forthcoming, but not yet stable, C++ SDK which seems a preferable build target relative to the community-provided C++ library being leveraged by RS3.

richfitz commented 8 years ago

If you're still interested in writing an AWS backend, please see the new vignette: http://richfitz.github.io/storr/vignettes/drivers.html

This walks through creating a new driver, which is much easier than it was before. There's also an automatic test suite to reduce guesswork.

The new version is currently on the "refactor" branch as I check to see if I broke anything across various projects I have that depend on it. Aside from additional documentation I'm thinking this is pretty much good to go to CRAN now.

wlandau commented 7 years ago

What about AWS S3 support in the memoise package?

richfitz commented 7 years ago

An AWS backend could be done, for sure - it would make most sense to use the package aws.s3 directly (storr drivers need way more than what memoise needs). That would probably be best in another package (storr.aws perhaps).

Actually cascading things is another matter entirely; there's quite a bit of work needed to work out how things should cascade down. I can imagine a "caching" mode where you put one backend in front of another, but how do you keep changes propagating forward through the backends...
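The simplest version I can picture is plain write-through, where every set (and del) hits every layer; a sketch only, assuming storr-like layer objects:

```r
## Sketch of write-through cascading: writes and deletions hit every
## layer, so a read can safely stop at the first layer that has the key.
cascading_set <- function(layers, key, value) {
  for (layer in layers) {
    layer$set(key, value)
  }
  invisible(value)
}

cascading_del <- function(layers, key) {
  ## deletion must propagate too, or a stale copy lingers downstream
  for (layer in layers) {
    layer$del(key)
  }
  invisible(NULL)
}
```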

russellpierce commented 7 years ago

With no offense intended to the author of aws.s3, it is still very nascent compared to the basic offerings in Python or Java. In particular, the last time I looked, very large file support was poor, as was support for IAM creds. Personally I use Python's boto3 via package:reticulate; package:awsjavasdk [which I authored] or package:AWR would also be workable.

wlandau-lilly commented 7 years ago

Also, it has been a while since this thread revisited a possible rOpenSci submission. Any new thoughts?

russellpierce commented 7 years ago

I've made some proof-of-concept stuff with reticulate and mostly addressed the issues that forking causes for it. So S3 access outside of aws.s3 is conceptually within reach. I'd need to do a clean-room rewrite into a public repo if I were to do it. As for storr integration, I haven't been keeping track, but if I were to do it, it would come after all of the above. For what it's worth, https://github.com/HenrikBengtsson/R.cache has the advantage of already being on CRAN.

russellpierce commented 7 years ago

I'm interested to hear where @richfitz is on this. I know I was very glad to see redux hit CRAN; I really enjoy his packages.

richfitz commented 6 years ago

@russellpierce - storr is on CRAN, though I think that real cascading will still require some thinking.

russellpierce commented 6 years ago

We can probably close this issue. If we were to do cascading I'd probably see it as a package sitting on top of this one rather than increasing the complexity here.