richfitz / storr

:package: Object cacher for R
http://richfitz.github.io/storr
Other
116 stars 10 forks source link

slugs 🐌 #37

Open jennybc opened 7 years ago

jennybc commented 7 years ago

Related to a conversation in httrmock, which uses storrr to store recorded HTTP requests (https://github.com/gaborcsardi/httrmock/issues/3).

Have you ever contemplated a naming scheme that would allow the incorporation of an optional slug? Perhaps as a prefix.

cc @gaborcsardi

gaborcsardi commented 7 years ago

TBH I am not sure if this fits into storr, because we are essentially using it as a key value DB, and don't do queries on the data.

So the RDS backend is great (and storr is great in general!), but since we add the storage to the git repo, it would be nice to see which unhashed keys the individual storage files belong to.

I guess this would be another storr backend, no?

gaborcsardi commented 7 years ago

Or maybe just a prefix slug, as @jennybc suggested?

richfitz commented 7 years ago

I'm afraid I don't understand the issue well, or what the snail :snail: slug bits actually mean. But from looking quickly at the httrmock code, it's possible that the namespace functionality of storr could do this?

The idea is that one can do this:

st <- storr::storr_environment()
st$set("foo", "value1")
st$get("foo") # value1

but if you have many foo keys and want to keep them separated then you can use namespaces:

st$set("foo", "value2", namespace = "other")
st$get("foo") # value1
st$get("foo", "other") # value2

The default namespace is called objects, though this can be tweaked on storr creation with the default_namespace option. Practically, for the RDS storr, this keeps key/value lookups in different directories, but for other drivers it uses a prefix which is why I am thinking this might be appropriate here.

Let me know if I'm off the mark

gaborcsardi commented 7 years ago

Sorry for not being clear! I'll make another attempt.

Basically, my issue is that with the RDS store it is hard to follow which objects are changing in git, because the file names are just hashes.

In gh, which uses httrmock we store HTTP requests and responses, so that we can replay them at test time. We store them in an RDS storr, The key is usually just list(url = url, method = method), sometimes more. We put the RDS files in git, and also in the R package, so that the user is able to run the (mock) test cases. They look like this in gh:

tests/testthat/httrmock/data/0d2b78a225570cb97dda60ccf22d68f1.rds
tests/testthat/httrmock/data/27b277b1524f9894ba81145304f0f866.rds
tests/testthat/httrmock/data/b1d243dd06539ab9e51cafd39249c41e.rds
tests/testthat/httrmock/data/ba9158074556da2ddc5ccf596e31aa3b.rds
...
tests/testthat/httrmock/keys/objects/0ef0ee2f98a95501032c8dfbd9dc1461
tests/testthat/httrmock/keys/objects/14673dc1fde22d49bc845244b8b4bbfd
tests/testthat/httrmock/keys/objects/39ae4decc7dfce0e35feef7c547904d5
...

This is all nice, but when we develop the package and change/update the test cases, it is really hard to say which stored HTTP requests changed, because the filename is just a hash.

So, yes, namespaces are great! We can set the namespace of a request to be its URL, and that will make a nice grouping of the hundreds of gh tests that we'll have (if we eventually write them). It will also make it much clearer which objects change in git.

gaborcsardi commented 7 years ago

Is this all what we needed, @jennybc ?

richfitz commented 7 years ago

Hmm. Part of my personal use-case that inspired storr was to get de-duplicated storage and avoid hitting an external source by separating the key <-> value mapping into a key <-> hash <-> value mapping. I can see that's probably causing some pain here.

With the current driver design I can (probably) implement a direct key -> value mapping, which would probably make your use case simpler; let me know if that would be useful and I can try and get it done before the next CRAN release.

Another function you probably want to be aware of is the gc method. If you call that as your tests finish up then it will look for elements in data with no key pointing at them and delete them. It will invisibly return a vector of the deleted keys. That will be a good idea to run periodically or the database will grow without bound (this is a side effect of another design decision which was to make it relatively safe for multiple processes to read and write from the db at once)

gaborcsardi commented 7 years ago

@richfitz Thanks! Let us try the namespacing and we'll see if it indeed solves our problem.

jennybc commented 7 years ago

I am in a suboptimal time zone for this! But, yes, if @gaborcsardi thinks using the existing namespace functionality will work, that will be great. Since I assume httrmock will keep using RDS, then this will mean lots of directories? But that seems fine.

If I follow, we could end up using a new httrmock context and a new storr namespace for each test. In order to get the necessary control over request matching and to get human-readable names, respectively.

@gaborcsardi When I zoom out, it feels like there are two independent key-value systems being set up for the same set of objects, doesn't it? Maybe this is unavoidable, full stop. Or unavoidable if you want to take advantage of all the work @richfitz has already done in storr and not reinvent that particular wheel. Do I understand this right? The only way to simplify is to somehow make the object storage system build its keys based on the HTTP request.

gaborcsardi commented 7 years ago

Since I assume httrmock will keep using RDS, then this will mean lots of directories?

Yes.

If I follow, we could end up using a new httrmock context and a new storr namespace for each test. In order to get the necessary control over request matching and to get human-readable names, respectively.

I guess httrmock can just use the storr namespaces automatically, no human interaction is needed there. (I mean, I'll have to implement this.)

And we only need httrmock namespaces if HTTP method + URL do not determine the request+response completely. Also, I think I'll add the POST data and the query arguments to the key, and then explicit httrmock contexts will be rarely needed.

@gaborcsardi When I zoom out, it feels like there are two independent key-value systems being set up for the same set of objects, doesn't it? Maybe this is unavoidable, full stop. Or unavoidable if you want to take advantage of all the work @richfitz has already done in storr and not reinvent that particular wheel. Do I understand this right? The only way to simplify is to somehow make the object storage system build its keys based on the HTTP request.

Good points. Yes, I guess we could just give the (sanitized) request to storr and it would take care of everything, right?

Sorry, this is all a bit experimental...OTH I am excited to create sg that we use to test HTTP APIs properly.

jennybc commented 7 years ago

Sorry, this is all a bit experimental...OTH I am excited to create sg that we use to test HTTP APIs properly.

No need to apologize! It is pleasing to work through this. I think gh can be a pilot study for how this might work for other APIs.

jennybc commented 7 years ago

Just to make this discussion more concrete, here's my current git status in gh:

screen shot 2016-12-13 at 11 11 15 am

I have no clue which endpoints or tests or whatever all of this corresponds to. And it's not particularly easy to find out and decide to delete or stage and commit. This is the problem we're trying to solve.

gaborcsardi commented 7 years ago

Well, to be fair there is httrmock::list_recordings()..... but yes, we'll use namespaces....

jennybc commented 7 years ago

httrmock::list_recordings() is my dear friend, yes 😀.

richfitz commented 7 years ago

It might be easiest to use one opaque object, perhaps? Something like the DBI/SQLite storage that the development version of storr has (check out the develop branch, use storr::storr_dbi() with a RSQLIte connection object). Then from the point of view of the end user there is only a single object, though it will change pretty much every run there is a write.

gaborcsardi commented 7 years ago

it will change pretty much every run there is a write.

That's the thing, this is not ideal, you don't have a clue what is changing.

gaborcsardi commented 7 years ago

Namespaces are an improvement, but still not the real thing: https://github.com/gaborcsardi/httrmock/issues/4#issuecomment-266875686

richfitz commented 7 years ago

Thinking about this some more; if the major issue is getting things stored nicely in git, then it would be possible to use git itself as the object store (see #1) but this is not implemented because I thought it was a bit insane

gaborcsardi commented 7 years ago

I thought about a git backend as well, but a generic git backend for storr would look probably quite different than what we need here. A generic backend would probably use a separate git repo, and do commit automatically, but we don't want these.

richfitz commented 7 years ago

I was thinking of exploiting the git object store directly. But I suspect that what you need is simpler than this.

If it's useful I can get the direct key -> value mapping bits working; it should not take too long. That would basically do:

db$set(key, value, namespace) => saveRDS(value, file.path(root, namespace, key))
db$get(key, namespace) => readRDS(file.path(root, namespace, key))
db$list(namespace) => dir(file.path(root, namespace))

Or if you can implement your own in a very few lines of code

richfitz commented 7 years ago

@gaborcsardi; can you check out the direct branch of storr, and see if that helps you here. To use, run

storr::storr_rds(path, hash_algorithm = "NOHASH")

It will appear to work the same way as the storr object you have been using but drops the key -> hash -> value separation (just doing key -> value). Namespaces are still supported, as are all the methods you have been using.

This is experimental (as in, I wrote it on the train on the way in and have not fully tested it) but if it looks useful I can tidy it up.

gaborcsardi commented 7 years ago

@richfitz Thanks!

This is a big step towards our use case, but it is still not there unfortunately, because we often a different key and filename. This is simply because the key might contain a bunch of HTTP headers, or even the (hash of) the body of an HTTP POST request, but we obviously don't want these to show up in the filename. Sorry.

richfitz commented 7 years ago

I still not sure I follow all the constraints of your problem, but if you think that it'd be a generally useful thing I'd welcome a PR (or more details). It sounds like you need a level of indirection there, but I'm not sure I can guess how best to balance the set of competing requirements 😄

gaborcsardi commented 7 years ago

OK, sorry, I don't expect you to squeeze our use case in storr. :) I might create a PR, if it seems fitting, though.

richfitz commented 7 years ago

Nothing to apologise for - I just can't see my way through to the problems that you can see. There are heaps of trade-offs in designing these things, and at present storr models my needs closely. To be generally useful I'm interested in the problems that I can't guess at