cache parsed neuron data

natverse / rcatmaid

R package providing API access to the CATMAID web image annotation tool

https://natverse.github.io/rcatmaid

GNU General Public License v3.0

cache parsed neuron data #20

Open jefferis opened 9 years ago

jefferis commented 9 years ago

Not sure of a good strategy for this yet. One simple thing would be to hash the returned JSON and at least save ourselves the trouble of re-parsing.

as a little test, something like this:

read.neurons.catmaid(<42pnids>)

breaks down to about

20% GET
20% parse json to R list of lists
20% list2df (i.e. parse json list structures to data.frames)
40% parse data.frame to neuron

So it looks like this strategy could give a 3-5x speedup, which sounds interesting. But then the question is where we would do this. If we insert something in catmaid_fetch we could make something very general and save the JSON parsing. But if we worked with read.neuron.catmaid, we should be able to save everything.

Another strategy would be to cache the request itself – this could involve catmaid_fetch again and a hash of the url/post data along with some kind of timestamp checking.
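
To make the hashing idea concrete, here is a minimal sketch, not rcatmaid code. It keeps parsed results in an in-memory environment keyed by the md5 of the raw response text, so identical responses are only parsed once. The names `.parse_cache`, `md5_of_text` and `cached_parse` are hypothetical, and the parser is passed in as an argument so the sketch does not assume any particular JSON library.

```r
# Hypothetical in-memory cache: md5 of raw response text -> parsed result
.parse_cache <- new.env(parent = emptyenv())

md5_of_text <- function(txt) {
  # tools::md5sum() hashes files, so write the text to a temporary file first
  tf <- tempfile()
  on.exit(unlink(tf))
  writeLines(txt, tf, useBytes = TRUE)
  unname(tools::md5sum(tf))
}

cached_parse <- function(txt, parser) {
  key <- md5_of_text(txt)
  # Reuse the stored result when the same content has been parsed before
  if (exists(key, envir = .parse_cache, inherits = FALSE)) {
    return(get(key, envir = .parse_cache, inherits = FALSE))
  }
  res <- parser(txt)
  assign(key, res, envir = .parse_cache)
  res
}
```

The GET still happens every time with this approach; only the parsing steps are skipped when the content is unchanged.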

jefferis commented 9 years ago

some more thoughts on above. It seems to me that both types of caching

request caching
parsed result caching

would be interesting.

For the request caching, I would think one should basically create a directory hierarchy from a root directory specified by an option

options(catmaid.cache.root=TRUE)

TRUE => something like rappdirs::user_cache_dir("rpkg-catmaid")

that matches the request url e.g.

"rpkg-catmaid/<rooturl>/1/10418394/0/0/compact-skeleton"

underneath that there should be an rds object named by the md5 hash of the content (or perhaps the etag).

One could then imagine having a second option

options(catmaid.cache.expiry=3600)

which sets the cache expiry time in seconds.
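
A rough sketch of how those two options might be wired together, under stated assumptions: the helper names (`cache_root`, `cache_dir_for_url`, `cache_is_fresh`) are made up for illustration, the URL-to-path mapping is just one plausible choice, and a directory under `tempdir()` stands in for `rappdirs::user_cache_dir("rpkg-catmaid")` so that the example has no dependencies.

```r
cache_root <- function() {
  root <- getOption("catmaid.cache.root")
  if (is.null(root) || identical(root, FALSE)) return(NULL)  # caching disabled
  if (isTRUE(root)) {
    # Assumption: stand-in for rappdirs::user_cache_dir("rpkg-catmaid")
    return(file.path(tempdir(), "rpkg-catmaid"))
  }
  root  # otherwise treat the option as an explicit directory
}

cache_dir_for_url <- function(url, root = cache_root()) {
  stopifnot(!is.null(root))
  rel <- sub("^[a-z]+://", "", url)           # drop the scheme, keep host + path
  rel <- gsub("[^A-Za-z0-9._/-]", "_", rel)   # make the rest filesystem-safe
  file.path(root, rel)
}

cache_is_fresh <- function(path,
                           expiry = getOption("catmaid.cache.expiry", 3600)) {
  # A cached file counts as fresh if it is younger than `expiry` seconds
  file.exists(path) &&
    as.numeric(difftime(Sys.time(), file.mtime(path), units = "secs")) < expiry
}
```

For example, `cache_dir_for_url("https://example.org/1/10418394/0/0/compact-skeleton", root = "cache")` gives `"cache/example.org/1/10418394/0/0/compact-skeleton"`; the rds files would then live inside that directory.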

jefferis commented 9 years ago

For the parsed result caching something like md5 of raw contents as dir and then function name (match.call()[[1]]) as file name would be an option.
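
As an illustration only, that layout could look roughly like this. All function names here are hypothetical, `toupper()` stands in for the real parsing work, and the md5 of the raw contents is assumed to have been computed elsewhere and passed in.

```r
parsed_cache_path <- function(root, content_md5, fun_name) {
  # <root>/<md5 of raw contents>/<function name>.rds
  file.path(root, content_md5, paste0(fun_name, ".rds"))
}

with_parsed_cache <- function(root, content_md5, fun_name, compute) {
  path <- parsed_cache_path(root, content_md5, fun_name)
  if (file.exists(path)) return(readRDS(path))   # cache hit
  res <- compute()                               # cache miss: do the work
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(res, path)
  res
}

# Example parser that names its cache file after itself.
# Assumes it is called by its bare name (not as pkg::fun), so that
# match.call()[[1]] is a simple symbol.
parse_skeleton <- function(raw_txt, content_md5, root) {
  fun_name <- as.character(match.call()[[1]])
  with_parsed_cache(root, content_md5, fun_name, function() toupper(raw_txt))
}
```

Because the file name comes from the calling function, several parsers can cache different views of the same raw content side by side in one directory.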

jefferis commented 7 years ago

I have noticed a couple of options for this, but nothing looks perfect so far.

1. https://github.com/nealrichardson/httpcache provides drop-in GET/POST replacements for the httr equivalents. However, since we need to be able to change caching behaviour at runtime, we need some more control. Furthermore, we would need to use its cachedPOST function, because the regular POST function is assumed to change state.
2. https://rud.is/b/2017/08/22/caching-httr-requests-this-means-warc/

Option 1 has the big advantage of being on CRAN. Either option may need

some special logic in catmaid_fetch
an argument about whether caching should be allowed (set by the caller, since it is not always possible to infer the logic of the POST requests that the CATMAID API uses extensively).
option to control location of cache/invalidation etc.
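
A sketch of what such an opt-in argument could look like, independent of either package above and not the actual catmaid_fetch signature. `fetch_maybe_cached`, its `cache` argument and the `fetch_fun` callback are all hypothetical; the cache is a plain in-memory environment keyed on the URL plus the deparsed POST body, with the expiry read from the `catmaid.cache.expiry` option proposed earlier in the thread.

```r
# Hypothetical in-memory request cache: key -> list(value, time)
.request_cache <- new.env(parent = emptyenv())

fetch_maybe_cached <- function(url, body = NULL, fetch_fun, cache = FALSE,
                               expiry = getOption("catmaid.cache.expiry", 3600)) {
  # The caller decides whether this request is safe to cache
  if (!cache) return(fetch_fun(url, body))

  # Key on both the URL and the POST data, since many requests share a URL
  key <- paste(url, paste(deparse(body), collapse = ""), sep = "|")
  hit <- .request_cache[[key]]
  now <- Sys.time()
  if (!is.null(hit) &&
      as.numeric(difftime(now, hit$time, units = "secs")) < expiry) {
    return(hit$value)
  }

  value <- fetch_fun(url, body)
  assign(key, list(value = value, time = now), envir = .request_cache)
  value
}
```

Defaulting `cache` to FALSE keeps state-changing POST requests safe unless the calling function explicitly opts in.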

jefferis commented 5 years ago

See #119 for a cache / mock testing approach