uchicago-library / digcoll_retriever

A retriever meant to allow API access to the contents of owncloud for arbitrary collection or exhibition interfaces
GNU General Public License v3.0

investigate caching solutions #7

Closed verbalhanglider closed 6 years ago

verbalhanglider commented 7 years ago

Starter questions

bnbalsamo commented 7 years ago

Generally speaking my thoughts here are thus:

Two major kinds of caches

1) Disk caches - good for not re-doing computationally intensive work that produces larger outputs, at the cost of disk read speeds.

2) RAM caches - very fast, but expensive when it comes to storing larger things.

For disk caches I would recommend either 1) an object store, or 2) a mongo GridFS system. Caching on the disk itself probably wouldn't be viable/desirable if we are dockerizing this and we want other portions of the system to scale well.
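As a rough illustration of what a disk cache buys us (independent of whether it's backed by an object store or GridFS), here is a minimal content-addressed sketch in plain Python. The `DiskCache` class and its `get`/`put` names are hypothetical stand-ins, not anything in the retriever:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical sketch of a content-addressed disk cache. Keys are hashed
# to file paths; an object store or GridFS would play this role for real.
class DiskCache:
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        # Shard into subdirectories so one directory doesn't hold every entry.
        return self.root / digest[:2] / digest[2:]

    def put(self, key, data):
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        path = self._path(key)
        return path.read_bytes() if path.exists() else None
```

The point is only that reads cost a disk round trip rather than recomputation, which is the tradeoff named above.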

For RAM caches I would recommend redis; it is the dominant product in that space right now, as far as I am aware, and we are already using it in other places.

RAM caching would probably be most useful if we leave the API navigable (see discussion #1) to reduce calls to disk and mimic a hierarchical structure, with keys being directory names, and their associated values being arrays of subnodes.
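The hierarchical structure described above can be sketched like this, using a plain dict as a stand-in for redis (in redis the values could be lists or sets keyed by directory path). All function names and the `scan_disk` callback are illustrative assumptions:

```python
# Stand-in for a redis instance: directory path -> list of subnode names.
cache = {}

def cache_listing(directory, subnodes):
    # Key is the directory name; value is the array of its subnodes.
    cache[directory] = list(subnodes)

def get_listing(directory, scan_disk):
    # Serve from RAM when possible; fall back to (and cache) a disk scan.
    if directory not in cache:
        cache_listing(directory, scan_disk(directory))
    return cache[directory]
```

A second request for the same directory never touches disk, which is exactly the call-reduction being proposed.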

If we had a large amount of RAM we could potentially cache jpgs into it, but that might be getting greedy unless we want to devote a large amount of resources to the project. Otherwise, caching jpgs into a GridFS system would utilize disk storage and leverage mongo optimizations for reads.

The one major benefit of redis here would be using key expirations to keep the cache "fresh". I would need to look more closely at disk based cache systems including GridFS to see if they have a similar functionality.
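The expiration behavior redis provides (via the real SETEX/EXPIRE commands) can be sketched in plain Python to show why it keeps the cache "fresh". The class and method names here are illustrative stand-ins, not a redis client:

```python
import time

# Minimal sketch of redis-style key expiration: entries carry a TTL and
# are lazily discarded once stale, so readers never see outdated values.
class ExpiringCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazily expire the stale entry
            return None
        return value
```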

verbalhanglider commented 7 years ago

It sounds to me like a RAM cache with redis is the way to go.

But I would like to know more about the disk-based cache, just so we are covering our bases.

Like you said in the meeting with John on Wednesday, a server with lots of RAM is a potentially powerful use for the recently purchased machine.

bnbalsamo commented 6 years ago

So - caching. Some more thoughts.

Generally speaking: processor time is cheap, RAM is expensive, disk is slow, and the network is slow.

So, is caching jpgs in RAM really worth it to save the processing time of recreating them, even if it incurs a network speed penalty when the redis server is remote? That would be the extreme case, but the same logic may still apply, depending on bottlenecks, if the redis server is local and processing is more easily scalable than RAM, or vice versa.
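The tradeoff can be made concrete with a back-of-envelope calculation. Every number below is a hypothetical placeholder, not a measurement; the real decision should come from profiling:

```python
# Hypothetical latencies, in milliseconds. Integers chosen for clean ratios.
recreate_jpg_ms = 200       # assumed CPU time to re-derive a jpg
local_redis_fetch_ms = 5    # assumed local RAM-cache round trip
remote_redis_fetch_ms = 50  # assumed round trip over the network

# Caching pays off only while fetching beats recomputing...
local_speedup = recreate_jpg_ms / local_redis_fetch_ms
remote_speedup = recreate_jpg_ms / remote_redis_fetch_ms
# ...and with these placeholder numbers the margin shrinks 10x
# the moment the redis server moves off-box.
```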

There's also the issue of re-inventing the wheel here, which I hadn't much considered, but many web servers are configured to do some amount of caching themselves.

This SO post appears to deal with nearly exactly what we are dealing with, and I agree with both responses.

1) Writing a caching proxy is probably something I shouldn't do from scratch if the big guys like nginx and apache are already doing it (wouldn't want application code stepping on web server toes), and it would also be a premature optimization.

2) It would probably be a premature optimization until we see how the retriever performs in a production-esque environment

Addendum: If we do decide to go the nginx/apache route, I should probably be made aware so I can check the cache-specific headers on the responses from the webapp. I should probably have a look at doing this in general anyway, time permitting.
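Checking those headers amounts to something like the following sketch. `Cache-Control` and `ETag` are the real HTTP headers an nginx/apache cache respects; the helper function and its defaults are hypothetical, not the webapp's actual code:

```python
import hashlib

# Hypothetical helper building the cache-specific headers a response
# would need for a proxy (or browser) to cache it sensibly.
def cache_headers(body, max_age=3600):
    # A strong ETag lets the proxy revalidate with If-None-Match
    # instead of re-fetching the full body.
    etag = '"%s"' % hashlib.md5(body).hexdigest()
    return {
        "Cache-Control": "public, max-age=%d" % max_age,
        "ETag": etag,
    }
```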

bnbalsamo commented 6 years ago

Discussion of caching appears to have been kicked down the road. Closing this issue for now.