ozkatz / cloudzip

list and get specific files from remote zip archives without downloading the whole thing
Apache License 2.0

Caching mechanisms #7

Open nukemberg opened 4 months ago

nukemberg commented 4 months ago
> probably worth caching the zipfile index in memory, to make repeated calls faster. This probably requires refactoring the central directory parser. @ozkatz wdyt

Sure - we can do that in another PR. I'm assuming you also mean introducing some form of invalidation? Because, as suggested, we can just parse before the server starts serving and assume the directory/zip contents are immutable (which cz already assumes elsewhere).
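
For illustration, a minimal sketch of what an in-memory CDR cache could look like in Go (the project's language). All names here (`cdrCache`, `cdrEntry`, the offset layout) are hypothetical, not cloudzip's actual types:

```go
package cache

import "sync"

// cdrEntry is a hypothetical parsed central directory, remembered together
// with the ETag the archive had when it was parsed.
type cdrEntry struct {
	etag  string
	files map[string]int64 // inner filename -> offset of its local header
}

// cdrCache maps a remote archive URL to its parsed central directory.
type cdrCache struct {
	mu      sync.RWMutex
	entries map[string]*cdrEntry
}

func newCDRCache() *cdrCache {
	return &cdrCache{entries: map[string]*cdrEntry{}}
}

// get returns a cached CDR if present; on a miss the caller parses the
// remote central directory and stores the result with put.
func (c *cdrCache) get(url string) (*cdrEntry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[url]
	return e, ok
}

func (c *cdrCache) put(url string, e *cdrEntry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[url] = e
}
```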

Originally posted by @ozkatz in https://github.com/ozkatz/cloudzip/issues/6#issuecomment-2061852575

nukemberg commented 4 months ago

I suggest 2 levels of caching:

1. An in-memory cache of the parsed CDR (central directory record).
2. An on-disk cache of inner files read from the archive.

The cache can be invalidated using the ETag header. We could issue HEAD requests to check for invalidation, but that costs at least one request per lookup, so we might as well just re-read the CDR. So we need to serve the CDR from cache while avoiding extra requests, and we can honor standard HTTP caching headers (in case someone bothers to set them).

Byte-range GET requests also return an ETag, which we can use to detect invalidation during a read: the read aborts, the CDR is re-read, and the read is executed again. This is a bit cumbersome, and it only works if the requested filename is present in the CDR. When a filename is not in the CDR, we can either invalidate the CDR outright (if people ask for a file, they probably know it should be there) or trigger a HEAD request, potentially invalidating the cache and causing another GET.
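
A hedged sketch of that ETag flow using Go's standard `net/http`; `headETag` and `rangeGet` are illustrative names. Sending `If-Match` is one way to get the abort-on-change behavior; comparing the `ETag` header on the ranged response against the cached value works too:

```go
package cache

import (
	"fmt"
	"net/http"
)

// headETag issues a HEAD request and returns the archive's current ETag,
// if the server provides one.
func headETag(url string) (string, error) {
	resp, err := http.Head(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get("ETag"), nil
}

// rangeGet reads a byte range from the archive, failing fast if the object
// changed since the CDR was cached. The caller reacts to that error by
// re-reading the CDR and retrying the read.
func rangeGet(url, cachedETag string, start, end int64) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	// If-Match makes a compliant server answer 412 when the ETag no longer
	// matches, instead of serving bytes from a changed object. Comparing the
	// ETag on the 206 response against cachedETag is the alternative.
	req.Header.Set("If-Match", cachedETag)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode == http.StatusPreconditionFailed {
		resp.Body.Close()
		return nil, fmt.Errorf("archive changed; CDR cache must be invalidated")
	}
	return resp, nil
}
```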

For the on-disk inner-files cache, we need to consider CDR invalidation (which automatically invalidates all the inner files). We can trigger an additional invalidation check using a HEAD request.
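
One way to get that cascade is to embed the archive's ETag in the on-disk cache key, so a new ETag simply misses every entry cached under the old one (stale files can be garbage-collected later). A sketch, with hypothetical helpers:

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// cachePath derives a filesystem path from (etag, innerFile); when the ETag
// changes, every lookup against the old archive version misses.
func cachePath(root, etag, innerFile string) string {
	sum := sha256.Sum256([]byte(etag + "\x00" + innerFile))
	return filepath.Join(root, hex.EncodeToString(sum[:]))
}

func readCached(root, etag, innerFile string) ([]byte, bool) {
	data, err := os.ReadFile(cachePath(root, etag, innerFile))
	return data, err == nil
}

func writeCached(root, etag, innerFile string, data []byte) error {
	return os.WriteFile(cachePath(root, etag, innerFile), data, 0o600)
}
```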

I need to check whether (and by how much) HEAD requests are faster than GET requests for the CDR. If they aren't, the ETag won't be of much use and we'll need to settle for time-based expiry.
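
If it does come to time-based expiry, it can be as small as a fetched-at timestamp next to the cached value; the TTL below is an arbitrary example, not a proposed default:

```go
package cache

import "time"

// expiring wraps a cached value with the time it was fetched.
type expiring[V any] struct {
	value   V
	fetched time.Time
}

const cdrTTL = 5 * time.Minute // arbitrary example; would likely be configurable

// valid reports whether the cached value is still within its TTL.
func (e expiring[V]) valid(now time.Time) bool {
	return now.Sub(e.fetched) < cdrTTL
}
```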