ozkatz / cloudzip

list and get specific files from remote zip archives without downloading the whole thing
Apache License 2.0

Caching mechanisms #7

Open nukemberg opened 4 months ago

nukemberg commented 4 months ago
> probably worth caching the zipfile index in memory, to make repeated calls faster. This probably requires refactoring the central directory parser. @ozkatz wdyt

Sure - we can do that in another PR. I'm assuming you also mean introducing some form of invalidation? Because, as suggested, we can just parse before the server starts serving and assume the directory/zip contents are immutable (which cz already assumes elsewhere).
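
For illustration, a minimal sketch of what an in-memory CDR cache could look like in Go (the project's language). All names here (`cdrCache`, `cdrEntry`, the offset layout) are hypothetical, not cloudzip's actual types:

```go
package cache

import "sync"

// cdrEntry is a hypothetical parsed central directory, remembered together
// with the ETag the archive had when it was parsed.
type cdrEntry struct {
	etag  string
	files map[string]int64 // inner filename -> offset of its local header
}

// cdrCache maps a remote archive URL to its parsed central directory.
type cdrCache struct {
	mu      sync.RWMutex
	entries map[string]*cdrEntry
}

func newCDRCache() *cdrCache {
	return &cdrCache{entries: map[string]*cdrEntry{}}
}

// get returns a cached CDR if present; on a miss the caller parses the
// remote central directory and stores the result with put.
func (c *cdrCache) get(url string) (*cdrEntry, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.entries[url]
	return e, ok
}

func (c *cdrCache) put(url string, e *cdrEntry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[url] = e
}
```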

Originally posted by @ozkatz in https://github.com/ozkatz/cloudzip/issues/6#issuecomment-2061852575

nukemberg commented 4 months ago

I suggest 2 levels of caching:

1. An in-memory cache of the parsed CDR (central directory record).
2. An on-disk cache of inner files read from the archive.

The cache can be invalidated using the ETag header. We could issue HEAD requests to check for invalidation, but that costs at least one request per lookup, so we might as well just re-read the CDR. So we need to serve the CDR from cache while avoiding extra requests, and we can honor standard HTTP caching headers (in case someone bothers to set them).

Byte-range GET requests also return an ETag, which we can use to detect invalidation during a read: the read aborts, the CDR is re-read, and the read is executed again. This is a bit cumbersome, and it only works if the requested filename is present in the CDR. When a filename is not in the CDR, we can either invalidate the CDR outright (if people ask for a file, they probably know it should be there) or trigger a HEAD request, potentially invalidating the cache and causing another GET.
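
A hedged sketch of that ETag flow using Go's standard `net/http`; `headETag` and `rangeGet` are illustrative names. Sending `If-Match` is one way to get the abort-on-change behavior; comparing the `ETag` header on the ranged response against the cached value works too:

```go
package cache

import (
	"fmt"
	"net/http"
)

// headETag issues a HEAD request and returns the archive's current ETag,
// if the server provides one.
func headETag(url string) (string, error) {
	resp, err := http.Head(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return resp.Header.Get("ETag"), nil
}

// rangeGet reads a byte range from the archive, failing fast if the object
// changed since the CDR was cached. The caller reacts to that error by
// re-reading the CDR and retrying the read.
func rangeGet(url, cachedETag string, start, end int64) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	// If-Match makes a compliant server answer 412 when the ETag no longer
	// matches, instead of serving bytes from a changed object. Comparing the
	// ETag on the 206 response against cachedETag is the alternative.
	req.Header.Set("If-Match", cachedETag)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode == http.StatusPreconditionFailed {
		resp.Body.Close()
		return nil, fmt.Errorf("archive changed; CDR cache must be invalidated")
	}
	return resp, nil
}
```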

For the on-disk inner-files cache, we need to consider CDR invalidation (which automatically invalidates all the inner files). We can trigger an additional invalidation check using a HEAD request.
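
One way to get that cascade is to embed the archive's ETag in the on-disk cache key, so a new ETag simply misses every entry cached under the old one (stale files can be garbage-collected later). A sketch, with hypothetical helpers:

```go
package cache

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// cachePath derives a filesystem path from (etag, innerFile); when the ETag
// changes, every lookup against the old archive version misses.
func cachePath(root, etag, innerFile string) string {
	sum := sha256.Sum256([]byte(etag + "\x00" + innerFile))
	return filepath.Join(root, hex.EncodeToString(sum[:]))
}

func readCached(root, etag, innerFile string) ([]byte, bool) {
	data, err := os.ReadFile(cachePath(root, etag, innerFile))
	return data, err == nil
}

func writeCached(root, etag, innerFile string, data []byte) error {
	return os.WriteFile(cachePath(root, etag, innerFile), data, 0o600)
}
```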

I need to check whether (and by how much) HEAD requests are faster than GET requests for the CDR. If they aren't, the ETag won't be of much use and we'll need to settle for time-based expiry.
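
If it does come to time-based expiry, it can be as small as a fetched-at timestamp next to the cached value; the TTL below is an arbitrary example, not a proposed default:

```go
package cache

import "time"

// expiring wraps a cached value with the time it was fetched.
type expiring[V any] struct {
	value   V
	fetched time.Time
}

const cdrTTL = 5 * time.Minute // arbitrary example; would likely be configurable

// valid reports whether the cached value is still within its TTL.
func (e expiring[V]) valid(now time.Time) bool {
	return now.Sub(e.fetched) < cdrTTL
}
```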