loris-imageserver / loris

Loris IIIF Image Server

Use Entity Tags in HTTP resolver to cache source images #281

Open scossu opened 7 years ago

scossu commented 7 years ago

This ticket is to implement an entity tag (ETag)–based cache in the HTTP resolver (https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.26).
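The linked RFC section covers the If-None-Match request header, which lets a client send a known ETag and receive 304 Not Modified when the resource is unchanged. A minimal sketch of such a conditional request, using only the Python standard library; the function names and the injectable `opener` parameter are assumptions for illustration, not part of Loris:

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None):
    """Build a GET request that carries If-None-Match when an ETag is known."""
    headers = {"If-None-Match": etag} if etag else {}
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304."""
    request = build_conditional_request(url, etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; the cached copy is current.
            return None, etag
        raise
```

With this approach a single request either confirms the cached copy or delivers the new image, at the cost of relying on the source server honouring If-None-Match.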

Not all HTTP servers support this feature so this should be a configurable option.

Proposed implementation:

Depending on coding convenience, this may be a better fit for a separate resolver (subclass of SimpleHTTPResolver). In that case the current caching mechanism can be bypassed completely in favor of this.

A purge function could be implemented separately; in that case, Loris would have to delete the UID-ETag pair along with the cached image in order to be able to fetch the content again.
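A purge along those lines might look like the sketch below. It assumes the cached image and its ETag record live together in one directory per identifier under a cache root; the function name, the parameters, and that layout are hypothetical and not taken from the Loris codebase:

```python
import shutil
from pathlib import Path


def purge_cached_source(cache_root, ident):
    """Remove the cached image and its ETag record for one identifier.

    Assumes (hypothetically) one directory per image under cache_root,
    named after the identifier. Returns True if something was removed.
    """
    image_dir = Path(cache_root) / ident
    if image_dir.is_dir():
        # Deleting the whole directory drops the image and the ETag together,
        # so the next request is forced to fetch from the source again.
        shutil.rmtree(image_dir)
        return True
    return False
```

Removing both in one step avoids a state where a stale ETag points at a missing image, or the reverse.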

alexwlchan commented 6 years ago

I would very much like to get this working, as we’re seeing a small number of 500s from our Loris instance which are triggered by the SimpleHTTPResolver source dropping the connection. If we have ETag caching, we’d reduce the load on our HTTP source.

The simplest thing would be to store the ETag in a JSON file; something like:

# etag.json
{
    "source": "https://private.myhttpsource.org/V1234.jpg",
    "value": "0123456789abcdef"
}

which lives in the HTTP resolver cache alongside the image itself. (The HTTP resolver cache has a directory per image.)
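Reading and writing that file could be sketched as follows. The helper names and the `image_dir` parameter are assumptions for illustration; only the `source` and `value` keys and the `etag.json` file name come from the example above:

```python
import json
from pathlib import Path

ETAG_FILENAME = "etag.json"


def save_etag(image_dir, source, value):
    """Write the ETag record next to the cached image."""
    path = Path(image_dir) / ETAG_FILENAME
    path.write_text(json.dumps({"source": source, "value": value}))


def load_etag(image_dir):
    """Return the stored record as a dict, or None if missing or unreadable."""
    path = Path(image_dir) / ETAG_FILENAME
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # Treat a missing or corrupt record the same as "no cached ETag".
        return None
```

Returning None for a missing or corrupt file means callers only need to handle one "no ETag" case.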

So the logic for fetching an image becomes something like:

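# Pseudocode: the helper names are placeholders, not existing Loris functions.
if image_is_in_cache():
    old_etag = load_etag_from_json_cache()   # dict parsed from etag.json
    new_etag = get_etag_from_head_request()  # same shape, built from a HEAD response
    if (
        old_etag["source"] == new_etag["source"] and
        old_etag["value"] == new_etag["value"]
    ):
        # Source unchanged: serve the cached copy.
        return cached_image()
# Not cached, or the ETag changed: fetch the image again.
return fetch_image()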

I’d probably tweak the logic to shortcut the HEAD request if you know the ETags aren’t going to match (e.g. if you don’t have a cached ETag), but it gives the general idea.
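A sketch of that logic with the shortcut applied, assuming the cached ETag is a dict with `source` and `value` keys (or None when nothing is stored). All function and parameter names here are hypothetical, and the fetch and cache steps are passed in as callables so the decision logic stands on its own:

```python
import urllib.request


def etag_from_head(url, opener=urllib.request.urlopen):
    """Issue a HEAD request and return an ETag record, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        value = response.headers.get("ETag")
    return {"source": url, "value": value} if value else None


def resolve_source(cached_etag, head_etag, fetch_image, cached_image):
    """Serve the cached image only when the source's ETag still matches.

    head_etag, fetch_image and cached_image are zero-argument callables.
    """
    if cached_etag is None:
        # No stored ETag, so a match is impossible: skip the HEAD request.
        return fetch_image()
    current = head_etag()
    if current is not None and current == cached_etag:
        return cached_image()
    # The ETag changed, or the server no longer sends one.
    return fetch_image()
```

Comparing the two dicts directly checks both `source` and `value` in one step.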

What do other people think of this suggestion? I’m particularly interested in @bcail and @scossu’s thoughts, but other opinions welcome.

I won’t write or deploy this before the New Year, but it’ll probably be near the top of my todo list when I get back.

bcail commented 6 years ago

So the goal here is for Loris to automatically update its cached source images, by checking the source HTTP server for an update on each request? I think Loris currently just checks whether a source image is in the cache - if it is, it uses it, and doesn't hit the source HTTP server at all.

If we go this direction, I like the idea of having a configuration option - that would let users turn it off if the source server doesn't have ETags or if they don't want the performance hit of so many requests to the source server.
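One way such a switch might be read from a resolver's config is sketched below. The `use_etag` key is a hypothetical name, not an existing Loris option, and the sketch assumes the config behaves like a dict whose values may arrive as either booleans or strings:

```python
def etag_checks_enabled(config):
    """Read a hypothetical 'use_etag' flag from a dict-like resolver config.

    Defaults to False so that, unless enabled, a cache hit never contacts
    the source server.
    """
    value = config.get("use_etag", False)
    if isinstance(value, str):
        # Config files often deliver strings; accept common spellings of "on".
        return value.strip().lower() in ("1", "true", "yes", "on")
    return bool(value)
```

Defaulting to off keeps the current behaviour for anyone who does not opt in.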

scossu commented 6 years ago

We implemented this in production for a period of time: https://github.com/aic-collections/loris/commit/3e3a67372fa11aac3373796acc87a94db7f227a5