internetarchive / iiif

The official Internet Archive IIIF service
GNU General Public License v3.0

Decide on Cantaloupe reverse proxy strategy #5

digitaldogsbody closed this 11 months ago

digitaldogsbody commented 1 year ago

We previously decided to use a reverse proxy in the nginx instance running inside the IA IIIF service container, so that images can be served from Cantaloupe at https://iiif.archive.org/image URLs.

Because some identifiers need to keep their URL-encoding (e.g. rashodgson68%2frashodgson68_jp2.zip%2frashodgson68_jp2%2frashodgson68_0007.jp2), the configuration is not as trivial as a normal nginx reverse proxy setup.

The crux of the issue is that by default nginx unescapes the request URL (and so the above becomes rashodgson68/rashodgson68_jp2.zip/rashod....). If this is then passed to Cantaloupe, the decoded slashes are treated as path separators, so the path no longer fits the <identifier>/<region>/<size>/<rotation>/<quality>.<format> route and Cantaloupe returns an error response.
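To make the failure concrete, here is a small sketch in Python (not nginx or Cantaloupe code) that counts path segments before and after unescaping. The IIIF parameters `full/max/0/default.jpg` are an assumption chosen for illustration, and `route_segments` is a hypothetical helper that only mimics splitting a path on `/`.

```python
from urllib.parse import unquote

# Example identifier from this issue, with its slashes percent-encoded.
ENCODED_ID = "rashodgson68%2frashodgson68_jp2.zip%2frashodgson68_jp2%2frashodgson68_0007.jp2"
# Hypothetical IIIF parameters, used only for illustration.
IIIF_PARAMS = "full/max/0/default.jpg"


def route_segments(path: str) -> list[str]:
    """Split a path into the segments a router would match against."""
    return [segment for segment in path.split("/") if segment]


encoded_path = f"{ENCODED_ID}/{IIIF_PARAMS}"
decoded_path = unquote(encoded_path)  # stand-in for nginx's default unescaping

# The route expects 5 segments: identifier, region, size, rotation, quality.format
print(len(route_segments(encoded_path)))  # 5
print(len(route_segments(decoded_path)))  # 8
```

With the encoding intact the path has the five segments the route expects. Once unescaped, the identifier alone contributes four segments, so the route cannot match.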

The way we get around this is to completely replace the URL that nginx processes after it has done its block matching on the unescaped version. This is done by using a rewrite rule to replace the entire URL with $request_uri, which is the original request as made by the client, with no modification done by nginx (as opposed to $uri, which will be unescaped and have the path prefix removed).

However, once we're using a modified URL, the behaviour of proxy_pass changes. The default behaviour if you specify a normal proxy redirect (i.e. proxy_pass https://example.com/service) is that the final URL to be sent gets re-escaped before being passed to the target server, and thus we end up with rashodgson68%252frashodgson68_jp2.zip%252fras....., which subsequently produces a 404 in Cantaloupe. (relevant ticket in nginx bug tracker: https://trac.nginx.org/nginx/ticket/727)
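The double-escaping can be reproduced outside nginx. The sketch below uses Python's `urllib.parse.quote` as a stand-in for the re-escaping step. It is not nginx's actual escaping routine, only an illustration of what happens when an already-escaped string is escaped again.

```python
from urllib.parse import quote, unquote

# Example identifier from this issue, already percent-encoded by the client.
ENCODED_ID = "rashodgson68%2frashodgson68_jp2.zip%2frashodgson68_jp2%2frashodgson68_0007.jp2"

# Escaping an already-escaped string turns every "%" into "%25",
# so "%2f" becomes "%252f".
double_encoded = quote(ENCODED_ID, safe="")
print(double_encoded)

# Decoding once (as the upstream would) gives back the literal "%2f" text,
# not the slashes, so the identifier no longer names a real item.
print(unquote(double_encoded) == ENCODED_ID)  # True
```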

We can get round this by extracting the relevant parts of the original request URL into variables using regex, and building the final URL manually (proxy_pass https://example.com/iiif/$1/$2;). However, this requires a further workaround for nginx.
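As a rough illustration of the capture-and-rebuild idea, here is the same logic in Python rather than nginx config. The `/image/` prefix, the split into "identifier" and "everything after it", and the names `RAW_URI_PATTERN` and `build_upstream_url` are all assumptions made for the sketch. The real regex lives in the nginx configuration and may capture different groups.

```python
import re

# Placeholder upstream taken from the proxy_pass example above.
UPSTREAM_BASE = "https://example.com/iiif"

# Assumed public prefix "/image/". Group 1 is the still-encoded identifier,
# group 2 is the rest of the path (region/size/rotation/quality.format).
RAW_URI_PATTERN = re.compile(r"^/image/([^/?]+)/([^?]+)")


def build_upstream_url(raw_request_uri: str) -> str:
    """Rebuild the upstream URL from the raw, unmodified request URI."""
    match = RAW_URI_PATTERN.match(raw_request_uri)
    if match is None:
        raise ValueError(f"unexpected request URI: {raw_request_uri!r}")
    identifier, params = match.groups()
    # Plain string concatenation: nothing is decoded or re-escaped here,
    # so "%2f" in the identifier reaches the upstream unchanged.
    return f"{UPSTREAM_BASE}/{identifier}/{params}"


raw = "/image/rashodgson68%2frashodgson68_jp2.zip%2frashodgson68_jp2%2frashodgson68_0007.jp2/full/max/0/default.jpg"
print(build_upstream_url(raw))
```

The point of the design is that the captured pieces are never passed through an escaping or unescaping step, which is what the variable-based proxy_pass achieves.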

From the proxy_pass docs:

In some cases, the part of a request URI to be replaced cannot be determined:

  • When the URI is changed inside a proxied location using the rewrite directive, and this same configuration will be used to process a request (break)
  • When variables are used in proxy_pass

In effect, nginx no longer resolves the proxy address once at startup (its normal behaviour, which uses the system resolver). Instead it constructs the address for every request, because it cannot know in advance whether the modified URL given in the directive will change the final domain.

The practical upshot of this is that we need a way for nginx to construct the URL of the Cantaloupe server at request time and there are two possible approaches:

1) By specifying the target server address with an IP, and using HTTP headers to indicate which domain we are requesting (PR: #2)
2) By providing a resolver for nginx to use at runtime to resolve the final URL once constructed (PR: #3)
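The difference between the two options can be sketched as request construction in Python. This is only an analogy for what nginx does internally, not the contents of either PR. The IP `192.0.2.10` is a documentation address, `example.com` is a placeholder, and both function names and the injected `resolve` callable are hypothetical.

```python
from typing import Callable
from urllib.request import Request


def request_via_fixed_ip(ip: str, host: str, path: str) -> Request:
    """Option 1: address the upstream by IP and name the site in the Host header."""
    return Request(f"http://{ip}{path}", headers={"Host": host})


def request_via_resolver(host: str, path: str, resolve: Callable[[str], str]) -> Request:
    """Option 2: look the hostname up at request time with a supplied resolver."""
    return Request(f"http://{resolve(host)}{path}", headers={"Host": host})


path = "/iiif/rashodgson68%2frashodgson68_jp2.zip%2frashodgson68_jp2%2frashodgson68_0007.jp2/full/max/0/default.jpg"

# Option 1: no lookup is needed, but the IP must be kept correct by hand.
fixed = request_via_fixed_ip("192.0.2.10", "example.com", path)

# Option 2: a fake resolver stands in for a real DNS lookup in this sketch.
resolved = request_via_resolver("example.com", path, resolve=lambda name: "192.0.2.10")

print(fixed.full_url == resolved.full_url)  # True
print(fixed.get_header("Host"))             # example.com
```

Both options end up sending the same request. The trade-off is between pinning an address in the configuration and depending on a resolver being reachable at request time.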

(Other relevant reading: https://trac.nginx.org/nginx/ticket/1930 https://trac.nginx.org/nginx/ticket/2335 https://trac.nginx.org/nginx/ticket/2259)

digitaldogsbody commented 11 months ago

This is now complete and deployed to production, and will be merged with the post-meeting-fixes branch.