samvera / serverless-iiif

IIIF Image API 2.1 & 3.0 server in an AWS Serverless Application
https://samvera.github.io/serverless-iiif/
Apache License 2.0

Question - Local Caching #97

Closed codeclout closed 1 year ago

codeclout commented 1 year ago

Hi,

We have noticed several requests that miss the cache, for example image zoom requests. To mitigate overloading the (HTTP) origin with concurrent lambdas sending zoom requests, we are considering caching the source image at the edge or in memory.

Has the serverless IIIF community considered a local caching scenario to store files retrieved from an external source (jp2), for image viewer processing, to avoid generating dozens of requests to the source for each viewer operation?

Thanks, Brian

mbklein commented 1 year ago

I don't think I'd want this in serverless-iiif core, because it adds complexity to both the code and the deployment in a number of ways.

I know I probably sound like a broken record by now, but a viewer-request Lambda@Edge function is probably your best bet here, too. You could either provision your own source cache bucket, or use the one you already provisioned for >6MB responses, as long as you make sure to avoid naming collisions between cached source images and cached responses.

Extra bonus: It'd solve samvera/node-iiif#25 as well by eliminating the need for the custom HTTP stream resolver.

  1. Inspect the path of the incoming request and identify the source image needed.
  2. Check your source cache S3 bucket to see if the image is in the cache. If not, grab it from the upstream source and store it in the bucket.
    • The best way would be to use byte range requests and a multipart upload, but if your source images are smaller than 5 GB, a regular PutObject operation would work, too. It'd just be a bit slower.
  3. Add an x-preflight-location header to the request with the value s3://<cache_bucket>/<path/to/cached/image>. Might as well set x-preflight-dimensions too while you're at it.
  4. Let the S3 stream source resolver that's already built into serverless-iiif do the rest.
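Steps 1 and 3 above could look roughly like the following. This is a minimal sketch, not code from serverless-iiif: it assumes request paths of the form /iiif/<version>/<id>/..., and the bucket name, the `sources/` key prefix, and both function names are hypothetical. The event layout (`Records[0].cf.request`, with lowercase header names mapping to lists of key/value pairs) is the standard CloudFront Lambda@Edge request shape.

```python
from urllib.parse import unquote

CACHE_BUCKET = "my-source-cache"  # hypothetical bucket name


def image_id_from_uri(uri: str) -> str:
    """Extract the image identifier, ASSUMING /iiif/<version>/<id>/... paths."""
    parts = uri.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "iiif":
        raise ValueError(f"not a IIIF image request: {uri}")
    # Identifiers may be URL-encoded (e.g. %2F for a slash inside the id).
    return unquote(parts[2])


def add_preflight_location(event: dict, cache_key: str) -> dict:
    """Set x-preflight-location on the CloudFront request and return it."""
    request = event["Records"][0]["cf"]["request"]
    request["headers"]["x-preflight-location"] = [
        {
            "key": "x-preflight-location",
            "value": f"s3://{CACHE_BUCKET}/{cache_key}",
        }
    ]
    return request


def handler(event, context):
    """Entry point: map the request to its cached source and tag it."""
    uri = event["Records"][0]["cf"]["request"]["uri"]
    # A dedicated prefix keeps cached sources apart from cached responses
    # if both share one bucket (the prefix itself is an assumption).
    cache_key = f"sources/{image_id_from_uri(uri)}.jp2"
    # Step 2 (check the bucket, fetch and store on a miss) would go here.
    return add_preflight_location(event, cache_key)
```

Returning the modified request object is what tells CloudFront to continue on to the origin with the extra header attached.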

Set a lifecycle rule on the cache bucket to expire objects on day 1, and you're good to go. Unlimited ephemeral source caching with no changes to the core code.
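As an illustration of that lifecycle rule, the sketch below builds the configuration as a plain dictionary in the shape the S3 API expects. The function name and the rule ID are hypothetical, and the empty prefix assumes the whole bucket is dedicated to cached sources.

```python
def expiry_lifecycle(days: int = 1, prefix: str = "") -> dict:
    """Build an S3 lifecycle configuration that expires objects after `days`."""
    if days < 1:
        raise ValueError("S3 expiration is expressed in whole days, minimum 1")
    return {
        "Rules": [
            {
                "ID": f"expire-source-cache-after-{days}d",  # hypothetical name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = every object
                "Expiration": {"Days": days},
            }
        ]
    }
```

With boto3, the result can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. If the bucket is shared with cached responses, a non-empty `prefix` would limit the rule to the source images.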

codeclout commented 1 year ago

@mbklein - thank you.

We have implemented this using the Lambda@Edge Origin Request trigger, passing the x-preflight-location header to the origin, and have adopted your recommendations with slight modifications, noted here for future readers of this post:

  1. Make a HEAD request to check whether the source image (JP2 in our case) already exists in the S3 bucket
  2. If the answer is no, make a request to the HTTP source to retrieve the image
    a. If the answer is yes, set the necessary headers and exit
  3. Save it to the ephemeral storage (/tmp)
  4. Check the image size to determine how it should be uploaded (Managed or Multi-Part Upload) to the S3 source bucket
  5. On upload success set the x-preflight... headers to pass to the origin
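For future readers, the size check in step 4 might be sketched as below. This is not our production code: the 64 MiB threshold and 16 MiB part size are arbitrary illustrative defaults, and the function names are hypothetical. The 5 MiB minimum part size (for every part except the last) and the 10,000-part ceiling are S3 multipart upload limits. The same byte ranges can drive ranged GET requests against the HTTP source.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts per upload


def plan_upload(size: int, threshold: int = 64 * MIB, part_size: int = 16 * MIB):
    """Return ("single", []) or ("multipart", [(start, end), ...]).

    Byte ranges are inclusive, matching HTTP Range semantics.
    """
    if size < 0:
        raise ValueError("size must be non-negative")
    if size <= threshold:
        return "single", []
    # Respect the S3 minimum, and grow parts if the file would otherwise
    # need more than MAX_PARTS of them (ceiling division).
    part_size = max(part_size, MIN_PART_SIZE, -(-size // MAX_PARTS))
    ranges = [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]
    return "multipart", ranges


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"
```

For a 100 MiB image with the defaults, this yields seven parts: six of 16 MiB and a final one of 4 MiB.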

Using the Origin Request trigger allowed for more flexibility with CPU, processing time and dependencies.

Thanks Again. Brian