Rich screenshot support via IIIF server layer

anjackson commented 3 years ago

If we wrap IIIF around the page screenshotter, we get a lot of the features we'll need, like easy specification of sizes etc, for different purposes.

To make this work, given the format of IIIF URIs, we could use PWIDs and Base64 encode them, e.g.

urn:pwid:webarchive.org.uk:2008-11-29T00:41:42Z:page:http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/2008sigprops.aspx

Becomes...

dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4

Which we use as the identifier in the IIIF {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format} URLs, like this:

/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
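As a rough sketch of the encoding step (not the eventual implementation), the following turns a PWID into a path-safe identifier and slots it into the IIIF URL template. The function names `pwid_to_identifier` and `iiif_path` are made up for illustration. It also assumes the URL-safe Base64 alphabet, since the standard alphabet can emit `/`, which would clash with the IIIF path separators.

```python
import base64


def pwid_to_identifier(pwid: str) -> str:
    """Encode a PWID URN so it fits in a single IIIF path segment.

    Assumption: URL-safe Base64 ('-' and '_' instead of '+' and '/').
    """
    return base64.urlsafe_b64encode(pwid.encode("utf-8")).decode("ascii")


def iiif_path(identifier: str, region: str = "full", size: str = "full",
              rotation: str = "0", quality: str = "default",
              fmt: str = "jpg", prefix: str = "/iiif/2") -> str:
    """Fill in {prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"


pwid = ("urn:pwid:webarchive.org.uk:2008-11-29T00:41:42Z:page:"
        "http://www.jisc.ac.uk/whatwedo/programmes/"
        "programme_preservation/2008sigprops.aspx")
print(iiif_path(pwid_to_identifier(pwid), region="0,0,200,200"))
```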

This uses the page-level precision-spec, as that is what makes sense in this context. The URL prefix would then have to distinguish between the archived and crawl-time images:

/render/archive/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
/render/capture/iiif/2/dXJuOnB3aWQ6d2ViYXJjaGl2ZS5vcmcudWs6MjAwOC0xMS0yOVQwMDo0MTo0Mlo6cGFnZTpodHRwOi8vd3d3Lmppc2MuYWMudWsvd2hhdHdlZG8vcHJvZ3JhbW1lcy9wcm9ncmFtbWVfcHJlc2VydmF0aW9uLzIwMDhzaWdwcm9wcy5hc3B4/0,0,200,200/full/0/default.jpg
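Wherever the routing ends up living (a proxy, or the service behind the image server), the prefix split could be handled along these lines. This is only a sketch: `RenderRequest` and `parse_render_path` are hypothetical names, and the pattern assumes exactly the two `/render/archive` and `/render/capture` prefixes shown above.

```python
import re
from typing import NamedTuple, Optional


class RenderRequest(NamedTuple):
    source: str       # "archive" (archived copy) or "capture" (crawl-time image)
    identifier: str   # the encoded PWID
    region: str
    size: str
    rotation: str
    quality: str
    fmt: str


# One path segment per IIIF parameter; the last segment is {quality}.{format}.
_PATH_RE = re.compile(
    r"^/render/(archive|capture)/iiif/2/"
    r"([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/.]+)\.([A-Za-z0-9]+)$"
)


def parse_render_path(path: str) -> Optional[RenderRequest]:
    """Split a request path into its source and IIIF parts, or return None."""
    match = _PATH_RE.match(path)
    return RenderRequest(*match.groups()) if match else None
```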

This could be done by running a Cantaloupe IIIF image server, which wraps plain image servers nicely, is used by our partners, and has lots of nice features like handling caching. This would pass the Base64 PWID on to a modified webrender-puppeteer which would decode the pwid64 and render the page at full size and ideally at high resolution. Cantaloupe would then cache this output and handle generating all necessary derivatives.
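The decoding step on the renderer side might look something like the sketch below. `Pwid` and `decode_pwid64` are illustrative names, not part of webrender-puppeteer, and the pattern assumes the full `YYYY-MM-DDThh:mm:ssZ` timestamp form used in the examples in this issue; other PWID time formats would need more work.

```python
import base64
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive: str     # e.g. "webarchive.org.uk"
    timestamp: str   # e.g. "2008-11-29T00:41:42Z"
    precision: str   # e.g. "page"
    target: str      # the archived URL


# The timestamp and the target URL both contain ':', so match the
# timestamp explicitly rather than splitting the URN on ':'.
_PWID_RE = re.compile(
    r"^urn:pwid:(?P<archive>[^:]+):"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision>[a-z]+):(?P<target>.+)$"
)


def decode_pwid64(pwid64: str) -> Pwid:
    """Decode a Base64 PWID and split it into its parts."""
    # Restore any '=' padding that was stripped from the identifier.
    padded = pwid64 + "=" * (-len(pwid64) % 4)
    text = base64.urlsafe_b64decode(padded).decode("utf-8")
    match = _PWID_RE.match(text)
    if match is None:
        raise ValueError(f"Not a recognised PWID: {text!r}")
    return Pwid(**match.groupdict())
```

Rejecting anything that does not parse at this point also gives the renderer a simple guard against being asked to fetch arbitrary input.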

Cantaloupe can also overlay e.g. the UKWA logo which might work quite nicely.

(We could also add http://labs.mementoweb.org/aggregator_config/archivelist.xml and use that to determine the right web archive endpoint for other archives.)

anjackson commented 3 years ago

Initial experimentation with Cantaloupe using a stock image,

https://github.com/ukwa/webrender-api/blob/d5d85c0f36d588fbf7046c44f10953c80a7508b1/docker-compose.yml#L47-L71

...and looks like it'll work nicely. Even with just URLs, e.g.

http://dev1.n45.wa.bl.uk:8182/iiif/2/https%3A%2F%2Fwww.webarchive.org.uk%2Fwayback%2Farchive%2F19950418155600%2Fhttp%3A%2F%2Fportico.bl.uk%2F/0,0,1366,1366/800,/0/default.png

i.e. the URL has to be percent-encoded, but then it works. The simplest implementation would be to add an endpoint to webrender-api that unpacks PWIDs so we can pass in the timestamp. We could allow direct or Base64-encoded forms. However, we need to talk to CDX to determine access rights, so the best idea is to have an internal API on ukwa-access-api that manages the PWID, limits access, etc.
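To illustrate the unpacking, here is a minimal sketch that turns a PWID timestamp and URL into the percent-encoded Wayback identifier used in the test URL above. The name `wayback_identifier` is hypothetical, and the Wayback prefix is simply copied from that URL rather than taken from any configuration.

```python
from datetime import datetime
from urllib.parse import quote

# Taken from the example URL above; an assumption for any other deployment.
WAYBACK_PREFIX = "https://www.webarchive.org.uk/wayback/archive"


def wayback_identifier(pwid_time: str, url: str) -> str:
    """Build a percent-encoded Wayback URL usable as an IIIF identifier."""
    # PWID time (2008-11-29T00:41:42Z) -> 14-digit Wayback timestamp.
    stamp = datetime.strptime(pwid_time, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y%m%d%H%M%S")
    # safe="" so that ':' and '/' are encoded too.
    return quote(f"{WAYBACK_PREFIX}/{stamp}/{url}", safe="")


print(wayback_identifier("1995-04-18T15:56:00Z", "http://portico.bl.uk/"))
```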

anjackson commented 3 years ago

The basic functionality was fairly straightforward. For example (for those with access to DEV only right now):

The PWID has to be URL or Base64 encoded, so you can't pass e.g. urn:pwid:webarchive.org.uk:1995-04-18T15:56:00Z:page:http://portico.bl.uk/ in directly. Therefore, I added a helper API that constructs the PWID and redirects to the IIIF endpoint. e.g.
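The core of such a helper could be sketched as below. This is not the code that was added: `pwid_redirect` is a made-up name, the archive ID and `/iiif/2` prefix are taken from the examples in this issue, and the full-size PNG target is an arbitrary choice.

```python
import base64
from datetime import datetime
from typing import Tuple


def pwid_redirect(url: str, wayback_ts: str,
                  archive: str = "webarchive.org.uk",
                  prefix: str = "/iiif/2") -> Tuple[int, str]:
    """Return (status, Location) for a redirect to the IIIF image."""
    # 14-digit Wayback timestamp -> PWID time, e.g. 1995-04-18T15:56:00Z.
    when = datetime.strptime(wayback_ts, "%Y%m%d%H%M%S")
    pwid = f"urn:pwid:{archive}:{when.strftime('%Y-%m-%dT%H:%M:%SZ')}:page:{url}"
    # URL-safe Base64 keeps the identifier inside one path segment.
    identifier = base64.urlsafe_b64encode(pwid.encode("utf-8")).decode("ascii")
    return 302, f"{prefix}/{identifier}/full/full/0/default.png"


status, location = pwid_redirect("http://portico.bl.uk/", "19950418155600")
print(status, location)
```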

IDEAS:

Implement test suite to cover new functionality
Update docs/README
Should parse/check PWID at the first pass through the API
Should also redirect to closest timestamp to reduce variation

ukwa / ukwa-services

Rich screenshot support via IIIF server layer #24