sciencehistory / chf-sufia

sufia-based hydra app

IIIF #312

Closed hackartisan closed 7 years ago

hackartisan commented 7 years ago

@jrochkind and I have a better sense of what it would mean to do iiif, and what we get by doing iiif, after hydra camp. We'd like to start this discussion by spelling out our requirements very specifically, translating the requirements into design (how a viewer, etc, should impact the page layout), and then assessing possible strategies (including iiif) for fulfilling the requirements / design.

MDiMeo commented 7 years ago

Image viewer functionality -- REQUIRED:

Image viewer functionality -- DESIRED:

Initial Design/Functionality Thoughts and Questions:

hackartisan commented 7 years ago

@MDiMeo Does page turner mean side-by-side images? Or does it just mean the ability to progress from one to the next?

hackartisan commented 7 years ago

It might be helpful to analyze our data for a distribution of sizes of our tiffs. Our largest image is, I think, 1.2G.

jrochkind commented 7 years ago

@MDiMeo I'm not sure what some of those things mean, like "page turner" either.

I think we should probably start from user requirements like "must be able to view a multi-page book in a reasonable way", and move on to wireframe UI/UX diagrams of how we might do that.

"Numbers or pagination of some kind for multi-file works" -- I don't think we have the metadata to support this. We can tell you it's image 12 out of 84 scanned images, but we can't give you original pagination without metadata I don't believe we have. (For example https://hydra.chemheritage.org/concern/generic_works/1831ck38c, where the page with internal self-contained pagination of 1 is actually the 9th scanned image, because there is a scan of the cover, followed by 7 more pages of prefatory material not given page numbers in the original text, and only then "page 1", which also doesn't actually have a page number on it, but the next one is - 2 - :) )

As far as "distribution of sizes of TIFFs" goes, keep in mind that in no case would we be delivering a TIFF to the browser for on-page view (browsers can't even display them). Turning a (typically uncompressed) TIFF into a JPG of equivalent resolution (which is compressed both losslessly and usually lossily too) results in drastic file size reduction. In one example I did, a 100MB TIFF turned into a 3.8MB JPEG.

jrochkind commented 7 years ago

Also, we're going to have design work to do even if we were to go with an existing IIIF-based JS viewer. No matter what path we choose, I don't think this is going to be something where we just "turn on the IIIF and the JS viewer" and we're done.

We need to decide how/where to incorporate it into our site, and probably do additional design and development to build out all those features based on it, if all those features are really requirements. They may not all be "built-in" to existing viewers, but viewers are designed to be customized and composed.

jrochkind commented 7 years ago

Additional things to maybe consider (or not).

MDiMeo commented 7 years ago

I was using "page turner" as a shorthand for book reader because that's the term I've heard more frequently in digital library circles. Here's an old Hydra thread called page turner(!), but it's really about ways to view a multi-file work like a book: https://groups.google.com/forum/#!topic/hydra-tech/huLHB1qO-4I It's this need we must address, and, no, I don't think side-by-side is a requirement (though I used to. I feel differently now and wonder if others agree.)

We have discussed adding folio/page number metadata in the future if this feature becomes available. The viewer at the Wellcome allows you to toggle between Image number and Page number, and if there is no page number metadata then it is just blank.

And yes, the PDF download for an entire work is listed under my desired items, too! I definitely think this would help researchers and I've used this feature on other digital libraries.

In terms of functionality, I think everything I want is answered by the Wellcome's viewer: https://wellcomelibrary.org/item/b19703181#?c=0&m=0&s=0&cv=3&z=0.2396%2C0.1636%2C0.5209%2C0.3272&r=0 But we do have some other links and designs that might be helpful for comparison.

jrochkind commented 7 years ago

Sorry, I'm not really good at being concise, I always want to tell the full story! Here's a memo on some technical/operational considerations with IIIF/riiif. We're not necessarily ready to talk about this now; design considerations come first, but I wanted to capture it while it was fresh in my head. I don't know if @MDiMeo really needs to read this, but I would (at the right time) greatly appreciate @sanfordd's thoughts.

Memo on Technical Operational Considerations for IIIF in a Sufia/Hyrax app

11 May 2017

IIIF (International Image Interoperability Framework) is a standard API for a server which delivers on-demand image transformations.

What sort of transformations are we interested in (and IIIF supports)?
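
As a rough illustration of how IIIF expresses transformations: the IIIF Image API encodes each request as URL path segments in the order identifier/region/size/rotation/quality.format. The sketch below builds such URLs in plain ruby. The helper name `iiif_image_url` and the `https://example.org/iiif` base URL are made up for the example (they are not riiif's API or our app's routes), and the identifier is only a sample value.

```ruby
require "erb"

# Hypothetical helper: build a IIIF Image API request URL.
# Path layout: {identifier}/{region}/{size}/{rotation}/{quality}.{format}
def iiif_image_url(base_url, identifier, region: "full", size: "full",
                   rotation: 0, quality: "default", format: "jpg")
  # Identifiers are percent-encoded so that a "/" inside an id
  # is not mistaken for a path separator.
  escaped_id = ERB::Util.url_encode(identifier)
  "#{base_url}/#{escaped_id}/#{region}/#{size}/#{rotation}/#{quality}.#{format}"
end

base = "https://example.org/iiif" # assumed endpoint, for illustration only

# Whole image scaled to 800px wide (a typical "derivative").
puts iiif_image_url(base, "1831ck38c", size: "800,")

# One 1024x1024 region from the top-left corner, delivered 512px wide
# (the kind of request a tiling pan-and-zoom viewer makes).
puts iiif_image_url(base, "1831ck38c", region: "0,0,1024,1024", size: "512,")
```

The same URL scheme covers both "derivatives" (whole image, different size or format) and tiles (a region at a given size), which is why one server can back both thumbnails and a pan-and-zoom viewer.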

@jcoyne has created Riiif, an IIIF server in ruby that uses imagemagick to do the heavy lifting; it is a Rails engine that can turn any Rails app into an IIIF server. In addition to it being nice that we know ruby and so can tweak it if needed, this also allows it to use your existing ruby logic for looking up original source images from app ids and for access controls. It's unclear how you'd handle these things with an external IIIF server in a sufia/hyrax app; to my knowledge, nobody is using anything but riiif.

Keep in mind that you need a tiled image source only when the full-resolution image (or the image at the resolution you desire to allow zoom to) in a JPG format is going to be too large to deliver in its entirety to the browser (at least with reasonable performance). If this isn't true, you can allow pan and zoom in a browser with JS without needing a tiled image source.

And keep in mind that the primary reason you need an on-demand image transformation service (whether for a tiled image source or other transformations) is that storing all the transformations you want would take more disk space than you can afford or than is otherwise feasible. (There are digital repositories with hundreds of thousands or millions of images, each of which needs various transformations.)

There is additionally some development/operational convenience to an on-demand transformation aside from disk space issues, but there is a trade-off in additional complexity in other areas -- mainly in dealing with caching and performance.

The first step is defining what UI/UX we want for our app, before being able to decide if an on-demand image transformation server is useful in providing that. But here, we'll skip that step, assume we've arrived at a point from UI/UX to wanting to consider an on-demand image transformation service, and move on to consider some operational issues with deploying RIIIF.

Server/VM separation?

riiif can conceivably be quite resource-intensive. Lots of CPU taken calling out to imagemagick to transform images. Lots of disk IO in reading/writing images (affected by cache and access strategies, see below). Lots of app server http connections/threads taken by clients requesting images -- some of which, depending on caching strategies, can be quite slow-returning requests.

In an ideal scenario, one wouldn't want this running on the same server(s) handling ordinary Rails app traffic, one would want to segregate it so it does not interfere with the main Rails app, and so each can be scaled independently.

This would require some changes to our ansible/capistrano deploy scripts, and some other infrastructure/configuration/deploy setup. The riiif server would probably still need to be deployed as the entire app, so it has access to app-located authorization and retrieval logic; but be limited to only serving riiif routes. This is all do-able, just a bunch of tweaking and configuring to set up.

This may not be necessary even if strictly ideal.

Original image access

The riiif server needs access to the original image bytestreams, so it can transform them.

In the most basic setup, the riiif server somehow has access to the file system fedora bytestreams are stored on, and knows how to find a bytestream for a particular fedora entity on disk.

The downsides of this are that shared file systems are... icky. As is having to reverse engineer fedora's file storage.

Alternately, riiif can be set up to request the original bytestreams from fedora via http, on demand, and cache them in the local (riiif server) file system. The downsides of this are:

It is unclear to me how many institutions are using riiif in production, but my sense is that most or even all of them take the direct file system access approach rather than http access with local file cache. Anyone I could find using riiif at ahc was taking this approach, one way or another.

Transformed product caching

Recall a main motivation for using an on-demand image transformer is not having to store every possible derivative (including tiles) on disk.

But there can be a significant delay in producing a transformation. It can depend on size and characteristics of original image; on whether we are using local file system access or http downloading as above (and on whether the original is in local cache if latter); on network speed, disk I/O speed, and imagemagick (cpu) speed.

As a result, riiif tries to cache its transformation output.

It uses an ActiveSupport::Cache::Store to do so, by default the one being used by your entire Rails app as Rails.cache. It probably makes sense to separate the riiif cache, so a large volume of riiif products isn't pushing your ordinary app cache content out of the cache and vice versa, and both caches can be sized appropriately, and can even use different cache backends.
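
To illustrate why separate caches help, here is a toy sketch in plain ruby. `BoundedCache` is a made-up stand-in for a size-limited cache store (it is not the ActiveSupport or riiif API, and real stores are usually bounded by bytes rather than entry count). The point is only that, with two distinct caches, a flood of image entries evicts other image entries rather than ordinary app content.

```ruby
# Toy stand-in for a size-limited cache store, NOT the ActiveSupport API.
# Evicts the least recently used entry once max_entries is exceeded.
class BoundedCache
  def initialize(max_entries)
    @max_entries = max_entries
    @store = {} # ruby hashes keep insertion order, oldest first
  end

  # Return the cached value for key, or compute it with the block and store it.
  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # re-insert to mark as most recently used
      @store[key] = value
      return value
    end
    value = yield
    @store[key] = value
    @store.delete(@store.keys.first) if @store.size > @max_entries
    value
  end

  def key?(key)
    @store.key?(key)
  end
end

app_cache   = BoundedCache.new(2) # ordinary app fragments
image_cache = BoundedCache.new(2) # riiif-style transformation output

app_cache.fetch("homepage") { "<html>...</html>" }

# Three image requests overflow the image cache only.
%w[tile-1 tile-2 tile-3].each do |name|
  image_cache.fetch(name) { "bytes for #{name}" }
end

puts app_cache.key?("homepage") # app content is untouched by image traffic
puts image_cache.key?("tile-1") # the oldest image entry has been evicted
```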

ActiveSupport::Cache::Store supports caching in file system, local app memory, or a Memcached instance; or hypothetically you can easily write an adapter for any back-end store you want. But for this use case, anything but file system probably doesn't make sense; it would get too expensive for the large quantity of bytes involved. (Although one could consider things like an S3 store instead of immediate file system; that has its own complications but could be considered.)

So we have the same issues to consider we did with http original source cache: performance, and cache management.

Consider: Third-party CDN

Most commercial sector web apps these days use a third party (or at least external) CDN (Content Delivery Network) -- certainly especially image-heavy ones.

A CDN is basically a third-party cloud-hosted HTTP cache, which additionally distributes the cache geographically to provide very fast access globally.

Using a CDN you can effectively "cache everything"; they usually have pricing structures (in some cases free) that do not limit your storage space significantly. One could imagine putting a CDN in front of some or all of our delivered image assets (originals, derivatives, and tile sources). You could actually turn off riiif's own image caching, and just count on the CDN to cache everything.

This could work out quite well, and would probably be worth considering for our image-heavy site even if we were not using an on-demand IIIF image server -- a specialized CDN can serve images faster than our Rails or local web server can.

Cloudflare is a very popular CDN (significant portions of the web are cached by cloudflare) which offers a free tier that would probably do everything we need.

One downside of a CDN is that it only works for public images; access-controlled images available only to some users don't work in a CDN. In our app, where images are either public or still 'in process', one could imagine pointing at cloudflare CDN cached images for public images, but serving staff-only in-process images locally.
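
A minimal sketch of that split, assuming a simple public/not-public flag per image. The host names and the `image_url_for` helper are hypothetical, not part of sufia or our app.

```ruby
# Hypothetical hosts, for illustration only.
CDN_HOST = "https://cdn.example.org"
APP_HOST = "https://app.example.org"

# Pick where an image URL should point based on visibility:
# public images go through the CDN, in-process images stay on our own server
# so that the app's access controls still apply.
def image_url_for(path, public_image:)
  host = public_image ? CDN_HOST : APP_HOST
  "#{host}#{path}"
end

puts image_url_for("/images/abc123/full.jpg", public_image: true)
puts image_url_for("/images/abc123/full.jpg", public_image: false)
```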

Another downside is it would make tracking download counts somewhat harder, although probably not insurmountable, there are ways.

Image-specializing CDN or cloud image transformation service

In addition to general purpose CDNs, there exist a number of fairly successful cloud-hosted on-demand image transformation services that effectively function as image-specific CDNs with on-demand transformation services. They basically give you what a CDN gives you (including virtually unlimited cache so they can cache everything), plus what an on-demand image transformation service gives you, combined.

One popular one I have used before is imgix. Imgix supports all the features a IIIF server like riiif gives you -- although it does not actually support the IIIF API. Nonetheless, one could imagine using imgix instead of a local IIIF server, even with tools like JS viewers that expect IIIF, by writing a translation gateway, or writing a plugin to (eg) OpenSeadragon to read from imgix. (OpenSeadragon's IIIF support was not original, and was contributed by the hydra community.) (One could even imagine convincing imgix.com to support the IIIF API natively.)

imgix is not free, but its pricing is pretty reasonable: "$3 per 1,000 master images accessed each month. 8¢ per GB of CDN bandwidth for images delivered each month." It's difficult for me to estimate how much bandwidth we'd end up paying for (recall our derivatives will be substantially smaller than the original uncompressed TIF sources).
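
To make the quoted pricing concrete, here is a back-of-the-envelope estimator. The two rates come from the quote above; the usage figures (10,000 master images, 500 GB) are invented purely for illustration, and the sketch assumes charges scale linearly with no minimums or tiers, which may not match the actual billing rules.

```ruby
# Rates taken from the pricing quoted above.
PRICE_PER_1000_MASTER_IMAGES = 3.00 # dollars
PRICE_PER_GB_BANDWIDTH       = 0.08 # dollars

# Estimate a monthly bill, assuming simple linear pricing (an assumption).
def imgix_monthly_cost(master_images_accessed:, bandwidth_gb:)
  image_cost     = (master_images_accessed / 1000.0) * PRICE_PER_1000_MASTER_IMAGES
  bandwidth_cost = bandwidth_gb * PRICE_PER_GB_BANDWIDTH
  (image_cost + bandwidth_cost).round(2)
end

# Hypothetical month: 10,000 master images accessed, 500 GB delivered.
puts imgix_monthly_cost(master_images_accessed: 10_000, bandwidth_gb: 500)
```

With those made-up inputs, the image charge is $30 and the bandwidth charge is $40, so bandwidth is the number we would most need to pin down.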

An image transformation CDN like imgix would almost entirely get us out of worrying about cache management (it takes care of it for us), as well as managing disk space ourselves for storing derivatives, and CPU and other resource issues. It has the same access control and analytics issues as the general CDN.

Consider the lowest-tech solution

Is it possible we can get away without an on-demand image transformation service at all?

For derivatives (alternate formats and sizes of the whole image), we can if we can feasibly manage the disk space to simply store them all.

For pan-and-zoom, we only need a tile-source if our full-resolution images (or images at as high a resolution as we desire to support zooming to in a browser) are too big to deliver to a browser.

Note that in both cases (standard derivative or derived tile-source) the JPGs we're delivering to the browser are significantly smaller than the uncompressed source TIFFs. In one simple experiment a 100MB source TIF I chose from our corpus turned into a 3.8MB JPG, and that's without focusing on making the smallest usable/indistinguishable JPG possible.
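
For scale, the arithmetic on that one experiment (the helper name is made up, and a single sample says nothing about the distribution across the corpus):

```ruby
# Percentage by which a derived file is smaller than its original.
def percent_reduction(original_mb, derived_mb)
  ((original_mb - derived_mb) / original_mb.to_f * 100).round(1)
end

# The 100MB TIF -> 3.8MB JPG experiment mentioned above.
puts percent_reduction(100, 3.8)
```

That is a reduction of roughly 96%, or about a 26x smaller file, for that one image.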

At least hypothetically, one could even pre-render and store all the sub-images necessary for a tiling pan-and-zoom viewer, without using an on-demand image transformation service.
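
To get a feel for how many files pre-rendering would mean, here is a rough count for a simplified tile pyramid: tile the full-resolution image, halve the dimensions, tile again, and stop once the image fits in a single tile. This is a sketch only; real tile formats differ in details (for example, how far down the levels go and whether tiles overlap), and the 6000x8000 dimensions and 512px tile size are assumptions, not measurements of our scans.

```ruby
# Tiles per level for a simplified halving pyramid (full resolution first).
def pyramid_tile_counts(width, height, tile_size: 512)
  counts = []
  w = width
  h = height
  loop do
    columns = (w.to_f / tile_size).ceil
    rows    = (h.to_f / tile_size).ceil
    counts << columns * rows
    break if w <= tile_size && h <= tile_size # whole image fits in one tile
    w = (w / 2.0).ceil
    h = (h / 2.0).ceil
  end
  counts
end

counts = pyramid_tile_counts(6000, 8000) # hypothetical scan dimensions
puts counts.inspect
puts counts.sum
```

Because each level has about a quarter of the tiles of the one above it, the total is only about a third more than the full-resolution level alone.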

(PS: We might consider storing our original source TIFs losslessly compressed. I believe they are entirely uncompressed now. Lossless compression could store the images with substantially smaller footprints, losing no original data or resolution.)

Conclusion

We have a variety of potentially feasible paths. It's important to remember that none of them are going to be just "install it and flip the switch", they are all going to take some planning and consideration, and some time spent configuring, tweaking, and/or developing.

I guess the exception would be installing riiif in the most naive way possible, and incurring the technical debt of dealing with problems (performance and/or resource consumption) later when they arrive. Although even this would still require some UI/UX design work.

jrochkind commented 7 years ago

FYI: It looks like our original source TIFs are completely uncompressed.

I took an 82MB source TIF, and just saved it as LZW compressed (lossless!) TIF in OSX Preview, and it became 32MB.

My understanding is that lossless LZW compression should lose no image data, it is fully interchangeable with the uncompressed original. (Hypothetically some viewers might not support compression, but realistically any viewer will support a few basic lossless compression methods like LZW; additionally it will take the viewer a bit longer to open the compressed file, but on modern computers it's not an issue).

I wonder if you ever considered making our original source TIFs losslessly compressed? It would significantly reduce our corpus disk size.

Regardless, we probably want to deliver users wanting to download the "original" a losslessly compressed TIFF. It's going to significantly reduce their download time, the space on their workstation to download, as well as our server load and bandwidth usage.

Again, as far as I can tell by researching and by my understanding of compression, a losslessly compressed TIFF using eg LZW or ZIP should be totally and completely interchangeable with its non-compressed brother. We would want to check this with the digital archivist community to be sure, but I'm 99% sure. I did find a reference from a few years ago to a 'forthcoming paper' on this topic, I should see if I can follow that up (hooray penncard!).
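
The lossless property itself is easy to demonstrate on raw bytes. The sketch below uses ruby's stdlib zlib (Deflate, the algorithm family behind TIFF's ZIP option) on synthetic data; it does not read or write TIFF files, and the highly repetitive sample bytes compress far better than a real scan would, so only the round-trip result is meaningful here, not the ratio.

```ruby
require "zlib"

# Stand-in for uncompressed pixel data: 1,000,000 bytes of a repeating pattern.
original = "\x00\x10\x20\x30".b * 250_000

compressed = Zlib::Deflate.deflate(original)
restored   = Zlib::Inflate.inflate(compressed)

puts "original:   #{original.bytesize} bytes"
puts "compressed: #{compressed.bytesize} bytes"
puts "identical after round trip: #{restored == original}"
```

Comparing the decoded pixel data of a compressed TIFF against its uncompressed source (for example by checksum) would be the equivalent check for our own files.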

jrochkind commented 7 years ago

https://github.com/aFarkas/lazysizes

hackartisan commented 7 years ago

this ended up being a discussion issue; closing in favor of #617