hackartisan commented 7 years ago

@jrochkind and I have a better sense of what it would mean to do iiif, and what we get by doing iiif, after hydra camp. We'd like to start this discussion by spelling out our requirements very specifically, translating the requirements into design (how a viewer, etc, should impact the page layout), and then assessing possible strategies (including iiif) for fulfilling the requirements / design.

MDiMeo commented 7 years ago

Image viewer functionality -- REQUIRED:

Page turner for books with multiple pages -- i.e. the ability to advance from one image to the next / previous image. Not side-by-side images.
Excellent zoom for our high-res digital objects
Option to download high-res Tiff or smaller Jpeg derivative for a digital object. Note: smaller as in filesize, not necessarily resolution. The goal is a fast download: 3s or less.
Clear presentation of how different parts relate to the whole
- examples: surrounding thumbnails, slider showing position.
- If work is part of another work, make sure this is clear in the UI; e.g. either by showing the parent's viewer or by providing a link.

Image viewer functionality -- DESIRED:

Option to download a PDF of an entire work
Ability to rotate an image sideways
Jump to an image by its list position
Numbers or pagination of some kind for multi-file works (e.g. jump to page by original pagination)
Full-screen viewing option, with clear UI showing how to return to previous view
Optional (default?) tiff download could be compressed?

Initial Design/Functionality Thoughts and Questions:

Do we care if the image viewer opens in a new window, or do we prefer for it to be within the original screen with the metadata?
It is important to present an item in its context (for example, show where a book plate sits within the whole book). We can consider an image viewer that uses thumbnails along the side for this purpose or think about how else we can achieve this. This will lead to larger page redesign questions and whether the Item list view in Sufia is still necessary.
Rights Information must still be clearly displayed by download options. (So far, everyone seems happy with the updates to our current design.)
Interoperability and standardization should be taken into consideration when choosing a viewer, but not at the expense of compromising our required and desired functionality.
A few seconds for a viewer to pop up and images to download is acceptable, but we should be thinking about possible performance issues, as some of our Tiffs are quite large. Consider page-turn load time in this context.

hackartisan commented 7 years ago

@MDiMeo Does page turner mean side-by-side images? Or does it just mean the ability to progress from one to the next?

hackartisan commented 7 years ago

It might be helpful to analyze our data for a distribution of sizes of our tiffs. Our largest image is, I think, 1.2G.

jrochkind commented 7 years ago

@MDiMeo I'm not sure what some of those things mean, like "page turner" either.

I think we should probably start from user requirements like "must be able to view a multi-page book in a reasonable way", and move on to wireframe UI/UX diagrams of how we might do that.

"Numbers or pagination of some kind for multi-file works" -- I don't think we have the metadata to support this. We can tell you it's image 12 out of 84 scanned images, but we can't give you original pagination without metadata I don't believe we have. (For example https://hydra.chemheritage.org/concern/generic_works/1831ck38c, where the page with internal self-contained pagination of 1 is actually the 9th scanned image, because there is a scan of the cover, followed by 7 more pages of preferatory material not given page numbers in the original text, and only then "page 1", which also doesn't actually have a page number on it, but the next one is - 2 - :) )

As far as "distribution of sizes of TIFFs", keep in mind that in no case would we be delivering a TIFF to the browser for on-page view (browsers can't even display them). Turning a (typically uncompressed) TIF into a JPG of equivalent resolution (which is compressed both losslessly and usually lossily too) results in drastic file size reduction. In one example I did, a 100MB TIFF turned into a 3.8MB JPEG.

jrochkind commented 7 years ago

Also, we're going to have design work to do even if we were to go with an existing IIIF-based JS viewer. No matter what path we choose, I don't think this is going to be something where we just "turn on the IIIF and the JS viewer" and we're done.

We need to decide how/where to incorporate it into our site, and probably do additional design and development to build out all those features based on it, if all those features are really requirements. They may not all be "built-in" to existing viewers, but viewers are designed to be customized and composed.

jrochkind commented 7 years ago

Additional things to maybe consider (or not).

Do we want a client without javascript to still be able to view basic web-sized images, and download originals? (Not sure it matters)
Do we want google and other web spiders to be able to see and index all/some of the images in web formats and easy-to-handle resolutions, so they show up in Google Image Search etc? (I think yes).
Might a user want to download a PDF of a multi-page work, assembling all the pages into one document (at some resolution(s))? (I think probably, although it doesn't need to be in the first pass. It would seem to make a lot of sense for a lot of our multi-page works, for instance modern-era memos and reports such as https://hydra.chemheritage.org/concern/generic_works/1831ck38c)

MDiMeo commented 7 years ago

I was using "page turner" as a shorthand for book reader because that's the term I've heard more frequently in digital library circles. Here's an old Hydra thread called page turner(!), but it's really about ways to view a multi-file work like a book: https://groups.google.com/forum/#!topic/hydra-tech/huLHB1qO-4I It's this need we must address, and, no, I don't think side-by-side is a requirement (though I used to. I feel differently now and wonder if others agree.)

We have discussed adding folio/page number metadata in the future if this feature becomes available. The viewer at the Wellcome allows you to toggle between Image number and Page number, and if there is no page number metadata then it is just blank.
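To make the image-number versus page-number distinction concrete, here is a minimal sketch of the toggle behavior described above. The function names and the `page_labels` mapping are hypothetical; we do not currently have this metadata, so the mapping stands in for whatever we might add later.

```python
def image_position_label(index, total):
    """Label based only on scan order, which we can always compute.

    `index` is zero-based, so the first scanned image is index 0.
    """
    return f"Image {index + 1} of {total}"


def page_number_label(index, page_labels=None):
    """Label based on original pagination, blank when metadata is missing.

    `page_labels` is a hypothetical dict mapping zero-based image index
    to the page number as printed in the original work.
    """
    if not page_labels:
        return ""
    return page_labels.get(index, "")


# Hypothetical metadata for a work where the 9th scanned image is "page 1".
labels = {8: "1", 9: "2"}

ninth_image = image_position_label(8, 84)
ninth_page = page_number_label(8, labels)
cover_page = page_number_label(0, labels)
no_metadata = page_number_label(8, None)
```

The scan-order label works for every work today; the page label degrades to an empty string wherever pagination metadata is absent.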

And yes, the PDF download for an entire work is listed under my desired items, too! I definitely think this would help researchers and I've used this feature on other digital libraries.

In terms of functionality, I think everything I want is answered by the Wellcome's viewer: https://wellcomelibrary.org/item/b19703181#?c=0&m=0&s=0&cv=3&z=0.2396%2C0.1636%2C0.5209%2C0.3272&r=0 But we do have some other links and designs that might be helpful for comparison.

jrochkind commented 7 years ago

Sorry, I'm not really good at being concise, I always want to tell the full story! Here's a memo on some technical/operational considerations with IIIF/riiif. We're not necessarily ready to talk about this now, design considerations come first, but I wanted to capture it while it was fresh in my head. I don't know if @MDiMeo really needs to read this, but I would (at the right time) greatly appreciate @sanfordd's thoughts.

Memo on Technical Operational Considerations for IIIF in a Sufia/Hyrax app

11 May 2017

IIIF (International Image Interoperability Framework) is a standard API for a server which delivers on-demand image transformations.

What sort of transformations are we interested in (and IIIF supports)?

Changing image formats
Resizing images (to produce thumbnails or other various display or delivery sizes)
Creating tiled image sources to support high-res zoom-in without having to deliver enormous original source images. (such an operation will involve resizing too to create tiles at different zoom levels, as well as often format changes if the original source is not in JPG or other suitable web format)
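As a rough illustration of how these three transformations are requested, here is a sketch that builds IIIF Image API URLs. It assumes the Image API 2.x path layout (`{identifier}/{region}/{size}/{rotation}/{quality}.{format}`); the server prefix and image identifier are made up for the example, and a given server may only support a subset of these parameters.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Path layout: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Identifiers are percent-encoded so that slashes do not break the path.
    encoded_id = quote(identifier, safe="")
    return f"{base}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


BASE = "https://example.org/iiif"  # hypothetical server prefix
IMAGE_ID = "example-image"         # hypothetical identifier

# Resizing: whole image scaled to 200 pixels wide (a thumbnail).
thumbnail_url = iiif_image_url(BASE, IMAGE_ID, size="200,")

# Format change: whole image at full size, delivered as PNG.
png_url = iiif_image_url(BASE, IMAGE_ID, fmt="png")

# Tiling: a 512x512 region starting at x=1024, y=0, delivered at 512 wide.
tile_url = iiif_image_url(BASE, IMAGE_ID, region="1024,0,512,512", size="512,")
```

A pan-and-zoom viewer issues many requests like `tile_url`, one per visible tile, which is why caching comes up repeatedly in the rest of this memo.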

@jcoyne has created Riiif, an IIIF server in ruby that uses imagemagick to do the heavy lifting. It is a Rails engine that can turn any Rails app into an IIIF server. In addition to it being nice that we know ruby so can tweak it if needed, this also allows it to use your existing ruby logic for looking up original source images from app ids and access controls. It's unclear how you'd handle these things with an external IIIF server in a sufia/hyrax app; to my knowledge, nobody is using anything but riiif.

Keep in mind that you only need a tiled image source when the full-resolution image (or the image at the resolution you desire to allow zoom to) in a JPG format is going to be too large to deliver in its entirety to the browser (at least with reasonable performance). If this isn't true, you can allow pan and zoom in a browser with JS without needing a tiled image source.

And keep in mind that the primary reason you need an on-demand image transformation service (whether for tiled image source or other transformations) is when storing all the transformations you want is going to take more disk space than you can afford or is otherwise feasible. (There are digital repositories with hundreds of thousands or millions of images, each of which needs various transformations).

There is additionally some development/operational convenience to an on-demand transformation aside from disk space issues, but there is a trade-off in additional complexity in other areas -- mainly in dealing with caching and performance.

The first step is defining what UI/UX we want for our app, before being able to decide if an on-demand image transformation server is useful in providing that. But here, we'll skip that step, assume we've arrived at a point from UI/UX to wanting to consider an on-demand image transformation service, and move on to consider some operational issues with deploying RIIIF.

Server/VM separation?

riiif can conceivably be quite resource-intensive. Lots of CPU taken calling out to imagemagick to transform images. Lots of disk IO in reading/writing images (affected by cache and access strategies, see below). Lots of app server http connections/threads taken by clients requesting images -- some of which, depending on caching strategies, can be quite slow-returning requests.

In an ideal scenario, one wouldn't want this running on the same server(s) handling ordinary Rails app traffic, one would want to segregate it so it does not interfere with the main Rails app, and so each can be scaled independently.

This would require some changes to our ansible/capistrano deploy scripts, and some other infrastructure/configuration/deploy setup. The riiif server would probably still need to be deployed as the entire app, so it has access to app-located authorization and retrieval logic; but be limited to only serving riiif routes. This is all do-able, just a bunch of tweaking and configuring to set up.

This may not be necessary even if strictly ideal.

Original image access

The riiif server needs access to the original image bytestreams, so it can transform them.

In the most basic setup, the riiif server somehow has access to the file system fedora bytestreams are stored on, and knows how to find a bytestream for a particular fedora entity on disk.

The downsides of this are that shared file systems are... icky. As is having to reverse engineer fedora's file storage.

Alternately, riiif can be set up to request the original bytestreams from fedora via http, on demand, and cache them in the local (riiif server) file system. The downsides of this are:

performance -- if a non-cached transformation is requested, and the original source image is also not in the local file system cache, riiif first must download it from fedora, before moving on to transform it, and only then delivering it to the client.
cache management -- cache management as a general rule can get surprisingly complicated. If you did not trim/purge the local 'original image source' file system cache at all, it would of course essentially grow to be the size of the complete corpus of images (which are quite large uncompressed TIFFs in our case). That would kind of defeat the purpose of saving file space with an on-demand image transformer in the first place (the actual transformed products are almost always going to be in a compressed format and a fraction of the size of the original TIFFs).
- There is no built-in routine to trim original source file cache, although the basic approach is straightforward, the devil can be in the details.
- To do an LRU cache, you'd need your file system tracking access times. Linux file systems are not infrequently configured with 'noatime' for performance these days, which wouldn't work. Or alternately, you'd need to add code to riiif to track last access time in some other means.
- When trimming, you have to be careful not to trim sources currently being processed by an imagemagick transformation.
- Even if trimming/purging regularly, there is a danger of bursts of access filling up the cache quickly, and possibly exceeding volume space (unless the volume is big enough to hold all original sources of course). For instance, if using riiif for derivatives, one could imagine googlebot or another web spider visiting much of the corpus fairly quickly. (A use case ideally we want to support, the site ought to be easily spiderable)
  - There is of course a trade-off between cache size and overall end-user responsiveness percentiles.
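The trimming concerns above can be sketched in a few lines. This is only an illustration of the basic approach, not anything riiif provides: the function names are hypothetical, and it sidesteps the `noatime` problem by treating the file modification time as the "last used" marker, which assumes the serving code bumps it on every cache hit.

```python
import os
import time


def touch_on_hit(path):
    """Record a cache hit by setting the file's times to now.

    Using mtime as the 'last used' marker avoids depending on atime,
    which is not reliably updated on file systems mounted with noatime.
    """
    os.utime(path, None)


def trim_cache(cache_dir, max_bytes, in_use=frozenset(), min_age_seconds=0):
    """Delete least-recently-used files until the cache fits in max_bytes.

    `in_use` is a set of paths currently being read by a transformation;
    `min_age_seconds` additionally protects very recently used files.
    Returns the list of deleted paths, oldest first.
    """
    entries = []
    total = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        info = os.stat(path)
        total += info.st_size
        entries.append((info.st_mtime, info.st_size, path))

    deleted = []
    now = time.time()
    for mtime, size, path in sorted(entries):  # least recently used first
        if total <= max_bytes:
            break
        if path in in_use or now - mtime < min_age_seconds:
            continue  # never remove a source that may be mid-transformation
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

Note what the sketch does not solve: a burst of requests between trim runs can still fill the volume, so the trim interval and the headroom above `max_bytes` both need to be chosen with spider traffic in mind.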

It is unclear to me how many institutions are using riiif in production, but my sense is that most or even all of them take the direct file system access approach rather than http access with local file cache. Anyone I could find using riiif at ahc was taking this approach, one way or another.

Transformed product caching

Recall a main motivation for using an on-demand image transformer is not having to store every possible derivative (including tiles) on disk.

But there can be a significant delay in producing a transformation. It can depend on size and characteristics of original image; on whether we are using local file system access or http downloading as above (and on whether the original is in local cache if latter); on network speed, disk I/O speed, and imagemagick (cpu) speed.

It's hard to predict what this latency would be, but in the worst case with a very large source image one could conceive of it being a few seconds -- note that's per image, and you could pay it each time you move from page to page in a multi-page work, or even, pathological case, each time you pan or zoom in a pan-and-zoom viewer.

As a result, riiif tries to cache its transformation output.

It uses an ActiveSupport::Cache::Store to do so, by default the one being used by your entire Rails app as Rails.cache. It probably makes sense to separate the riiif cache, so a large volume of riiif products isn't pushing your ordinary app cache content out of the cache and vice versa, and both caches can be sized appropriately, and can even use different cache backends.

ActiveSupport::Cache::Store supports caching in file system, local app memory, or a Memcached instance; or hypothetically you can easily write an adapter for any back-end store you want. But for this use case, anything but file system probably doesn't make sense, it would get too expensive for the large quantity of bytes involved. (Although one could consider things like an S3 store instead of immediate file system, that has its own complications but could be considered).

So we have the same issues to consider we did with http original source cache: performance, and cache management.

Even when something is in the riiif image cache, it's not going to be as fast as an ordinary web-server-served image. ActiveSupport::Cache::Store does not support streaming, so the entire product needs to be read from the cache into local app memory before a byte of it goes to the server. (One could imagine writing an ActiveSupport::Cache::Store adapter that extends the API to support streaming).
- How much slower? Hard to say. I'd guess in the hundreds of ms, maybe less, probably not usually more but there could be pathological edge cases.
- Not actually sure how this compares to serving from fedora, I don't know for sure if the serving from fedora case also needs a local memory copy before streaming to browser. I know some people work around this with nginx tricks, where the nginx server also needs access to fedora filesystem.
And there is still a cache management issue, similar to cache management issues above.

Consider: Third-party CDN

Most commercial sector web apps these days use a third party (or at least external) CDN (Content Delivery Network) -- certainly especially image-heavy ones.

A CDN is basically a third-party cloud-hosted HTTP cache, which additionally distributes the cache geographically to provide very fast access globally.

Using a CDN you effectively can "cache everything"; they usually have pricing structures (in some cases free) that do not limit your storage space significantly. One could imagine putting a CDN in front of some or all of our delivered image assets (originals, derivatives, and tile sources). You could actually turn off riiif's own image caching, and just count on the CDN to cache everything.

This could work out quite well, and would probably be worth considering for our image-heavy site even if we were not using an on-demand IIIF image server -- a specialized CDN can serve images faster than our Rails or local web server can.

Cloudflare is a very popular CDN (significant portions of the web are cached by cloudflare) which offers a free tier that would probably do everything we need.

One downside of a CDN is that it only works for public images; access-controlled images only available to some users don't work in a CDN. In our app, where images are either public or still 'in process', one could imagine pointing at cloudflare CDN cached images for public images, but serving staff-only in-process images locally.
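The public versus in-process split could be as simple as choosing a host when generating each image URL. A minimal sketch, where both host names and the function name are placeholders rather than anything in our app:

```python
def image_delivery_url(image_path, is_public,
                       cdn_host="https://cdn.example.org",
                       app_host="https://app.example.org"):
    """Point public images at the CDN and everything else at our own app.

    Both host names are placeholders. In-process images stay on the app
    host so that normal access controls still apply to them.
    """
    host = cdn_host if is_public else app_host
    return f"{host}/{image_path.lstrip('/')}"


public_url = image_delivery_url("/images/abc/full.jpg", is_public=True)
staff_url = image_delivery_url("/images/xyz/full.jpg", is_public=False)
```

One thing this leaves open is what happens when an image moves from in-process to public, or back: the CDN copy would need to be populated or purged accordingly.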

Another downside is it would make tracking download counts somewhat harder, although probably not insurmountable, there are ways.

Image-specializing CDN or cloud image transformation service

In addition to general purpose CDNs, there exist a number of fairly successful cloud-hosted on-demand image transformation services, that effectively function as image-specific CDNs, with on-demand transformation services. They basically give you what a CDN gives you (including virtually unlimited cache so they can cache everything), plus what an on-demand image transformation service gives you, combined.

One popular one I have used before is imgix. Imgix supports all the features a IIIF server like riiif gives you -- although it does not actually support the IIIF API. Nonetheless, one could imagine using imgix instead of a local IIIF server, even with tools like JS viewers that expect IIIF, by writing a translation gateway, or writing a plugin to (eg) OpenSeadragon to read from imgix. (OpenSeadragon's IIIF support was not original, and was contributed by hydra community). (One could even imagine convincing imgix.com to support IIIF API natively).

imgix is not free, but its pricing is pretty reasonable: "$3 per 1,000 master images accessed each month. 8¢ per GB of CDN bandwidth for images delivered each month." It's difficult for me to estimate how much bandwidth we'd end up paying for (recall our derivatives will be substantially smaller than the original uncompressed TIF sources).
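To get a feel for the quoted pricing, here is a small calculator. The two rates come from the quote above; the monthly volumes in the example are invented purely for illustration, since we have no real traffic estimate yet.

```python
def imgix_monthly_cost(master_images_accessed, cdn_gb_delivered,
                       dollars_per_1000_masters=3.00, dollars_per_gb=0.08):
    """Estimate a monthly bill from the two quoted rates.

    Charges $3 per 1,000 master images accessed plus 8 cents per GB
    of CDN bandwidth delivered, rounded to whole cents.
    """
    master_cost = master_images_accessed / 1000 * dollars_per_1000_masters
    bandwidth_cost = cdn_gb_delivered * dollars_per_gb
    return round(master_cost + bandwidth_cost, 2)


# Hypothetical month: 20,000 distinct masters accessed, 500 GB delivered.
example_cost = imgix_monthly_cost(20_000, 500)
```

With those made-up volumes the estimate comes to $60 for master images plus $40 for bandwidth. The master-image term depends on how many distinct originals get touched in a month, so a spider crawling the whole corpus would raise it even if bandwidth stays modest.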

An image transformation CDN like imgix would almost entirely get us out of worrying about cache management (it takes care of it for us), as well as managing disk space ourselves for storing derivatives, and CPU and other resource issues. It has the same access control and analytics issues as the general CDN.

Consider the lowest-tech solution

Is it possible we can get away without an on-demand image transformation service at all?

For derivatives (alternate formats and sizes of the whole image), we can if we can feasibly manage the disk space to simply store them all.

For pan-and-zoom, we only need a tile-source if our full-resolution images (or images at as high a resolution as we desire to support zoom in a browser to) are too big to deliver to a browser.

Note that in both cases (standard derivative or derived tile-source) JPGs we're delivering to the browser are significantly smaller than the uncompressed source TIFFs. In one simple experiment a 100MB source TIF I chose from our corpus turned into a 3.8MB JPG, and that's without focusing on making the smallest usable/indistinguishable JPG possible.

At least hypothetically, one could even pre-render and store all the sub-images necessary for a tiling pan-and-zoom viewer, without using an on-demand image transformation service.
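To judge whether pre-rendering is feasible, it helps to know how many tiles one image produces. The sketch below counts tiles for a simple pyramid that halves the image at each level until it fits in one tile. The 512-pixel tile size and the 6000x4000 example dimensions are assumptions, and real tilers may differ in details such as overlap.

```python
import math


def tile_pyramid(width, height, tile_size=512):
    """List (scale_factor, columns, rows) for each zoom level.

    Starts at full resolution (scale factor 1) and halves the image
    until the whole thing fits in a single tile.
    """
    levels = []
    scale = 1
    while True:
        scaled_width = math.ceil(width / scale)
        scaled_height = math.ceil(height / scale)
        columns = math.ceil(scaled_width / tile_size)
        rows = math.ceil(scaled_height / tile_size)
        levels.append((scale, columns, rows))
        if columns == 1 and rows == 1:
            break
        scale *= 2
    return levels


def total_tiles(width, height, tile_size=512):
    """Total number of tile files needed across all zoom levels."""
    return sum(columns * rows
               for _, columns, rows in tile_pyramid(width, height, tile_size))


# Hypothetical 6000x4000 pixel scan.
levels = tile_pyramid(6000, 4000)
tile_count = total_tiles(6000, 4000)
```

For that example the full-resolution level alone is 12 by 8 tiles, and the lower levels add roughly a third more on top. Multiplying the per-image count by the number of images in the corpus gives the number of files we would have to store.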

(PS: We might consider storing our original source TIFs as losslessly compressed. I believe they are entirely uncompressed now. Lossless compression could store the images with substantially smaller footprints, losing no original data or resolution).

Conclusion

We have a variety of potentially feasible paths. It's important to remember that none of them are going to be just "install it and flip the switch", they are all going to take some planning and consideration, and some time spent configuring, tweaking, and/or developing.

I guess the exception would be installing riiif in the most naive way possible, and incurring the technical debt of dealing with problems (performance and/or resource consumption) later when they arrive. Although even this would still require some UI/UX design work.

jrochkind commented 7 years ago

FYI: It looks like our original source TIFs are completely uncompressed.

I took an 82MB source TIF, and just saved it as LZW compressed (lossless!) TIF in OSX Preview, and it became 32MB.

My understanding is that lossless LZW compression should lose no image data, it is fully interchangeable with the uncompressed original. (Hypothetically some viewers might not support compression, but realistically any viewer will support a few basic lossless compression methods like LZW; additionally it will take the viewer a bit longer to open the compressed file, but on modern computers it's not an issue).

I wonder if you ever considered making our original source TIFs losslessly compressed? It would significantly reduce our corpus disk size.

Regardless, we probably want to deliver users wanting to download the "original" a losslessly compressed TIFF. It's going to significantly reduce their download time, the space on their workstation to download, as well as our server load and bandwidth usage.

Again, as far as I can tell by researching and by my understanding of compression, a losslessly compressed TIFF using eg LZW or ZIP should be totally and completely interchangeable with its non-compressed brother. We would want to check this with digital archivist community to be sure, but I'm 99% sure. I did find a reference from a few years ago to a 'forthcoming paper' on this topic, I should see if I can follow that up (hooray penncard!).
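The "lossless means interchangeable" claim can be demonstrated in miniature with Python's standard `zlib` module, which implements Deflate, the algorithm behind TIFF's ZIP option. This is only an illustration of the principle on synthetic bytes; it does not read or write TIFF files, and the sample data is made up.

```python
import zlib


def lossless_roundtrip(raw):
    """Compress and decompress bytes, reporting sizes and fidelity.

    Returns (original_size, compressed_size, identical), where
    `identical` is True when every byte survives the round trip.
    """
    packed = zlib.compress(raw, 9)  # 9 is the highest compression level
    restored = zlib.decompress(packed)
    return len(raw), len(packed), restored == raw


# Synthetic stand-in for uncompressed pixel data (repetitive on purpose).
sample = bytes(i % 7 for i in range(100_000))

original_size, compressed_size, identical = lossless_roundtrip(sample)
```

The compression ratio on real scans will be far less dramatic than on this repetitive sample (the experiment above went from 82MB to 32MB, roughly a 61% reduction), but the byte-for-byte equality after decompression is the property that matters for preservation.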

jrochkind commented 7 years ago

https://github.com/aFarkas/lazysizes

hackartisan commented 7 years ago

this ended up being a discussion issue; closing in favor of #617

sciencehistory / chf-sufia

IIIF #312
