ruven / iipsrv

iipsrv is an advanced high-performance feature-rich image server for web-based streamed viewing and zooming of ultra high-resolution images.
https://iipimage.sourceforge.io
GNU General Public License v3.0

Support for cloud object storage (S3, Swift) ? #231

Open sguimmara opened 2 years ago

sguimmara commented 2 years ago

Hello,

Does IIPImage support accessing images stored somewhere other than a filesystem, such as cloud object storage (and particularly Swift)?

Thank you

ruven commented 2 years ago

Not at the moment. To do it you'd have to find a way to mount your Swift storage as a virtual file system.

But it would indeed be good to be able to access things like Swift and AWS storage directly through IIPImage. It is likely to be quite inefficient, however, unless your images are optimized for cloud storage: you'd have to use something like Cloud Optimized GeoTIFF (https://www.cogeo.org/) to make sure random access is fast enough.

sguimmara commented 2 years ago

Thanks for the answer!

Indeed, we are currently facing a dilemma. We must move our JPEG 2000 images from ordinary filesystems into Swift storage, but we were worried that IIPImage would no longer be able to serve them. From what I understand from your answer, our current options are:

ruven commented 2 years ago

As your images are currently in JPEG2000 format, I think your best and most flexible option at this stage would be to mount a virtual FS backed by your Swift storage.

In the longer term, I'll look into adding native Swift support to IIPImage to avoid the need for a virtual FS. In such a configuration, a format such as COG would be much faster than using JPEG2000 or normal TIFF.

scossu commented 2 years ago

When I started evaluating IIIF image servers for my institution, I was initially taken aback by the lack of storage options in IIPImage. Afterwards, I actually found this limitation to be a good thing that keeps IIPImage simple and reliable. S3 is several times slower than an SSD directly attached to the server or mounted via NFS over a fiber-channel network.

We ended up writing a small piece of Python middleware that does the following:

Along with some basic maintenance tools, such as clearing the cache volume on demand and routinely pruning older files, it's a relatively simple, low-maintenance addition that allows you to have your image sources anywhere, without depending on the image server's features. It also lets you provide fast access to frequently used sources without paying a fortune to store all your images on SSD.
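A minimal sketch of the "routinely pruning older files" part, for illustration only. This is not the actual middleware; the function name `prune_cache` and the choice of modification time as the age criterion are assumptions.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_cache(cache_dir: str, max_age_seconds: float,
                now: Optional[float] = None) -> List[str]:
    """Delete cached files older than max_age_seconds; return removed paths.

    Hypothetical helper. Age is measured from the modification time,
    because access times are often disabled (noatime mounts).
    """
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deletions don't disturb the walk.
    for path in list(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(str(path))
    return removed
```

Something like `prune_cache("/var/cache/images", 30 * 24 * 3600)` could be run from cron; the path and the 30-day threshold are placeholders.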

sguimmara commented 2 years ago

@scossu Thanks for the report!

Our situation is that we have ~150 TB of JPEG 2000 images (around 4 million files) that are currently served from a filesystem through IIPImage. The vast majority of those images are rarely, if ever, going to be served, and will remain in the archive for years without anyone touching them. More are added every month.

Converting everything to COG in advance seems like huge overkill in terms of processing power and storage cost, since COG would be 3 to 10 times bigger than the original image to maintain lossless quality.

My initial thought was to keep the JPEG2000 archive as is, but generate a lossy COG copy with gdal_translate for browser use (served via a simple HTTP server), and offer the original JPEG2000 as a direct download if requested. Generating a COG on the fly is not instant (it takes several seconds), however, so the almost zero latency of Cloud Optimized GeoTIFF would be offset by this initial conversion time.
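For reference, a sketch of what that conversion step could look like when driven from Python. It assumes GDAL 3.1 or later (which ships the `COG` output driver) is installed and on the PATH; the function names and the default quality value are illustrative choices.

```python
import subprocess
from typing import List


def cog_command(src: str, dst: str, quality: int = 85) -> List[str]:
    """Build a gdal_translate call that writes a JPEG-compressed COG."""
    return [
        "gdal_translate",
        "-of", "COG",                  # COG driver (GDAL >= 3.1)
        "-co", "COMPRESS=JPEG",        # lossy tiles for browser viewing
        "-co", f"QUALITY={quality}",   # JPEG quality, 1-100
        src,
        dst,
    ]


def convert_to_cog(src: str, dst: str, quality: int = 85) -> None:
    """Run the conversion; raises CalledProcessError if gdal_translate fails."""
    subprocess.run(cog_command(src, dst, quality), check=True)
```

Keeping the command builder separate from the `subprocess.run` call makes it easy to log or inspect the exact invocation before running it on a large archive.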

In your scenario, the file would be fetched from object storage into a nearby cache to be served as is by IIPImage.

Off the top of my head, I don't know which scenario would have the lowest latency from user request to image display, but yours seems like it should be faster, since there would be no conversion step. However, once the COG is generated, it would be served by a simple HTTP server. This would appear to scale better and would offload a lot of work from the backend.

ruven commented 2 years ago

@scossu's solution is indeed a good option. The only drawback is that the very first request to a new image not in cache will be very slow as you have to copy the whole file across first. All subsequent requests will, however, be very fast.
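The slow-first-request, fast-afterwards behaviour can be sketched as a fetch-on-miss cache. This is only an illustration: `ensure_cached` is a hypothetical name, and `fetch` stands in for whatever Swift or S3 client call downloads an object to a local path. Keys are assumed to be trusted relative paths.

```python
import os
from typing import Callable


def ensure_cached(key: str, cache_dir: str,
                  fetch: Callable[[str, str], None]) -> str:
    """Return a local path for `key`, downloading it on a cache miss.

    `fetch(key, dest_path)` is a caller-supplied (hypothetical) function
    that copies the whole object from remote storage to dest_path.
    """
    local_path = os.path.join(cache_dir, key)
    if os.path.exists(local_path):
        return local_path                  # cache hit: no remote access
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    tmp_path = local_path + ".part"
    fetch(key, tmp_path)                   # slow: copies the whole file
    os.replace(tmp_path, local_path)       # atomic rename, so the image
    return local_path                      # server never sees a partial file
```

The image server is then pointed at `cache_dir` and only ever sees ordinary local files.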

Regarding the use of COG directly through HTTP, it really depends what you want to be able to do with the images. Don't forget that COG is still a TIFF file. The only difference between COG and classic TIFF is just related to how the internal TIFF metadata in the file is ordered. COG puts all this information at the beginning of the file, whereas in classic TIFF, this can be scattered throughout. A COG HTTP request will give you direct access to the compressed tiles and not to transcoded images as you would get with an image server such as IIPImage. You also won't be able to get anything that isn't a tile, such as image overviews, arbitrary regions or be able to apply any image processing, unless you handle this through some client-side javascript.
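To make the "metadata at the beginning of the file" point concrete, here is a small sketch that reads the TIFF header and returns where the first image file directory (IFD) starts. In a COG that offset is close to the start of the file, so a client can obtain the tile index from one small HTTP range request. The function name is an illustrative choice.

```python
import struct


def first_ifd_offset(header: bytes) -> int:
    """Return the byte offset of the first IFD, given the first 16 bytes
    (or more) of a TIFF or BigTIFF file."""
    order = {b"II": "<", b"MM": ">"}.get(header[:2])  # little/big endian
    if order is None:
        raise ValueError("not a TIFF file")
    (magic,) = struct.unpack(order + "H", header[2:4])
    if magic == 42:
        # Classic TIFF: 4-byte offset directly after the magic number.
        (offset,) = struct.unpack(order + "I", header[4:8])
    elif magic == 43:
        # BigTIFF: offset size (2 bytes), padding (2 bytes), 8-byte offset.
        (offset,) = struct.unpack(order + "Q", header[8:16])
    else:
        raise ValueError("unknown TIFF magic number")
    return offset
```

In a classic TIFF the returned offset can point anywhere in the file, which is why reading one over HTTP may need several round trips before any tile can be located.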

If you want the fastest possible access to tiles with no intervening image server and no need for client-side JS, then the old Zoomify or Deepzoom approach would be your best bet - you just pre-generate the JPEG tiles and store them all as separate files on your cloud server, which would serve them directly to the browser.
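As an illustration of the pre-generated approach, this sketch enumerates the tile files a Deep Zoom pyramid would contain for a given image size, using the usual `<name>_files/<level>/<column>_<row>.<ext>` layout. The function name and defaults are illustrative, and tile overlap is ignored because it does not affect the file names.

```python
from typing import Iterator


def deepzoom_tile_paths(width: int, height: int, tile_size: int = 256,
                        name: str = "image", ext: str = "jpg") -> Iterator[str]:
    """Yield the relative path of every tile in a Deep Zoom pyramid."""
    # Highest level holds the full-resolution image: ceil(log2(max side)).
    max_level = (max(width, height) - 1).bit_length()
    for level in range(max_level + 1):
        scale = 2 ** (max_level - level)
        level_w = -(-width // scale)       # ceiling division
        level_h = -(-height // scale)
        cols = -(-level_w // tile_size)
        rows = -(-level_h // tile_size)
        for col in range(cols):
            for row in range(rows):
                yield f"{name}_files/{level}/{col}_{row}.{ext}"
```

Every one of these paths is a separate static object, so the number of stored files grows quickly, which matters for an archive of millions of images.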

since COG would be 3 to 10 times bigger than the original image to maintain lossless quality.

By the way, a lossless tiled pyramid TIFF (COG or not) will be about twice as large as lossless JPEG2000 (and similar in size to the raw image).

sguimmara commented 2 years ago

it really depends what you want to be able to do with the images

The goal is to be able to visualize the (8-bit) images in a browser. The users are the general public. So: pan and zoom, that's it. No need for image processing.

You also won't be able to get anything that isn't a tile, such as image overviews, arbitrary regions or be able to apply any image processing, unless you handle this through some client-side javascript.

If we switched to COG, we would use OpenLayers with its GeoTIFF source. This would remove most of the load from the server, and reduce the costs.

If we stayed with JPEG2000s, we would probably keep IIPImage, with the aforementioned drawbacks.

joesong168 commented 1 year ago

or client-side JS, then the old Zoomify or Deepzoom

Agreed, this would be the fastest way to serve tiles at large scale. However, maybe we could get both speed and dynamic serving. If we think of IIP serving as including the computing power of the browser, why couldn't we manipulate tiles in the browser, using a platform such as WASM, and keep all tiles static on the image server?

ruven commented 1 year ago

If we think of IIP serving as including the computing power of the browser, why couldn't we manipulate tiles in the browser, using a platform such as WASM, and keep all tiles static on the image server?

Yes, if you use something like COG, you would have direct access to the raw image data and would be able to offload all processing to the browser itself through JS and WASM.

joesong168 commented 1 year ago

Excuse me, what's COG?

sguimmara commented 1 year ago

Excuse me, what's COG?

@joesong168 Cloud Optimized GeoTIFF is a way to stream imagery (a bit like what iipsrv does). The benefit of COGs is that they don't need a dedicated service (a simple HTTP server is enough).

joesong168 commented 1 year ago

@sguimmara Thanks for your explanation. Have you tried JuiceFS? It is a cloud-native file system that might suit your needs. https://juicefs.com/en/