uoregon-libraries / rais-image-server

RAIS: A IIIF-compliant, 100% open source image server for blazing-fast deep zooming

Creative Commons Zero v1.0 Universal

Serving large JPEG files fails after retrieving info.json #38

Closed ghost closed 3 years ago

ghost commented 3 years ago

The latest Docker images fail to serve large JPEG files once info.json has been retrieved. Versions prior to 4.0.1 are not affected.

To reproduce, I followed these steps:

1. Serve a 10MB JPEG file through a RAIS Docker image ≥ 4.0.1 (e.g. https://upload.wikimedia.org/wikipedia/commons/f/ff/Pizigani_1367_Chart_10MB.jpg).
2. Retrieve the info.json for this image. This works once.
3. Any other call to this image (info, crop, resize) then fails with a 500.
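
The steps above can be scripted. The following is a minimal Go sketch, not part of RAIS: `infoStatuses` requests info.json several times in a row and collects the status codes. Because no RAIS instance is assumed to be available, `fakeBrokenServer` is a hypothetical stand-in that only mimics the reported behavior (first request succeeds, later ones return 500); the base URL and the escaped image ID are likewise assumptions modeled on the log lines below.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// infoStatuses requests <base>/<id>/info.json n times in sequence and
// returns the HTTP status code of each response.
func infoStatuses(base, id string, n int) ([]int, error) {
	codes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		resp, err := http.Get(base + "/" + id + "/info.json")
		if err != nil {
			return codes, err
		}
		resp.Body.Close()
		codes = append(codes, resp.StatusCode)
	}
	return codes, nil
}

// fakeBrokenServer is a hypothetical stand-in for an affected server:
// the first request succeeds and every later request returns a 500.
func fakeBrokenServer() *httptest.Server {
	var hits int32
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&hits, 1) > 1 {
			http.Error(w, "simulated failure", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, "{}")
	}))
}

func main() {
	srv := fakeBrokenServer()
	defer srv.Close()

	// Swap srv.URL for the base URL of a real RAIS instance to test it.
	codes, err := infoStatuses(srv.URL, "images%2fjpg%2fsample.jpg", 3)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(codes)
}
```

Against the stand-in this prints `[200 500 500]`, which matches the pattern described in the steps; a fixed server would be expected to return 200 every time.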

Logs:

2021/04/16 07:39:36.256 - rais-server - DEBUG - SchemeMap translated "/images/jpg/sample.jpg" to URL "file:///var/local/images/images/jpg/sample.jpg"
2021/04/16 07:39:36.256 - rais-server - DEBUG - Loading image data from image resource (id: /images/jpg/sample.jpg)
2021/04/16 07:39:36.259 - rais-server - ERROR - Error getting image and/or IIIF Info for "/images/jpg/sample.jpg": 445: 0x7fa544013190 - <nil>
2021/04/16 07:39:36.259 - rais-server - INFO - Request: [172.18.0.2:48168,172.18.0.1] /images%2fjpg%2fsample.jpg/info.json - 500

jechols commented 3 years ago

Thanks for all the details; I'll take a look at this today for sure.

jechols commented 3 years ago

I believe I have figured out the problem, but the solution may make things even slower than they already are for ImageMagick decoding.

It appears that the ImageMagick disk cache (which seems to live in /tmp) is only meant to be used by a single process per image: once that cache has been used by one in-memory resource, it cannot be used by another. So either I force ImageMagick to create a new cache even when operating on the same image (which could eat up disk space under heavy load), or I clean up the image's cache after every read (which means nothing is shared between reading metadata such as image size and doing the actual decoding).

I'll keep looking at options here.

jechols commented 3 years ago

For my own info if I have to continue this work next week: PingImage looks like a much better way to read image data from ImageMagick. Should solve the double-read efficiency problem described above.

jechols commented 3 years ago

Using PingImage and not reusing the ImageMagick struct dramatically reduces problems, but does not eliminate them. The cache stored in /tmp doesn't appear to be thread-safe in any way. Two requests for large images, even when they're two different large images, will fail.

There must be a way to handle this from the ImageMagick APIs, so I'll keep digging.

jechols commented 3 years ago

To get a final fix, I'm going to force ImageMagick requests to be sequential instead of allowing them to be concurrent. This is really horrible, but at this point I've come to the conclusion that the problem is concurrency, not limits. I can verify that setting the disk limit to 10 bytes prevents any large image request from working, while setting it to infinity makes large images work just fine... so long as they're requested sequentially.

What kills me is that I'm certain it's got something to do with the internal way it tries to handle the temp files, because when you run two separate instances of the ImageMagick convert program, they're fine. It's just a problem when two operations are trying to take place concurrently within the same process.
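
Forcing decodes to run one at a time can be sketched with a process-wide mutex. This is only an illustration of the idea, not the actual RAIS change: `serialized`, `decodeMu`, and the fake "decode" (a short sleep) are hypothetical names, and the real code would call into ImageMagick where the sleep is.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// decodeMu guards every call into the (hypothetical) decoder.
var decodeMu sync.Mutex

// serialized runs fn while holding decodeMu, so at most one fn runs at a time
// within this process, no matter how many goroutines call it.
func serialized(fn func()) {
	decodeMu.Lock()
	defer decodeMu.Unlock()
	fn()
}

// maxConcurrent starts n goroutines that each perform a fake decode through
// serialized, and returns the highest number seen running at the same time.
func maxConcurrent(n int) int32 {
	var running, peak int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			serialized(func() {
				cur := atomic.AddInt32(&running, 1)
				// Raise peak to cur if cur is larger.
				for {
					old := atomic.LoadInt32(&peak)
					if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for decode work
				atomic.AddInt32(&running, -1)
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	fmt.Println(maxConcurrent(8)) // prints 1: decodes never overlap
}
```

The cost is the one discussed in this thread: throughput for these decodes no longer scales with concurrency, since every request waits its turn for the lock.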

jechols commented 3 years ago

Unfortunately I won't have time to get a new Docker release ready today. You can pull the latest changes from the develop branch and build a Docker image manually if necessary; otherwise a release won't be available until the middle of next week.

ghost commented 3 years ago

Thanks for all your work on this issue. I'll keep an eye out for the next release.

jechols commented 3 years ago

FYI this should be fixed now. Take a look and let me know!

ghost commented 3 years ago

I tried the new build and it works, thanks. I'm working on a IIIF benchmark POC, so I quickly compared 4.0.0 and 4.1.0 versions and could indeed see a performance degradation with more concurrency.

jechols commented 3 years ago

Yes, that's not surprising, though it certainly is unfortunate.

For a small set of images, setting up an in-memory tile cache will make a tremendous difference, so long as you've got the RAM (see https://github.com/uoregon-libraries/rais-image-server/wiki/Caching). A filesystem-backed tile cache has been considered on and off, but has never made the cut. It would add a lot of complexity to ensure the filesystem doesn't fill up crazy-fast during traffic spikes, it's something of a niche problem, and there are dedicated external caches that do the job better than something added into RAIS.
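
To illustrate why an in-memory tile cache helps with a small image set, here is a minimal LRU cache sketch using only the Go standard library. It is not RAIS's implementation (see the wiki page above for the real configuration); `TileCache`, its methods, and the request-path keys are assumptions made for the example.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
)

// entry is one cached tile together with the key it was stored under.
type entry struct {
	key  string
	data []byte
}

// TileCache is a fixed-capacity, least-recently-used cache of encoded tiles.
type TileCache struct {
	mu       sync.Mutex
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

// NewTileCache returns a cache that holds at most capacity tiles.
func NewTileCache(capacity int) *TileCache {
	return &TileCache{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached tile for key and marks it as recently used.
func (c *TileCache) Get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).data, true
}

// Put stores a tile, evicting the least recently used one when full.
func (c *TileCache) Put(key string, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).data = data
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, data: data})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := NewTileCache(2)
	cache.Put("/a", []byte("tile a"))
	cache.Put("/b", []byte("tile b"))
	cache.Get("/a")                   // "/a" is now the most recently used
	cache.Put("/c", []byte("tile c")) // over capacity: evicts "/b"
	_, ok := cache.Get("/b")
	fmt.Println(ok) // false
}
```

A cache hit skips the decoder entirely, so with serialized ImageMagick decodes each repeated tile request avoids both the decode cost and the wait for the lock. The capacity bound is what keeps memory use predictable.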

If you are looking to benchmark, tiled, multi-resolution JP2 images are really what RAIS is built for. It'll never be amazing for images that have to be fully decoded just to serve a single tile, but it's pretty great for JP2s, thanks to the openjpeg libraries.