2. Avoid conflict between IM temp cleaning and running conversions
While running a batch of tests with non-standard formats, I noticed that some valid images sometimes don't get valid thumbnails generated, and the failing images change across different test runs.
Digging into this issue, I found out that it was caused by concurrency between the "ImageMagick temp cleaner" thread and ongoing conversions (i.e. the cleaning process sometimes deletes files that are still being used by a running conversion, which may fail in that situation).
Running the same test several times, with 15K valid images of non-standard formats, usually between 150 and 200 of them fail during the thumbnail generation process (about 1%). The frequency of these failures is highly dependent on how large the processed images are, their formats, the number of processing threads, etc.
I added a check to the cleaning process, so it inspects the last modified date of each file and only deletes "older" ones. I set a 30-second threshold, which seems to be enough, but it can be adjusted later. EDIT: Increased to 60 seconds after more tests.
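For reference, a minimal sketch of the age check, assuming the cleaner simply iterates over the files in the ImageMagick temp directory (class and method names here are hypothetical, not the actual IPED code):

```java
import java.io.File;

public class ImageMagickTempCleaner {

    // Assumed threshold: files modified in the last 60 seconds are considered in use.
    private static final long MIN_AGE_MILLIS = 60_000;

    // Delete only temp files that have not been modified recently,
    // so files still being written by a running conversion are left alone.
    public static void cleanOldTempFiles(File tempDir) {
        File[] files = tempDir.listFiles();
        if (files == null) {
            return;
        }
        long now = System.currentTimeMillis();
        for (File f : files) {
            if (now - f.lastModified() > MIN_AGE_MILLIS) {
                f.delete();
            }
        }
    }
}
```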
3. Use image viewer instead of LibreOffice viewer for XBM, SVG and WMF
These three formats were being handled by ImageMagick to generate thumbs and viewed using LibreOffice viewer:
image/svg+xml
image/x-portable-bitmap
image/wmf
If the new "DocThumbs" feature is enabled, their thumbnails would be generated using LibreOffice, which is several times slower than the usual process. This was not intended, just a side effect of copying all mime types handled by LibreOffice viewer to DocThumbTask.
Running tests with a very large set, I noticed that both thumbnail generation and image visualization work fine (and faster) using ImageMagick, so I propose changing this behavior for these formats.
There are a few visual differences in a small portion of the files (less than 5% of the samples I have here), but the results are still acceptable (i.e. the file content is readable, just rendered in a slightly different way).
There are a few issues related to resolution (affecting SVG and WMF), addressed in the following items.
4. IM conversion density
Currently a fixed value (96) is used for the -density parameter in external image conversion.
This affects only vector images (EMF, WMF, SVG).
I noticed that some images that contain text (specifically EMF files, which sometimes contain important forensic evidence related to printing in Windows) are rendered poorly by the image viewer.
Increasing this value produces better images in many cases.
So I propose adding an explicit boolean parameter to the external image conversion, telling whether a high-resolution image is wanted. In such cases it would use a higher density (300). Thumbnail generation would keep using the current value. The image viewer and the OCR (#515) would request a better (high-resolution) image.
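A rough sketch of what the proposed parameter could look like (the names are illustrative, not the actual IPED API):

```java
// Hypothetical helper: callers that need a high-resolution image (viewer, OCR)
// pass highResolution = true; thumbnail generation keeps the current density.
public class DensitySelector {

    private static final int LOW_RES_DENSITY = 96;   // current fixed value
    private static final int HIGH_RES_DENSITY = 300; // proposed for viewer/OCR

    public static int densityFor(boolean highResolution) {
        return highResolution ? HIGH_RES_DENSITY : LOW_RES_DENSITY;
    }

    // Usage when building the ImageMagick arguments, e.g.:
    //   args.add("-density");
    //   args.add(String.valueOf(DensitySelector.densityFor(highRes)));
}
```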
5. Improve scaling for vector images
Vector graphics formats handled by IM (e.g. EMF) are sometimes rendered poorly in the image viewer.
Part of the problem is the density described above.
But I noticed that for many images the problem also comes from the dimensions returned by IM, which are later used in the -sample conversion parameter.
For vector images, the returned values are meaningless (when interpreted as pixels) most of the time.
An example of an EMF image, as rendered in the viewer, using version 3.18.6:
The proposed solution is to ignore the reported dimensions for vector images when converting with ImageMagick, and use the "maximum dimension" (a constant in the image viewer) instead.
The same EMF file, as rendered in the viewer, after this modification and the density mentioned before:
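A minimal sketch of this dimension handling for vector images, assuming a helper that picks the value passed to the -sample geometry (names and structure are hypothetical):

```java
public class VectorSampleDimension {

    // For vector formats, the dimensions reported by IM are not reliable pixel
    // values, so fall back to the viewer's "maximum dimension" constant.
    public static int sampleDimension(boolean isVectorFormat, int reportedDimension,
            int maxViewerDimension) {
        if (isVectorFormat || reportedDimension <= 0) {
            return maxViewerDimension;
        }
        // Raster images keep using the reported dimension, capped at the viewer limit.
        return Math.min(reportedDimension, maxViewerDimension);
    }
}
```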
6. Increase viewer maximum dimension
This is a minor detail, but I noticed that the current limit of 2000 pixels, used to limit the dimension of loaded images both by the "internal" (ImageIO) and the "external" (ImageMagick) processes, may be too small in some cases (i.e. even zooming in, the resolution is not enough to read small characters in the image viewer, although they are visible in the original file). This is noticeable in specific images that contain small text, usually with portrait orientation.
I propose a small change, increasing this "maximum dimension" from 2000 to 2400 pixels. Higher values may increase the time to convert (if necessary) and load images, as well as the memory used. It would add 44% more pixels (2400² / 2000² = 1.44), which already helped with reading small text in some test images I have here.
I noticed that the recently added "Find Similar Faces" feature uses image dimensions to calculate the correct position of the detected face to be displayed in the image viewer. After making these changes, I will test it to check if the coordinate calculation is still working as expected.
Great @tc-wleite, this was all a really careful investigation, thank you very much!
I think 2 is clearly a bug and could have a separate issue for it. The others are great improvements for me and I agree with all of them.
About 4, the current dpi for PDF-to-image conversion for OCR is 250. A long time ago I did some tests and thought 250 was a good compromise between accuracy and speed, and I was aware of the 300 dpi recommended by Tesseract. I think we should use the same value here and there, 250 or 300, not sure what is better today.
Thanks @lfcnassif!
I thought about creating a separate issue, but as this one is already an "extra issue" that came from #515, I ended up putting everything here, although the second item is not directly related to the others, as you noticed.
About 4, the current dpi for PDF-to-image conversion for OCR is 250. A long time ago I did some tests and thought 250 was a good compromise between accuracy and speed, and I was aware of the 300 dpi recommended by Tesseract. I think we should use the same value here and there, 250 or 300, not sure what is better today.
I made some tests with the visualizer (still not with the OCR), and 300 was clearly better than 200. I will check if 250 is already enough in most cases, as a higher value also means a somewhat slower process.
After implementing these changes, I found two other minor problems. As this issue already mixes too many things (sorry about that), I will create separate issues for them.
Added other minor adjustments to the external conversion process (sketched below):
- Use -resize together with -sample (currently only the latter is used). This increased the quality of many externally converted images, with similar processing time.
- Use 480x480> (adding a "greater than" at the end), so the image will only be reduced to 480, not enlarged (in the case of small images).
- Increase MAGICK_AREA_LIMIT from 10MP to 32MP.
I noticed some while ago this behavior of small thumbs being enlarged to the max thumbnail size (your second point above), but I am not sure if it is happening to the external conversion only...
I noticed that the recently added "Find Similar Faces" feature uses image dimensions to calculate the correct position of the detected face to be displayed in the image viewer. After making these changes, I will test it to check if the coordinate calculation is still working as expected.
I managed to configure Python/JEP and ran "Find Similar Faces" here. Very cool feature!!! I got impressive results in a smaller test case.
I just had to make a small adjustment in the ImageViewer, as after the changes described here the ratio between the original image and the one loaded/displayed in the viewer is sometimes not an integer, as it was before (the old "sampling" variable).
I managed to configure Python/JEP and ran "Find Similar Faces" here.
Great @tc-wleite! If you had to configure something to make Python/JEP work differently from the steps documented in the manual, please let me know! Or feel free to complement them, I think the wiki is publicly editable.
The "Find Similar Faces" feature could actually be changed to work without JEP, as the hard work is being done in external Python processes; this is a possible improvement to ease its usage. But maybe we could make JEP easier for users to install/configure, thinking about other future deep learning modules, not sure if it is possible...
Great @tc-wleite! If you had to configure something to make Python/JEP work differently from the steps documented in the manual, please let me know! Or feel free to complement them, I think the wiki is publicly editable.
Most of the information was already clear in the manual. There were a couple of steps I had some trouble with, but it may be something that depends on my environment. Anyway, I will try to add one or two remarks that can be useful to other people.
@lfcnassif, I just added a paragraph to Python/JEP section in the Wiki with a couple of observations. Not sure if I picked the best place, formatting and wording, so feel free to make any adjustments.
Thanks @tc-wleite, maybe I will just omit your python version 3.9 because I think it is not compatible with tensorflow 2.x, needed by YahooNSFWNudityTask, so users will not be encouraged to use it.
Oh, that is important! I did install tensorflow, but don't remember which version. But I didn't test YahooNSFWNudityTask. I will run a quick test to confirm it is not working with Python 3.9.5, and will update my comment in the Wiki (removing the reference to Python 3.9.5).
Using Python 3.9.5, Pillow 8.2.0, Keras 2.4.3 and TensorFlow 2.5.0 (the latest versions of these modules), YahooNSFWNudityTask did work in a small test (~200 images) here (it produced meaningful scores). But it was kind of slow and there were many errors in the execution log, which were probably related to using these newer versions. For now, I removed the reference to Python 3.9.5 from the Wiki.
Thanks Wladimir. It is somewhat slow indeed. Now that I'm a bit more experienced with python, I plan to try improving #357
Closed by #583, thanks @tc-wleite!
- Increase MAGICK_AREA_LIMIT from 10MP to 32MP.

Hey @tc-wleite, just to confirm: you saw this relates to width x height, right? If 4 bytes are used per pixel, this could result in 128MB per image in memory. Just thinking about many images being processed at the same time... did you get good performance improvements with this in your tests?
Yes, pixels. It is an upper limit, so not all conversions would necessarily require this amount of memory, right? And the input image needs to be large, as now we are using ">" (smaller images won't be enlarged). But in the worst case (many large images that require external conversion, being processed at the same time) I guess it would be numThreads x 128 MB, right? That still doesn't seem like much, as the memory is used in an external process (which starts and finishes after a few seconds).
For thumbnail generation, there shouldn't be any difference, as the sample parameter should greatly reduce the number of pixels loaded. For the viewer (and for the OCR when #515 is implemented), this can make quite a difference in some cases.
EMF (and other vector formats) seem to be more sensitive to this parameter. As they are basically commands to draw over a canvas, if not all the pixels fit in memory, IM needs to keep swapping between memory and temporary files.
I found a case (an EMF file) where the conversion for the viewer (higher resolution) takes 46 seconds with 10 MP, and only 3 seconds with 12 MP. These measurements were made using the command-line option "-bench", and without anything else running (no disk concurrency, which would normally happen).
These pathological cases are not common at all. I increased the limit to 32 MP to be safe, and because OCR will probably require a bit more pixels than viewing does. For the viewer the new dimension limit is 2400. An A4 page, with 250 dpi, would have ~2925 pixels in the largest dimension. I will run tests, but my initial idea is to use this number (rounded up to 3000) as the maximum dimension when converting to OCR.
My conclusion: for most images, 10 MP is fine, but for some of them, increasing this parameter can make a huge difference. A smaller increase (to 16 MP or 24 MP) would probably cover most of the cases, so if you are concerned about overall memory usage, we can move to an intermediate value.
Another detail I forgot to mention: external conversion for thumbnail generation is not affected by this parameter; I noticed this behavior while visualizing some images. In tests with #515, this parameter should make more of a difference, so it will be possible to measure the performance with different values. I will update here when I have actual numbers (it will take a few days, though).
Yes, pixels. It is an upper limit, so not all conversions would necessarily require this amount of memory, right?
I think so.
But in the worst case (many large images that require external conversion, being processed at the same time) I guess it would be numThreads x 128 MB, right?
Right.
For thumbnail generation, there shouldn't be any difference, as the sample parameter should greatly reduce the number of pixels loaded.
Do you know if the -sample parameter works for thumbnail generation of large vector images?
I found a case (an EMF file) where the conversion for the viewer (higher resolution) takes 46 seconds with 10 MP, and only 3 seconds with 12 MP.
Wow, that's a huge difference!
An A4 page, with 250 dpi, would have ~2925 pixels in the largest dimension. I will run tests, but my initial idea is to use this number (rounded up to 3000) as the maximum dimension when converting to OCR.
So the image for OCR would initially use at most 9MP in memory, right? And bigger images would be temporarily cached to disk, as I understood happened with the old value.
My conclusion: for most images, 10 MP is fine, but for some of them, increasing this parameter can make a huge difference. A smaller increase (to 16 MP or 24 MP) would probably cover most of the cases, so if you are concerned about overall memory usage, we can move to an intermediate value.
I just thought about externalizing this parameter to the ImageThumbConfig.txt file, so users with very tight memory limits could decrease it. Decreasing image conversion time from 46s to 3s, even in corner cases, is a big difference! But actually the new 32MP value here is much better than the change we did in #81, which I think has no upper bound for memory usage; a mixed approach like MAGICK_AREA_LIMIT would be better for the internal conversion too...
Do you know if the -sample parameter works for thumbnail generation of large vector images?
It does work, but I am not sure what happens internally.
An example of time measurements using -bench 10 (average of 10 executions) on a large EMF file:
-density 96 -sample "320x320>" -resize "160x160>" : 330 ms
-density 96 -resize "160x160>" : 392 ms
-density 250 -sample "320x320>" -resize "160x160>" : 651 ms
-density 250 -resize "160x160>" : 1072 ms
It is clear that without -sample it is slower, but -density is also very important.
So the image for OCR would initially use at most 9MP in memory, right? And bigger images would be temporarily cached to disk, as I understood happened with the old value.
There is another detail with using -sample directly to the desired dimensions: it produces poor quality images in some cases. In the internal conversion, we already use a factor of 3 for sampling before resizing. For the external conversion there was no such factor; however, thumbnail generation requested a larger image (3x the target dimension).
I changed this. Now both thumbnail generation and the viewer request the actual dimension for external conversion, and I added a factor of 2 for sampling before resizing. That produced much better images for the viewer (as in the example above in this thread).
So images for the viewer now use -density 250 -sample "4800x4800>" -resize "2400x2400>". For OCR (using 3000 as the target dimension) it could reach 36MP for a large square image.
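In code form, the geometry derivation just described might look roughly like this (a sketch; the names are not the actual IPED ones):

```java
// Build the -sample / -resize geometries from the requested target dimension.
// The '>' suffix prevents enlarging images that are already smaller.
public class GeometryExample {

    private static final int SAMPLE_FACTOR = 2;

    public static String[] geometries(int targetDimension) {
        int sampleDim = targetDimension * SAMPLE_FACTOR;
        String sample = sampleDim + "x" + sampleDim + ">";
        String resize = targetDimension + "x" + targetDimension + ">";
        return new String[] { sample, resize };
    }

    public static void main(String[] args) {
        // Viewer request: target 2400 -> -sample "4800x4800>" -resize "2400x2400>"
        String[] g = geometries(2400);
        System.out.println("-density 250 -sample \"" + g[0] + "\" -resize \"" + g[1] + "\"");
    }
}
```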
I just thought about externalizing this parameter to the ImageThumbConfig.txt file, so users with very tight memory limits could decrease it. Decreasing image conversion time from 46s to 3s, even in corner cases, is a big difference! But actually the new 32MP value here is much better than the change we did in #81, which I think has no upper bound for memory usage; a mixed approach like MAGICK_AREA_LIMIT would be better for the internal conversion too...
You are right, externalizing seems a great idea. As I said, I want to come up with a better default value after tests with OCR, but it is better to allow changing it. Even the densities could be externalized. Something like:
lowResDensity = 96
highResDensity = 250
What do you think?
One last detail: I still need to test it, but I am considering using a lower density value for the low-resolution case. The performance gain was very small for the images I tested, but it could be meaningful in corner cases. For thumbnails, 96 is in fact more than enough.
Great, perfect! Thank you @tc-wleite, now I understood.
You are right, externalizing seems a great idea. As I said, I want to come up with a better default value after tests with OCR, but it is better to allow changing it. Even the densities could be externalized. Something like:
lowResDensity = 96
highResDensity = 250
What do you think?
I think externalizing the memory limit and densities is great! This makes sense because the density for the PDF-to-image conversion before OCR is already externalized.
One last detail: I still need to test it, but I am considering using a lower density value for the low-resolution case. The performance gain was very small for the images I tested, but it could be meaningful in corner cases. For thumbnails, 96 is in fact more than enough.
I tried lower density values (48 and 72); the performance gain for vector images (in a large set) was small (~5% for 48, ~2% for 72), and the thumbnail quality was much worse, so I will keep it as it was (96).
I ran a larger test (~50K images and PDFs, from different cases and formats, most of them containing text), including OCR (after implementing #515), and using different values of MAGICK_AREA_LIMIT.
MAGICK_AREA_LIMIT (MP)     10     16     32     64
--------------------------------------------------
OCR Total Time (s)       2705   2579   2560   2572
Image Thumb Time (s)      550    529    527    577
Successful Thumbs       50609  50649  50662  50657
Thumb Timeouts             98     58     45     50
OCR Timeouts                5      4      4      4
The overall difference was small, but going from 10MP (the previous setting) to 32MP, more images had their thumbs generated, and the performance of the thumb generation task and OCR (part of the images required external conversion) was slightly better. So I would keep 32MP as the default value for now.
Great @tc-wleite! As you have already collected a good dataset and run those tests, is it possible to run one last test using 300 dpi for the image/PDF conversion before OCR? Did the results above use 250 dpi?
I used 3000 pixels as the maximum dimension for the converted image, which is approximately an A4 page at 250 dpi. This is a parameter in the OCR configuration. It only reduces images while converting, so it won't enlarge smaller images. For vector images, I used 250 as the density parameter. I didn't use an explicit DPI calculation because many images have incomplete or unreliable information.
I will repeat this same test, using 300 as the density for OCR conversion, and 3500 pixels as the maximum dimension (~a page at 300 dpi). I expect that "OCR Total Time" will grow, as about 1/3 of these test images require external conversion, and the time spent by Tesseract should increase too.
Thanks, Wladimir! It would be very interesting to compare the running times and the number of common-word hits.
With 300, the OCR total time went from 2560 s to 2814 s. Below is a comparison of the number of hits for the words you mentioned in #206:
Word      250     300
------  -----   -----
para     5069    5031
com      5961    5954
de       9796    9808
em       5569    5476
até      1504    1493
você      686     686
nós      1570    1569
não      3101    3091
you       197     198
they       17      17
are       145     142
to       2079    2061
of       7241    7238
------  -----   -----
TOTAL   42935   42764
Thanks @tc-wleite! Interesting, so let's keep 250dpi/3000pixels.
While working on #515, I noticed a few things that could be improved in thumbnail generation and in the image viewer for some image formats.
1. External conversion geometry
External conversion (ImageMagick) currently uses the -sample parameter to limit the image size, both when generating thumbnails and for viewing images that are not handled by the "internal" (ImageIO) process. This parameter receives a "geometry". Currently just a single value is used, which means the width. As it scales the image keeping its aspect ratio, this process can generate images larger than necessary for portrait input images.
My suggestion is to limit both dimensions, as is done in the "internal" process. This is easily supported by ImageMagick (and GraphicsMagick), just by using -sample 480x480 instead of -sample 480.
An example of a pathological case (a 2 x 200 pixels image), which would generate a 480 x 48000 pixels image during the thumbnail generation process, if it requires external conversion:
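To make the difference concrete, here is a small illustrative sketch that computes the resulting size of the 2 x 200 example under a width-only geometry versus a WxH bounding box (plain arithmetic mirroring ImageMagick's geometry behavior):

```java
public class GeometryComparison {

    // Width-only geometry ("480"): scale so the width becomes the given value.
    static int[] widthOnly(int w, int h, int targetWidth) {
        double scale = (double) targetWidth / w;
        return new int[] { targetWidth, (int) Math.round(h * scale) };
    }

    // Bounding-box geometry ("480x480"): scale to fit inside the box, keeping aspect ratio.
    static int[] boundingBox(int w, int h, int maxW, int maxH) {
        double scale = Math.min((double) maxW / w, (double) maxH / h);
        return new int[] { (int) Math.round(w * scale), (int) Math.round(h * scale) };
    }

    public static void main(String[] args) {
        int[] a = widthOnly(2, 200, 480);        // -> 480 x 48000 (the pathological case)
        int[] b = boundingBox(2, 200, 480, 480); // -> 5 x 480 (rounded)
        System.out.println(a[0] + "x" + a[1] + " vs " + b[0] + "x" + b[1]);
    }
}
```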