2. Avoid conflict between IM temp cleaning and running conversions
While running a batch of tests with non-standard formats, I noticed that some valid images sometimes don't get valid thumbnails generated, and the failing images change across different test runs.
Digging into this issue, I found out that it was caused by concurrency between the "ImageMagick temp cleaner" thread and ongoing conversions (i.e. the cleaning process sometimes deletes files that are still being used by a running conversion, which may fail in that situation).
Running the same test several times, with 15K valid images of non-standard formats, usually between 150 and 200 of them fail during the thumbnail generation process (about 1%). The frequency of these failures is highly dependent on how large the processed images are, their formats, the number of processing threads, etc.
I added a check to the cleaning process, so it inspects the last modified date of each file and only deletes "older" ones. I set a 30-second threshold, which seems to be enough, but it can be adjusted later. EDIT: Increased to 60 seconds after more tests.
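For reference, a minimal sketch of the age check, assuming the cleaner simply iterates over the files in the ImageMagick temp directory (class and method names here are hypothetical, not the actual IPED code):

```java
import java.io.File;

public class ImageMagickTempCleaner {

    // Assumed threshold: files modified in the last 60 seconds are considered in use.
    private static final long MIN_AGE_MILLIS = 60_000;

    // Delete only temp files that have not been modified recently,
    // so files still being written by a running conversion are left alone.
    public static void cleanOldTempFiles(File tempDir) {
        File[] files = tempDir.listFiles();
        if (files == null) {
            return;
        }
        long now = System.currentTimeMillis();
        for (File f : files) {
            if (now - f.lastModified() > MIN_AGE_MILLIS) {
                f.delete();
            }
        }
    }
}
```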
3. Use image viewer instead of LibreOffice viewer for XBM, SVG and WMF
These three formats were being handled by ImageMagick to generate thumbs and viewed using LibreOffice viewer:
image/svg+xml
image/x-portable-bitmap
image/wmf
If the new "DocThumbs" feature is enabled, their thumbnails would be generated using LibreOffice, which is several times slower than the usual process. This was not intended, just a side effect of copying all mime types handled by LibreOffice viewer to DocThumbTask.
Running tests with a very large set, I noticed that both thumbnail generation and image visualization work fine (and faster) using ImageMagick, so I propose changing this behavior for these formats.
There are a few visual differences in a small portion of the files (less than 5% of the samples I have here), but the results are still acceptable (i.e. the file content is readable, just rendered in a slightly different way).
There are a few issues related to resolution (affecting SVG and WMF), addressed in the following items.
4. IM conversion density
Currently a fixed value (96) is used for the -density parameter in external image conversion.
This affects only vector images (EMF, WMF, SVG).
I noticed that some images that contain text (specifically EMF files, which sometimes contain important forensic evidence related to printing in Windows) are rendered poorly by the image viewer.
Increasing this value produces better images in many cases.
So I propose adding an explicit boolean parameter to the external image conversion, telling whether a high-resolution image is wanted. In such cases it would use a higher density (300). Thumbnail generation would keep using the current value. The image viewer and the OCR (#515) would request a better (high-resolution) image.
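A rough sketch of what the proposed parameter could look like (the names are illustrative, not the actual IPED API):

```java
// Hypothetical helper: callers that need a high-resolution image (viewer, OCR)
// pass highResolution = true; thumbnail generation keeps the current density.
public class DensitySelector {

    private static final int LOW_RES_DENSITY = 96;   // current fixed value
    private static final int HIGH_RES_DENSITY = 300; // proposed for viewer/OCR

    public static int densityFor(boolean highResolution) {
        return highResolution ? HIGH_RES_DENSITY : LOW_RES_DENSITY;
    }

    // Usage when building the ImageMagick arguments, e.g.:
    //   args.add("-density");
    //   args.add(String.valueOf(DensitySelector.densityFor(highRes)));
}
```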
5. Improve scaling for vector images
Vector graphics formats handled by IM (e.g. EMF) are sometimes rendered poorly in the image viewer.
Part of the problem is the density described above.
But I noticed that for many images the problem also comes from the dimensions returned by IM, which are later used in the -sample conversion parameter.
For vector images, the returned values are meaningless (when interpreted as pixels) most of the time.
An example of an EMF image, as rendered in the viewer, using version 3.18.6:
The proposed solution is to ignore the reported dimensions for vector images when converting with ImageMagick, and use the "maximum dimension" (a constant in the image viewer) instead.
The same EMF file, as rendered in the viewer, after this modification and the density mentioned before:
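A minimal sketch of this dimension handling for vector images, assuming a helper that picks the value passed to the -sample geometry (names and structure are hypothetical):

```java
public class VectorSampleDimension {

    // For vector formats, the dimensions reported by IM are not reliable pixel
    // values, so fall back to the viewer's "maximum dimension" constant.
    public static int sampleDimension(boolean isVectorFormat, int reportedDimension,
            int maxViewerDimension) {
        if (isVectorFormat || reportedDimension <= 0) {
            return maxViewerDimension;
        }
        // Raster images keep using the reported dimension, capped at the viewer limit.
        return Math.min(reportedDimension, maxViewerDimension);
    }
}
```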
6. Increase viewer maximum dimension
This is a minor detail, but I noticed that the current limit of 2000 pixels, used to limit the dimension of loaded images both by the "internal" (ImageIO) and the "external" (ImageMagick) processes, may be too small in some cases (i.e. even zooming in, the resolution is not enough to read small characters in the image viewer, although they are visible in the original file). This is noticeable in specific images that contain small text, usually with portrait orientation.
I propose a small change, increasing this "maximum dimension" from 2000 to 2400 pixels. Higher values may increase the time to convert (if necessary) and load images, as well as the memory used. It would add 44% more pixels (2400² / 2000² = 1.44), which already helped with reading small text in some test images I have here.
I noticed that the recently added "Find Similar Faces" feature uses image dimensions to calculate the correct position of the detected face to be displayed in the image viewer. After making these changes, I will test it to check if the coordinate calculation is still working as expected.
Great @tc-wleite, this was all a really careful investigation, thank you very much!
I think 2 is clearly a bug and could have a separate issue for it. The others are great improvements for me and I agree with all of them.
About 4, the current dpi for PDF-to-image conversion for OCR is 250. A long time ago I did some tests and thought 250 was a good compromise between accuracy and speed, and I was aware of the 300 dpi recommended by Tesseract. I think we should use the same value here and there, 250 or 300, not sure what is better today.
Thanks @lfcnassif!
I thought about creating a separate issue, but as this one is already an "extra issue" that came from #515, I ended up putting everything here, although the second item is not directly related to the others, as you noticed.
About 4, the current dpi for PDF-to-image conversion for OCR is 250. A long time ago I did some tests and thought 250 was a good compromise between accuracy and speed, and I was aware of the 300 dpi recommended by Tesseract. I think we should use the same value here and there, 250 or 300, not sure what is better today.
I made some tests with the visualizer (still not with the OCR), and 300 was clearly better than 200. I will check if 250 is already enough in most cases, as a higher value also means a somewhat slower process.
After implementing these changes, I found two other minor problems. As this issue already mixes too many things (sorry about that), I will create separate issues for them.
Added other minor adjustments to the external conversion process (sketched below):
- Use -resize together with -sample (currently only the latter is used). This increased the quality of many externally converted images, with similar processing time.
- Use 480x480> (adding a "greater than" at the end), so the image will only be reduced to 480, not enlarged (in the case of small images).
- Increase MAGICK_AREA_LIMIT from 10MP to 32MP.
I noticed some while ago this behavior of small thumbs being enlarged to the max thumbnail size (your second point above), but I am not sure if it is happening to the external conversion only...
I noticed that the recently added "Find Similar Faces" feature uses image dimensions to calculate the correct position of the detected face to be displayed in the image viewer. After making these changes, I will test it to check if the coordinate calculation is still working as expected.
I managed to configure Python/JEP and ran "Find Similar Faces" here. Very cool feature!!! I got impressive results in a smaller test case.
I just had to make a small adjustment in the ImageViewer, as after the changes described here the ratio between the original image and the one loaded/displayed in the viewer is sometimes not an integer, as it was before (the old "sampling" variable).
I managed to configure Python/JEP and ran "Find Similar Faces" here.
Great @tc-wleite! If you had to configure something to make Python/JEP work differently from the steps documented in the manual, please let me know! Or feel free to complement them, I think the wiki is publicly editable.
The "Find Similar Faces" feature could actually be changed to work without JEP, as the hard work is being done in external Python processes; this is a possible improvement to ease its usage. But maybe we could make JEP easier for users to install/configure, thinking about other future deep learning modules, not sure if it is possible...
Great @tc-wleite! If you had to configure something to make Python/JEP work differently from the steps documented in the manual, please let me know! Or feel free to complement them, I think the wiki is publicly editable.
Most of the information was already clear in the manual. There were a couple of steps I had some trouble with, but it may be something that depends on my environment. Anyway, I will try to add one or two remarks that can be useful to other people.
@lfcnassif, I just added a paragraph to Python/JEP section in the Wiki with a couple of observations. Not sure if I picked the best place, formatting and wording, so feel free to make any adjustments.
Thanks @tc-wleite, maybe I will just omit your python version 3.9 because I think it is not compatible with tensorflow 2.x, needed by YahooNSFWNudityTask, so users will not be encouraged to use it.
Oh, that is important! I did install tensorflow, but don't remember which version. But I didn't test YahooNSFWNudityTask. I will run a quick test to confirm it is not working with Python 3.9.5, and will update my comment in the Wiki (removing the reference to Python 3.9.5).
Using Python 3.9.5, Pillow 8.2.0, Keras 2.4.3 and TensorFlow 2.5.0 (the latest versions of these modules), YahooNSFWNudityTask did work in a small test (~200 images) here (it produced meaningful scores). But it was kind of slow and there were many errors in the execution log, which were probably related to using these newer versions. For now, I removed the reference to Python 3.9.5 from the Wiki.
Thanks Wladimir. It is somewhat slow indeed. Now that I'm a bit more experienced with python, I plan to try improving #357
Closed by #583, thanks @tc-wleite!
- Increase MAGICK_AREA_LIMIT from 10MP to 32MP.

Hey @tc-wleite, just to confirm: you saw this relates to width x height, right? If 4 bytes are used per pixel, this could result in 128MB per image in memory. Just thinking about many images being processed at the same time... did you get good performance improvements with this in your tests?
Yes, pixels. It is an upper limit, so not all conversions would necessarily require this amount of memory, right? And the input image needs to be large, as now we are using ">" (smaller images won't be enlarged). But in the worst case (many large images that require external conversion, being processed at the same time) I guess it would be numThreads x 128 MB, right? That still doesn't seem like much, as the memory is used in an external process (which starts and finishes after a few seconds).
For thumbnail generation, there shouldn't be any difference, as the sample parameter should greatly reduce the number of pixels loaded. For the viewer (and for the OCR when #515 is implemented), this can make quite a difference in some cases.
EMF (and other vector formats) seem to be more sensitive to this parameter. As they are basically commands to draw over a canvas, if not all the pixels fit in memory, IM needs to keep swapping between memory and temporary files.
I found a case (an EMF file) where the conversion for the viewer (higher resolution) takes 46 seconds with 10 MP, and only 3 seconds with 12 MP. These measurements were made using the command-line option "-bench", and without anything else running (no disk concurrency, which would normally happen).
These pathological cases are not common at all. I increased the limit to 32 MP to be safe, and because OCR will probably require a bit more pixels than viewing does. For the viewer the new dimension limit is 2400. An A4 page, with 250 dpi, would have ~2925 pixels in the largest dimension. I will run tests, but my initial idea is to use this number (rounded up to 3000) as the maximum dimension when converting to OCR.
My conclusion: for most images, 10 MP is fine, but for some of them, increasing this parameter can make a huge difference. A smaller increase (to 16 MP or 24 MP) would probably cover most of the cases, so if you are concerned about overall memory usage, we can move to an intermediate value.
Another detail I forgot to mention: external conversion for thumbnail generation is not affected by this parameter; I noticed this behavior while visualizing some images. In tests with #515, this parameter should make more of a difference, so it will be possible to measure the performance with different values. I will update here when I have actual numbers (it will take a few days, though).
Yes, pixels. It is an upper limit, so not all conversions would necessarily require this amount of memory, right?
I think so.
But in the worst case (many large images that require external conversion, being processed at the same time) I guess it would be numThreads x 128 MB, right?
Right.
For thumbnail generation, there shouldn't be any difference, as the sample parameter should greatly reduce the number of pixels loaded.
Do you know if the -sample parameter works for thumbnail generation of large vector images?
I found a case (an EMF file) where the conversion for the viewer (higher resolution) takes 46 seconds with 10 MP, and only 3 seconds with 12 MP.
Wow, that's a huge difference!
An A4 page, with 250 dpi, would have ~2925 pixels in the largest dimension. I will run tests, but my initial idea is to use this number (rounded up to 3000) as the maximum dimension when converting to OCR.
So the image for OCR would initially use at most 9MP in memory, right? And bigger images would be temporarily cached to disk, as I understood happened with the old value.
My conclusion: for most images, 10 MP is fine, but for some of them, increasing this parameter can make a huge difference. A smaller increase (to 16 MP or 24 MP) would probably cover most of the cases, so if you are concerned about overall memory usage, we can move to an intermediate value.
I just thought about externalizing this parameter to the ImageThumbConfig.txt file, so users with very tight memory limits could decrease it. Decreasing image conversion time from 46s to 3s, even in corner cases, is a big difference! But actually the new 32MP value here is much better than the change we did in #81, which I think has no upper bound for memory usage; a mixed approach like MAGICK_AREA_LIMIT would be better for the internal conversion too...
Do you know if the -sample parameter works for thumbnail generation of large vector images?
It does work, but I am not sure what happens internally.
An example of time measurements using -bench 10 (average of 10 executions) on a large EMF file:
-density 96 -sample "320x320>" -resize "160x160>" : 330 ms
-density 96 -resize "160x160>" : 392 ms
-density 250 -sample "320x320>" -resize "160x160>" : 651 ms
-density 250 -resize "160x160>" : 1072 ms
It is clear that without -sample it is slower, but -density is also very important.
So the image for OCR would initially use at most 9MP in memory, right? And bigger images would be temporarily cached to disk, as I understood happened with the old value.
There is another detail with using -sample directly to the desired dimensions: it produces poor quality images in some cases. In the internal conversion, we already use a factor of 3 for sampling before resizing. For the external conversion there was no such factor; however, thumbnail generation requested a larger image (3x the target dimension).
I changed this. Now both thumbnail generation and the viewer request the actual dimension for external conversion, and I added a factor of 2 for sampling before resizing. That produced much better images for the viewer (as in the example above in this thread).
So images for the viewer now use -density 250 -sample "4800x4800>" -resize "2400x2400>". For OCR (using 3000 as the target dimension) it could reach 36MP for a large square image.
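In code form, the geometry derivation just described might look roughly like this (a sketch; the names are not the actual IPED ones):

```java
// Build the -sample / -resize geometries from the requested target dimension.
// The '>' suffix prevents enlarging images that are already smaller.
public class GeometryExample {

    private static final int SAMPLE_FACTOR = 2;

    public static String[] geometries(int targetDimension) {
        int sampleDim = targetDimension * SAMPLE_FACTOR;
        String sample = sampleDim + "x" + sampleDim + ">";
        String resize = targetDimension + "x" + targetDimension + ">";
        return new String[] { sample, resize };
    }

    public static void main(String[] args) {
        // Viewer request: target 2400 -> -sample "4800x4800>" -resize "2400x2400>"
        String[] g = geometries(2400);
        System.out.println("-density 250 -sample \"" + g[0] + "\" -resize \"" + g[1] + "\"");
    }
}
```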
I just thought about externalizing this parameter to the ImageThumbConfig.txt file, so users with very tight memory limits could decrease it. Decreasing image conversion time from 46s to 3s, even in corner cases, is a big difference! But actually the new 32MP value here is much better than the change we did in #81, which I think has no upper bound for memory usage; a mixed approach like MAGICK_AREA_LIMIT would be better for the internal conversion too...
You are right, externalizing seems a great idea. As I said, I want to come up with a better default value after tests with OCR, but it is better to allow changing it. Even the densities could be externalized. Something like:
lowResDensity = 96
highResDensity = 250
What do you think?
One last detail: I still need to test it, but I am considering using a lower density value for the low-resolution case. The performance gain was very small for the images I tested, but it could be meaningful in corner cases. For thumbnails, 96 is in fact more than enough.
Great, perfect! Thank you @tc-wleite, now I understood.
You are right, externalizing seems a great idea. As I said, I want to come up with a better default value after tests with OCR, but it is better to allow changing it. Even the densities could be externalized. Something like:
lowResDensity = 96
highResDensity = 250
What do you think?
I think externalizing the memory limit and densities is great! This makes sense because the density for the PDF-to-image conversion before OCR is already externalized.
One last detail: I still need to test it, but I am considering using a lower density value for the low-resolution case. The performance gain was very small for the images I tested, but it could be meaningful in corner cases. For thumbnails, 96 is in fact more than enough.
I tried lower density values (48 and 72); the performance gain for vector images (in a large set) was small (~5% for 48, ~2% for 72), and the thumbnail quality was much worse, so I will keep it as it was (96).
I ran a larger test (~50K images and PDFs, from different cases and formats, most of them containing text), including OCR (after implementing #515), and using different values of MAGICK_AREA_LIMIT.
MAGICK_AREA_LIMIT (MP)     10     16     32     64
--------------------------------------------------
OCR Total Time (s)       2705   2579   2560   2572
Image Thumb Time (s)      550    529    527    577
Successful Thumbs       50609  50649  50662  50657
Thumb Timeouts             98     58     45     50
OCR Timeouts                5      4      4      4
The overall difference was small, but going from 10MP (the previous setting) to 32MP, more images had their thumbs generated, and the performance of the thumb generation task and OCR (part of the images required external conversion) was slightly better. So I would keep 32MP as the default value for now.
Great @tc-wleite! As you have already collected a good dataset and run those tests, is it possible to run one last test using 300 dpi for the image/PDF conversion before OCR? Did the results above use 250 dpi?
I used 3000 pixels as the maximum dimension for the converted image, which is approximately an A4 page at 250 dpi. This is a parameter in the OCR configuration. It only reduces images while converting, so it won't enlarge smaller images. For vector images, I used 250 as the density parameter. I didn't use an explicit DPI calculation because many images have incomplete or unreliable information.
I will repeat this same test, using 300 as the density for OCR conversion, and 3500 pixels as the maximum dimension (~a page at 300 dpi). I expect that "OCR Total Time" will grow, as about 1/3 of these test images require external conversion, and the time spent by Tesseract should increase too.
Thanks, Wladimir! It would be very interesting to compare the running times and the number of common-word hits.
With 300, the OCR total time went from 2560 s to 2814 s. Below is a comparison of the number of hits for the words you mentioned in #206:
Word      250     300
------  -----   -----
para     5069    5031
com      5961    5954
de       9796    9808
em       5569    5476
até      1504    1493
você      686     686
nós      1570    1569
não      3101    3091
you       197     198
they       17      17
are       145     142
to       2079    2061
of       7241    7238
------  -----   -----
TOTAL   42935   42764
Thanks @tc-wleite! Interesting, so let's keep 250dpi/3000pixels.
While working on #515, I noticed a few things that could be improved in thumbnail generation and in the image viewer for some image formats.
1. External conversion geometry
External conversion (ImageMagick) currently uses the -sample parameter to limit the image size, both when generating thumbnails and for viewing images that are not handled by the "internal" (ImageIO) process. This parameter receives a "geometry". Currently just a single value is used, which means the width. As it scales the image keeping its aspect ratio, this process can generate images larger than necessary for portrait input images.
My suggestion is to limit both dimensions, as is done in the "internal" process. This is easily supported by ImageMagick (and GraphicsMagick), just by using -sample 480x480 instead of -sample 480.
An example of a pathological case (a 2 x 200 pixels image), which would generate a 480 x 48000 pixels image during the thumbnail generation process, if it requires external conversion:
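To make the difference concrete, here is a small illustrative sketch that computes the resulting size of the 2 x 200 example under a width-only geometry versus a WxH bounding box (plain arithmetic mirroring ImageMagick's geometry behavior):

```java
public class GeometryComparison {

    // Width-only geometry ("480"): scale so the width becomes the given value.
    static int[] widthOnly(int w, int h, int targetWidth) {
        double scale = (double) targetWidth / w;
        return new int[] { targetWidth, (int) Math.round(h * scale) };
    }

    // Bounding-box geometry ("480x480"): scale to fit inside the box, keeping aspect ratio.
    static int[] boundingBox(int w, int h, int maxW, int maxH) {
        double scale = Math.min((double) maxW / w, (double) maxH / h);
        return new int[] { (int) Math.round(w * scale), (int) Math.round(h * scale) };
    }

    public static void main(String[] args) {
        int[] a = widthOnly(2, 200, 480);        // -> 480 x 48000 (the pathological case)
        int[] b = boundingBox(2, 200, 480, 480); // -> 5 x 480 (rounded)
        System.out.println(a[0] + "x" + a[1] + " vs " + b[0] + "x" + b[1]);
    }
}
```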