Open dkarsai opened 2 days ago
Here is an example of the kind of document I'm comparing. (The original image is a .tif, but I can't attach .tif files to GitHub.)
Hi @dkarsai, thanks for raising the issue, I'll look into it. In general I can say: the DPI of an image is increased before OCR because Tesseract works better at resolutions around 300 DPI. In my experiments, even simply resizing a 72 DPI image already improved the results. I will look into your examples later; I'm currently just on my mobile.
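To make the resizing idea concrete, here is a minimal sketch of the arithmetic involved in scaling an image up to a target DPI. It is not the library's actual code: the function name `upscale_size` and the default `target_dpi=300` are assumptions for illustration only.

```python
def upscale_size(width_px, height_px, source_dpi, target_dpi=300):
    """Return (new_width, new_height, scale) needed to reach target_dpi.

    Hypothetical helper for illustration; not part of DocTest.
    """
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # The physical size stays the same, so the pixel count grows
    # in proportion to target_dpi / source_dpi.
    scale = target_dpi / source_dpi
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height, scale


# A 720x360 px image at 72 DPI versus a 1000x500 px image at 200 DPI.
low_dpi = upscale_size(720, 360, 72)
mid_dpi = upscale_size(1000, 500, 200)
```

Note that an image that really is 200 DPI only needs a 1.5x enlargement, while one treated as 72 DPI is enlarged by more than 4x, so the DPI value assumed for the input matters a lot.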
Hi, I'm hoping I can get some help with a problem I'm having. The library might be working as intended, but I'm unsure what is happening under the hood exactly, so any input is welcome.
I noticed mask placement based on regexes was sometimes failing. I found a suspicious line in the logs:
I found this strange because the images I'm comparing are around 200 DPI.
I found where self.DPI is set to 72 in the code (e.g. DocTest/CompareImage.py:476, in the load_image_into_array function), but I was unable to decipher what the purpose of this is.
I looked at the output of OCR (text extracted with Get Text From Document keyword) and found that the recognized text is indeed incorrect. Example: A70578524 was recognized as A/03/8524
When extracting the text with Get Text From Document keyword, the same line reappeared:
So I set increase_resolution=false to prevent re-rendering, and the OCR output was now as expected: 'A70578524'
I experimented some more and set MINIMUM_OCR_RESOLUTION to 72 in DocTest/CompareImage.py:34. This prevented re-rendering from being triggered when using the Compare Images keyword, and all masks were placed correctly.
So I think the re-rendering is introducing some issue with identifying the text correctly with OCR, causing the masks to not get applied.
Why is the DPI of the image being set to 72 in the code? Why won't it recognize the correct DPI of the image and not re-render? Implementing the ability to set MINIMUM_OCR_RESOLUTION when calling the Compare Images keyword would provide a solution for the issue, but I did not want to suggest any changes until I understand the issue completely.
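To illustrate what I mean by making the threshold settable per call, here is a rough sketch of the kind of check I imagine. This is only my guess at the logic, not the library's code: `should_rerender`, `effective_dpi`, the 300 default, and the 72 fallback for missing DPI metadata are all assumptions on my part.

```python
DEFAULT_MINIMUM_OCR_RESOLUTION = 300  # assumed default, based on the ~300 DPI remark
FALLBACK_DPI = 72  # assumed value used when the image carries no DPI metadata


def effective_dpi(metadata_dpi):
    """Return the DPI from image metadata, or the fallback if it is missing."""
    return metadata_dpi if metadata_dpi else FALLBACK_DPI


def should_rerender(metadata_dpi,
                    minimum_ocr_resolution=DEFAULT_MINIMUM_OCR_RESOLUTION,
                    increase_resolution=True):
    """Decide whether the image would be re-rendered before OCR (hypothetical)."""
    if not increase_resolution:
        # Mirrors increase_resolution=false: never re-render.
        return False
    return effective_dpi(metadata_dpi) < minimum_ocr_resolution


# A 200 DPI scan with the assumed default threshold, then with a per-call override.
default_case = should_rerender(200)
override_case = should_rerender(200, minimum_ocr_resolution=72)
```

With a per-call override like this, a 200 DPI scan could be left untouched without editing the module-level constant.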
Thank you in advance!