Closed DanyD closed 5 years ago
If that is your input image you'll definitely need to do some preprocessing. Specifically you will need to do a perspective correction transform.
The simpler thing to do is find the four corners and dewarp. That wouldn't work for this specific image because the shape of interest is a polygon and not a whole page. For that you need the "grid" in projected space (if the page were a piece of graph paper, at what pixel values would each grid point appear?). In some cases you may also need to do distortion correction to invert the effect of the camera lens.
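As a rough illustration of the four-corner approach, here is a minimal pure-Python sketch that computes the perspective (homography) matrix mapping four detected corners onto an upright rectangle. The function names and the corner coordinates are made up for the example. In practice you would more likely use OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`, which also resample the pixels; this sketch only shows the underlying math.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so each row operation updates both sides at once.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corners (three may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(n):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src: Sequence[Point],
                            dst: Sequence[Point]) -> List[List[float]]:
    """Return the 3x3 matrix H that maps each src corner onto its dst corner."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear_system(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: List[List[float]], point: Point) -> Point:
    """Map one pixel coordinate through H (including the perspective divide)."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corners of a receipt in a phone photo, in the order
# top-left, top-right, bottom-right, bottom-left.
photo_corners = [(120.0, 80.0), (910.0, 140.0), (860.0, 1300.0), (60.0, 1210.0)]
flat_corners = [(0.0, 0.0), (800.0, 0.0), (800.0, 1000.0), (0.0, 1000.0)]
H = homography_from_corners(photo_corners, flat_corners)
```

To dewarp a whole image you would invert the mapping (swap `src` and `dst`) and sample the photo once for every output pixel. The polygon case described above needs more than four correspondences and is outside the scope of this sketch.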
(Yes, I love this stuff.)
Photoshop and GIMP can do this visually if you just want to test the effect on OCR.
ocrmypdf's --threshold feature may help for a perspective-transformed version.
For example: https://stackoverflow.com/a/6644246/369072
Note if this is going to be part of an online service you will want to ensure that your work complies with Ghostscript's AGPLv3 license.
Yes, this was one of our test images. You are right - the image itself is just a simple image from a mobile camera including some nice distortions :-) I started reading into opencv last night - really cool stuff out there and lots of ideas.
I will give the --threshold a try later on. The main point of my observation was that the default Docker image on Ubuntu you have created has no issues with this file - your tool works like a charm :-) But although I have set up the CentOS based image with the same prerequisites (as far as I am aware), the result is completely different and the OCR output is useless. So the question is which part of the pre-processing in your tool might differ between the two "versions", in order to isolate the cause of the error. I also tested a few files with a correct rotation and these worked fine on both Linux flavours - so I am pretty sure it is one of the pre-processing tools in your tool chain which fails or produces a different result on CentOS.
I have the debug folder with all intermediate images, files and logs - but what I am missing is an ordered list of how these files are used, so that I can compare all operational steps and find the part or dependency which causes different results between the two platforms. As far as I have seen, not all files are referenced in the log file, so I am currently not able to fully verify the processing order.
Thanks for the AGPL reference. We are currently just testing some scenarios for one of our customers but this is a good point to keep in mind for a later project.
Oh, well, that's very interesting.
There are enough complex dependencies that I wouldn't expect reproducibility, but it should be better than what you're seeing.
I am left with the distinct impression that CentOS is seeing the image upside down. At the bottom of its text it reads "4anaIS" which looks like "Steuer" upside down; uuewebbnig UIMEN looks like "Martin Brüggemann". You have to use your imagination a bit, but there's a sort of correspondence, especially for Brüggemann -> uuewebbnig.
Yes, the pipeline is not quite documented. It is something of an implementation detail. src/ocrmypdf/_pipeline.py, near the bottom, describes the pipeline, with file extension changes usually given as a suffix. The main filename is a page number. The images/ subfolder is for optimization only.
Here are the important ones:
.page.png - what the input page looks like
.image - the image we will show the user if we are in a mode that changes the final appearance; so named because it may be in one of several image formats
.text.pdf - the OCR file; this will load as a blank page but should have visible text if checked with a tool like pdftotext or pdfminer.six
.ocr.png - the file that is sent to Tesseract for OCR
Sometimes these may be symlinks to other files, or missing, depending on what is going on.
I would start with feeding some .ocr.png files to tesseract.
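A minimal sketch of that check, assuming the debug folder contains files named like 000001.ocr.png and that the `tesseract` binary is on the PATH. The helper names are hypothetical, and the `deu` default simply mirrors the `-l deu` used in the report.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def find_ocr_inputs(debug_dir: str) -> List[Path]:
    """Return the per-page *.ocr.png files in page order."""
    return sorted(Path(debug_dir).glob("*.ocr.png"))


def build_tesseract_command(image: Path, lang: str = "deu",
                            psm: Optional[int] = None) -> List[str]:
    """Build a Tesseract command line that prints recognized text to stdout."""
    cmd = ["tesseract", str(image), "stdout", "-l", lang]
    if psm is not None:
        # Tesseract 4 and later spell this --psm; 3.x releases used -psm.
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_debug_images(debug_dir: str, lang: str = "deu",
                     psm: Optional[int] = None) -> Dict[str, str]:
    """Run Tesseract on every *.ocr.png and return {filename: text}."""
    results = {}
    for image in find_ocr_inputs(debug_dir):
        proc = subprocess.run(build_tesseract_command(image, lang, psm),
                              capture_output=True, text=True, check=True)
        results[image.name] = proc.stdout
    return results
```

Running `ocr_debug_images()` on the debug folder from each platform and comparing the returned text should show whether the images handed to Tesseract already differ, or whether Tesseract itself behaves differently on identical input.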
Alright - I did some more tests and compiled a file with the intermediate files from your TMP folder. Here is my first analysis.
What I can see is that some files exist in one tmp folder but not in the other. But the PDF files and images seem to be similar across both platforms. Very strange. If I got you right, 000001.ocr.png is the input for tesseract, right? I have no clue yet what the difference might be.
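One way to make that comparison systematic is to hash every file in both debug folders and list what is missing or different. This is only a sketch with hypothetical function names. A byte-level difference is a hint rather than proof, since PDFs and PNGs can differ in metadata while rendering identically.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def hash_tree(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        # is_file() follows symlinks, so linked intermediates are hashed by
        # content; dangling links are skipped.
        if path.is_file():
            rel = path.relative_to(base).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_debug_dirs(left: str, right: str) -> Tuple[List[str], List[str], List[str]]:
    """Return (only in left, only in right, present in both but different)."""
    a, b = hash_tree(left), hash_tree(right)
    only_left = sorted(set(a) - set(b))
    only_right = sorted(set(b) - set(a))
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_left, only_right, differing
```

For example, `compare_debug_dirs("tmp_centos", "tmp_ubuntu")` (folder names assumed) would show at a glance whether 000001.ocr.png is byte-identical on both platforms.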
One interesting note: If I feed the input image directly to tesseract using psm 1 the result is correct. So I am sure there is something in the pre-processing causing trouble and making the result in this example worse than Tesseract standalone.
I apologize - I wrote this earlier but forgot to send it. It's a thing I do...
According to your log files, Ubuntu is running ocrmypdf 6.1.2. That is likely the main difference. That of course raises the question of why the older version gives an apparently better result.
In 6.1.2 ocrmypdf ran Ghostscript with the default value for -dAutoRotatePages. Newer versions specify -dAutoRotatePages=/None. The reason for this change is that Ghostscript's autorotation is unpredictable and interferes with the --rotate-pages feature.
If --rotate-pages is used, ocrmypdf will not rotate this image because it is not confident enough about the orientation:
Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 9.95 <--- confidence too low to rotate, default is 15
Script: Latin
Script confidence: 4.29
Lowering the threshold with --rotate-pages-threshold will work for this file, but will likely give you a lot of false positive rotations. The original image without blur may well work correctly, though, because there would be more text to establish the orientation with greater confidence.
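The decision described above can be sketched as follows. This is not ocrmypdf's actual code: the function names are hypothetical, and the `>=` comparison and the use of the `Rotate` field are assumptions. It only illustrates how a confidence of 9.95 falls below the default threshold of 15.

```python
from typing import Dict

# The orientation report quoted above, without the annotation.
OSD_SAMPLE = """\
Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 9.95
Script: Latin
Script confidence: 4.29
"""


def parse_osd(text: str) -> Dict[str, str]:
    """Split 'Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rotation_to_apply(osd_text: str, threshold: float = 15.0) -> int:
    """Return the rotation in degrees, or 0 if the confidence is too low."""
    fields = parse_osd(osd_text)
    # Take only the first token so trailing annotations are ignored.
    confidence = float(fields["Orientation confidence"].split()[0])
    angle = int(fields["Rotate"].split()[0])
    # Assumption: rotate only when confidence reaches the threshold.
    return angle if confidence >= threshold else 0
```

With the sample report, `rotation_to_apply(OSD_SAMPLE)` returns 0 at the default threshold, while `rotation_to_apply(OSD_SAMPLE, threshold=5)` returns 180. A threshold that low would also accept weak guesses on other pages, which is the false-positive risk mentioned above.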
If the image is manually rotated to the correct orientation, ocrmypdf 8.0 gives good OCR results.
I'll close the issue now. If you have further related questions feel free to reopen it.
Describe the issue We need to get ocrmypdf running within a CentOS based image. For this we installed all dependencies and ocrmypdf runs - but the OCR results are not comparable to a test with the Ubuntu based docker container.
We tried to analyze the intermediate files created during the conversion and it seems to be something regarding the preprocessing / auto-rotation as tesseract single run resulted in a comparable result on both containers.
We also tried several versions of the dependencies but they seem to have no impact on this issue. So currently we do not see any piece missing for ocrmypdf, nor does it give any error messages indicating that some requirements were not met. So the question is whether ocrmypdf has some known compatibility issues with CentOS or whether we have overlooked something in our tests.
To Reproduce Build the CentOS based container and run ocrmypdf /input/beleg1_clean.jpg /input/beleg1_clean.pdf --image-dpi 72 -l deu --sidecar sidecar.txt -k -v
See both attached sidecar.txt files, containing the OCR result of 1) the CentOS based image and 2) the Ubuntu based image
Dockerfile for testing (includes custom compiles for latest versions as well as default centos packages for the main dependencies (commented out)):
Example file Please include an example input PDF (or image). The input file is more helpful.
System:
OCR Output CentOS: sidecar_clean_centos.txt
OCR Output Ubuntu (Default ocrmypdf Image): sidecar_clean_docker.txt