ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Dockerfile using CentOS - Bad OCR results #335

Closed DanyD closed 5 years ago

DanyD commented 5 years ago

Describe the issue: We need to get ocrmypdf running in a CentOS-based image. We installed all dependencies and ocrmypdf runs, but the OCR results are much worse than in a test with the Ubuntu-based Docker container.

We analyzed the intermediate files created during the conversion. The problem seems to lie in the preprocessing / auto-rotation, because running tesseract on its own gave comparable results in both containers.

We also tried several versions of the dependencies, but they seem to have no impact on this issue. As far as we can see, nothing is missing for ocrmypdf, and it gives no error messages indicating that a requirement was not met. So the question is whether ocrmypdf has known compatibility issues with CentOS, or whether we have overlooked something in our tests.

To Reproduce: Build the CentOS-based container and run:

ocrmypdf /input/beleg1_clean.jpg /input/beleg1_clean.pdf --image-dpi 72 -l deu --sidecar sidecar.txt -k -v

See the two attached sidecar.txt files, which contain the OCR results of 1) the CentOS-based image and 2) the Ubuntu-based image.

Dockerfile for testing (includes custom builds of the latest versions; the default CentOS packages for the main dependencies are commented out):

FROM centos:7

RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

# Use de_DE.UTF-8 as our locale
RUN echo "de_DE.UTF-8 UTF-8" >> /etc/locale.gen && \
    localedef --quiet -c -i de_DE -f UTF-8 de_DE.UTF-8

ENV LANG de_DE.utf-8
ENV LANGUAGE de_DE
ENV LC_ALL de_DE.utf-8

# === Install OCRmyPDF + Dependencies ===================================================

## Install BUILD Tools
RUN yum -y --nogpgcheck install libxslt gcc gcc-c++ make autoconf wget automake libtool unzip openssl-devel bzip2-devel  libffi-devel

## Tesseract 4 (Centos has only 3 by default)
RUN yum-config-manager --nogpgcheck --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
RUN yum -y install --nogpgcheck tesseract tesseract-langpack-deu

# Install ocrmypdf prerequisites (using default CentOS versions)

# qpdf 5.0, GS 9.07, unpaper 0.3
# RUN yum install -y --nogpgcheck qpdf ghostscript unpaper
RUN yum install -y --nogpgcheck pngquant jbig2dec 

WORKDIR /tmp

# Use Python 3.6 from repo
RUN yum install -y --nogpgcheck https://centos7.iuscommunity.org/ius-release.rpm 
RUN yum install -y --nogpgcheck python36u python36u-libs python36u-devel python36u-pip
RUN pip3.6 install --upgrade pip

# Compile Unpaper 6 
# FFMPEG Repository
RUN yum -y --nogpgcheck install http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-5.el7.nux.noarch.rpm
RUN yum -y --nogpgcheck install ffmpeg-devel libjpeg-devel libpng-devel libtiff-devel zlib-devel ocaml ImageMagick ImageMagick-devel
# Build
RUN wget https://www.flameeyes.eu/files/unpaper-6.1.tar.xz \
    && tar xvf unpaper-6.1.tar.xz \
    && cd unpaper-6.1 \
    && ./configure \
    && make && make install 

# Install Ghostscript 9.26 (prebuilt binary, nothing is compiled here)
RUN wget https://github.com/ArtifexSoftware/ghostpdl-downloads/releases/download/gs926/ghostscript-9.26-linux-x86_64.tgz \
    && tar xvf ghostscript-9.26-linux-x86_64.tgz \
    && cd ghostscript-9.26-linux-x86_64 \
    && cp gs-926-linux-x86_64 /usr/bin/gs 

# Compile QPDF 8
RUN wget https://github.com/qpdf/qpdf/releases/download/release-qpdf-8.3.0/qpdf-8.3.0.tar.gz \
    && tar xvf qpdf-8.3.0.tar.gz \
    && cd qpdf-8.3.0 \
    && ./configure \
    && make && make install

# Install ocrmypdf
RUN pip3.6 install ocrmypdf

RUN mkdir /input
WORKDIR /input
CMD ["bash"]

Example file:

beleg1_clean

OCR Output CentOS: sidecar_clean_centos.txt

OCR Output Ubuntu (Default ocrmypdf Image): sidecar_clean_docker.txt

jbarlow83 commented 5 years ago

If that is your input image you'll definitely need to do some preprocessing. Specifically you will need to do a perspective correction transform.

The simpler thing to do is find the four corners and dewarp. That wouldn't work for this specific image because the shape of interest is a polygon and not a whole page. For that you need the "grid" in projected space (if it were a piece of graph paper, at what pixel values would each point appear?). In some cases you may also need to do distortion correction to invert the effect of the camera lens.
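To make the four-corner idea concrete, here is a minimal sketch in pure Python with no dependencies. It solves for the 3x3 perspective (homography) matrix that maps four detected corners onto an upright rectangle. The function names are hypothetical and chosen for illustration only. In practice a library such as OpenCV (`cv2.getPerspectiveTransform` plus `cv2.warpPerspective`) would compute the same matrix and also resample the pixels, which this sketch does not do.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corners (e.g. three on one line)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """Return the 3x3 matrix H that maps each src corner onto its dst corner."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h: List[List[float]], point: Point) -> Point:
    """Apply H to one point, including the perspective divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Given four corner positions found in the photo (by hand or by a contour detector) and the target rectangle, `homography(corners, rectangle)` yields the matrix, and `warp_point` shows where any source pixel lands. As noted above, this only handles a flat, four-cornered region and does nothing about lens distortion.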

(Yes, I love this stuff.)

Photoshop and GIMP can do this visually if you just want to test the effect on OCR.

ocrmypdf's --threshold feature may help for a perspective transformed version.

For example: https://stackoverflow.com/a/6644246/369072

https://stackoverflow.com/questions/49750110/opencv-detecting-a-rectangle-on-a-photo-of-paper-with-inner-elements

Note if this is going to be part of an online service you will want to ensure that your work complies with Ghostscript's AGPLv3 license.

DanyD commented 5 years ago

Yes, this was one of our test images. You are right - the image itself is just a simple photo from a mobile camera, including some nice distortions :-) I started reading up on opencv last night - really cool stuff out there and lots of ideas.

I will give --threshold a try later on. The main point of my observations was that the default Docker image on Ubuntu that you created has no issues with this file - your tool works like a charm :-) But although I have set up the CentOS-based image with the same prerequisites (as far as I am aware), the result is completely different and the OCR output is useless. So the question is which part of the pre-processing in your tool might differ between the two "versions", so that we can isolate the cause of the error. I also tested a few files with correct rotation and these worked fine on both Linux flavours - so I am pretty sure it is one of the pre-processing tools in your tool chain which fails or produces a different result on CentOS.

I have the debug folder with all intermediate images, files, and logs - but what I am missing is an ordered list of how these files are used, so that I can compare all processing steps and find the part or dependency which causes different results between the two platforms. As far as I have seen, not all files are referenced in the log file, so I am currently not able to fully verify the processing order.

Thanks for the AGPL reference. We are currently just testing some scenarios for one of our customers but this is a good point to keep in mind for a later project.

jbarlow83 commented 5 years ago

Oh, well, that's very interesting.

There are enough complex dependencies that I wouldn't expect reproducibility, but it should be better than what you're seeing.

I am left with the distinct impression that CentOS is seeing the image upside down. At the bottom of its text it reads "4anaIS", which looks like "Steuer" upside down; "uuewebbnig UIMEN" looks like "Martin Brüggemann". You have to use your imagination a bit, but there's a sort of correspondence, especially for Brüggemann -> uuewebbnig.

Yes, the pipeline is not really documented. It is something of an implementation detail. src/ocrmypdf/_pipeline.py, near the bottom, describes the pipeline, with file extension changes usually given as a suffix. The main filename is a page number. The images/ subfolder is for optimization only.

Here are the important ones:

.page.png - what the input page looks like
.image - the image we will show the user if we are in a mode that changes the final appearance; so named because it may be in one of several image formats
.text.pdf - the OCR file; this will load as a blank page but should have visible text if checked with a tool like pdftotext or pdfminer.six
.ocr.png - the file that is sent to Tesseract for OCR

Sometimes these may be symlinks to other files, or missing, depending on what is going on.

I would start by feeding some .ocr.png files to tesseract.
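For comparing two debug folders side by side (say, one from CentOS and one from Ubuntu), a small script along these lines may help. It is only a sketch: the suffix list is taken from the files named above, the function names are hypothetical, and it assumes all intermediate files sit directly in the folder.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Suffixes from the list above; the pipeline may write other files as well.
KEY_SUFFIXES = (".page.png", ".image", ".text.pdf", ".ocr.png")


def digest(path: Path) -> str:
    """Hash the file contents; read_bytes follows symlinks."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def index_folder(folder: Path) -> Dict[str, str]:
    """Map each key intermediate file name to its content hash."""
    return {
        p.name: digest(p)
        for p in sorted(folder.iterdir())
        # is_file() is False for broken symlinks, so those are skipped.
        if p.is_file() and p.name.endswith(KEY_SUFFIXES)
    }


def compare_folders(left: Path, right: Path) -> Dict[str, str]:
    """Classify every key file as same, different, or present on one side only."""
    a, b = index_folder(left), index_folder(right)
    report = {}
    for name in sorted(a.keys() | b.keys()):
        if name not in a:
            report[name] = "only in right"
        elif name not in b:
            report[name] = "only in left"
        else:
            report[name] = "same" if a[name] == b[name] else "different"
    return report
```

Note that "different" only means the bytes differ. Two PNGs written by different library versions can hold identical pixels and still hash differently, so a flagged file is a candidate for closer inspection rather than proof of a problem.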

DanyD commented 5 years ago

Alright - I ran some more tests and bundled the intermediate files from your TMP folder into one archive:

https://blueend-my.sharepoint.com/:u:/p/daniel_wilhelm/EQRyqSI3q3tHvBDMHubVNSUBjyVeUN6wIAywXuOTubNP-Q?e=3o9piW

From my first analysis, some files exist in one tmp folder but not in the other. But the PDF files and images seem to be similar across both platforms. Very strange. If I understood you correctly, 000001.ocr.png is the input for tesseract, right? I have no clue yet what the difference might be.

One interesting note: if I feed the input image directly to tesseract using psm 1, the result is correct. So I am sure something in the pre-processing is causing trouble and making the result in this example worse than standalone Tesseract.

jbarlow83 commented 5 years ago

I apologize - I wrote this earlier but forgot to send it. It's a thing I do...

According to your log files, the Ubuntu image is running ocrmypdf 6.1.2. That is likely the main difference. That of course raises the question of why the older version gives an apparently better result.

In 6.1.2 ocrmypdf ran Ghostscript with the default value for -dAutoRotatePages. Newer versions specify -dAutoRotatePages=/None. The reason for this change is that Ghostscript's autorotation is unpredictable and interferes with the --rotate-pages feature.

If --rotate-pages is used, ocrmypdf will not rotate this image because it is not confident enough about the orientation:

Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 9.95  <--- confidence too low to rotate, default is 15
Script: Latin
Script confidence: 4.29

Lowering the threshold with --rotate-pages-threshold will work for this file, but will likely give you a lot of false positive rotations. The original image without blur may well work correctly, though, because there would be more text to establish the orientation with greater confidence.
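The threshold decision can be illustrated with a short sketch that parses orientation output like the block above. This is not OCRmyPDF's actual code: the function names are hypothetical, and treating a confidence equal to the threshold as sufficient is an assumption.

```python
from typing import Dict

# 15 is the default threshold quoted above.
DEFAULT_THRESHOLD = 15.0


def parse_osd(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines such as the orientation report above."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rotation_to_apply(osd_text: str, threshold: float = DEFAULT_THRESHOLD) -> int:
    """Return the suggested rotation in degrees, or 0 if confidence is too low."""
    fields = parse_osd(osd_text)
    confidence = float(fields["Orientation confidence"])
    rotate = int(fields["Rotate"])
    # Assumption: a confidence equal to the threshold counts as enough.
    return rotate if confidence >= threshold else 0
```

With the values reported for this file (Rotate 180, confidence 9.95), the function returns 0 at the default threshold and 180 once the threshold drops below 9.95, which is the trade-off described above: a lower threshold fixes this page but also lets weaker, possibly wrong, orientation guesses through.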

If the image is manually rotated to the correct orientation, ocrmypdf 8.0 gives good OCR results.

jbarlow83 commented 5 years ago

I'll close the issue now. If you have further related questions feel free to reopen it.