ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: Existing text is completely replaced with other characters #1337

Open david-sledge opened 2 weeks ago

david-sledge commented 2 weeks ago

Describe the bug

Certain PDFs that already contain text come out with that text replaced by other characters, which makes them unreadable. This happens with both the --redo-ocr and --skip-text flags. Attached are (a) a sample PDF, (b) the results of OCRing it, and (c) a tarball containing everything needed to reproduce the issue.

Steps to reproduce

1. Download the tarball to a Linux machine with Docker installed.
2. Run the following command chain: tar -xzf bad-pdf-example.tar.gz && cd bad-pdf-example && docker run --rm -v .:/root/test-files -it $(docker build -q -t ocrmypdf-test .) && docker rmi ocrmypdf-test:latest
3. Open test-redo-ocr-result.pdf and test-skip-text-result.pdf

Files

test.pdf test-redo-ocr-result.pdf test-skip-text-result.pdf bad-pdf-example.tar.gz

How did you download and install the software?

Linux package manager (apt, dnf, etc.), Docker container

OCRmyPDF version

16.3.1

Relevant log output

tesseract 5.4.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.17
OCRmyPDF version:
16.3.1
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 skipping all processing on this page                                                                                                                                                                                      _pipeline.py:330
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1515.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)                                                                                                                                                                                           _common.py:441
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 redoing OCR                                                                                                                                                                                                               _pipeline.py:327
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                                                     ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                             _metadata.py:62
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%                                                                                                                                                                                    _pipeline.py:989
Total file size ratio: 0.06 savings: -1554.7%                                                                                                                                                                                   _pipeline.py:992
Output file is a PDF/A-2B (as expected)
jbarlow83 commented 1 week ago

The problem with this file is that it does not embed the fonts it uses, in this case Arial Bold and Arial Bold Italic. It was previously processed by Nitro Pro 13.

When Ghostscript (which OCRmyPDF uses) processes the file, it replaces the missing fonts with a substitute, "DroidSansFallback". The kerning of the substitute is different, so the PDF viewer sees spaces between letters. At least, that is what happens for me; I don't know how an Asian font was substituted in your version.
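One way to check whether a PDF has fonts that are not embedded is to read the `emb` column printed by `pdffonts` from poppler-utils. The following is a minimal sketch, not part of OCRmyPDF: the function names are hypothetical, it assumes `pdffonts` prints its usual table (font name first, then `emb sub uni` and the object ID as the last five fields), and the sample text is illustrative rather than taken from the attached test.pdf.

```python
from __future__ import annotations

import shutil
import subprocess


def unembedded_fonts(pdffonts_output: str) -> list[str]:
    """Return the names of fonts whose 'emb' column reads 'no'.

    Assumes each font row ends with: emb, sub, uni, object number, generation.
    Header and separator rows never have 'no' in that position, so they are
    skipped without special handling.
    """
    missing = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Counting from the right tolerates font types that contain a space,
        # such as "Type 1" or "CID TrueType".
        if len(fields) >= 8 and fields[-5] == "no":
            missing.append(fields[0])
    return missing


def check_pdf(path: str) -> list[str]:
    """Run pdffonts on a file and list its unembedded fonts."""
    if shutil.which("pdffonts") is None:
        raise RuntimeError("pdffonts (poppler-utils) was not found on PATH")
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return unembedded_fonts(result.stdout)


# Hypothetical pdffonts output, for illustration only.
SAMPLE = (
    "name                                 type              encoding         emb sub uni object ID\n"
    "------------------------------------ ----------------- ---------------- --- --- --- ---------\n"
    "Arial-BoldMT                         TrueType          WinAnsi          no  no  no       8  0\n"
    "ABCDEF+Calibri                       CID TrueType      Identity-H       yes yes yes     12  0\n"
)

if __name__ == "__main__":
    print(unembedded_fonts(SAMPLE))
```

A non-empty result means that any tool which rewrites the page content, Ghostscript included, has to pick substitute fonts for those names.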

ocrmypdf --output-type pdf avoids Ghostscript, and produces a usable result.

Try running gs -sDEVICE=pdfwrite -o output.pdf test.pdf and see if you can reproduce the Japanese-Korean version, then report it to Ghostscript. I won't report it myself because the test file potentially contains personal information that is not mine.
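The two commands in this comment can be scripted so that the Ghostscript-only result and the Ghostscript-free OCRmyPDF result sit side by side for comparison. This is a rough sketch under stated assumptions: the helper names and output file names are made up for illustration, and combining `--redo-ocr` (the flag from the original report) with `--output-type pdf` is an assumption rather than something tested in this thread.

```python
from __future__ import annotations

import shutil
import subprocess
from typing import Optional


def ghostscript_roundtrip_cmd(src: str, dst: str) -> list[str]:
    """The pdfwrite round trip suggested above, as an argument list."""
    return ["gs", "-sDEVICE=pdfwrite", "-o", dst, src]


def ocrmypdf_plain_pdf_cmd(src: str, dst: str) -> list[str]:
    """OCRmyPDF with --output-type pdf, which the comment says avoids Ghostscript."""
    return ["ocrmypdf", "--redo-ocr", "--output-type", "pdf", src, dst]


def run_if_available(cmd: list[str]) -> Optional[int]:
    """Run cmd if its executable is on PATH; return the exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # Output names are placeholders; test.pdf is the sample attached to the issue.
    for cmd in (
        ghostscript_roundtrip_cmd("test.pdf", "gs-only.pdf"),
        ocrmypdf_plain_pdf_cmd("test.pdf", "ocrmypdf-plain.pdf"),
    ):
        code = run_if_available(cmd)
        print(" ".join(cmd), "->", "not installed" if code is None else code)
```

If gs-only.pdf already shows the wrong characters, the substitution is happening in Ghostscript itself and can be reported there without involving OCRmyPDF.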

jbarlow83 commented 1 week ago

ocrmypdf --force-ocr would also fix this file completely, with or without Ghostscript.

I am considering adding a warning about Ghostscript font substitution, especially if someone else encounters this. Ghostscript has had several issues with mangling text recently.

beshtim commented 1 week ago

I think I have the same problem.

Screenshot 2024-06-25 121753 Screenshot 2024-06-25 122040

I am also running with --redo-ocr, and these hieroglyphs appear sometimes.