ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: File generated by OCRmyPDF doesn't open in all PDF editors #1400

Open sklart opened 3 days ago

sklart commented 3 days ago

Describe the bug

Hello! I ran into an interesting problem with the file Сертификат качества №3490 (опора 1У110-5+10, 2 шт.).pdf.

It opens only in PDF-XChange Editor 10.4.1.389; it does not open in Foxit PDF Editor Pro 12.1.1.15289 or in Acrobat Pro 2023.006.20360 (64-Bit). My guess is that pikepdf saves the file in some PDF format that the other editors do not support.

The PDF is created as follows:

  1. Scan on the MFP.
  2. Manual processing in PDF-XChange Editor: rotating pages and changing page sizes to A4 or A3 (the MFP sometimes produces PDFs with non-standard page dimensions of several hundred mm).
  3. Run OCR on every PDF in the folder with ocrmypdf (which uses pikepdf to process PDFs), using the command FOR /r %F IN (*.pdf) DO ocrmypdf -l eng+rus --rotate-pages --skip-text --optimize 1 --output-type pdf "%F" "%~fF"
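The batch step above overwrites each input file in place, so the pre-OCR file is lost unless it is backed up elsewhere. As a minimal sketch only, the same ocrmypdf options could be driven from Python while writing the results to a separate folder, which keeps the untouched input available for a bug report. The function names and the `ocr_output` folder are hypothetical and chosen for illustration; only the command-line flags come from the command above.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path) -> list[str]:
    """Return the ocrmypdf invocation from the report for a single file."""
    return [
        "ocrmypdf",
        "-l", "eng+rus",
        "--rotate-pages",
        "--skip-text",
        "--optimize", "1",
        "--output-type", "pdf",
        str(src),
        str(dst),
    ]


def plan_jobs(root: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every PDF under root with an output path mirrored under out_dir."""
    jobs = []
    for src in sorted(root.rglob("*.pdf")):
        # Skip files that are themselves earlier OCR results.
        if out_dir in src.parents:
            continue
        jobs.append((src, out_dir / src.relative_to(root)))
    return jobs


def run_all(root: Path, out_dir: Path) -> None:
    """Run ocrmypdf on each planned job; inputs are never overwritten."""
    for src, dst in plan_jobs(root, out_dir):
        dst.parent.mkdir(parents=True, exist_ok=True)
        # check=False: keep going if one file fails, as the FOR loop does.
        subprocess.run(build_ocr_command(src, dst), check=False)
```

Calling `run_all(Path("scans"), Path("scans") / "ocr_output")` (both paths hypothetical) would process a folder this way, assuming ocrmypdf is on the PATH.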

If I convert the document's page to an image in PDF-XChange Editor and then save it back to PDF, the file opens in all editors.

I still wonder what causes this behavior. Can anyone tell me?
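One quick way to probe the "unsupported PDF format" guess is to compare the version declared in each file's `%PDF-x.y` header. The sketch below is only a rough first check: the helper names are hypothetical, the document catalog can override the header version, and a matching version would not rule out other causes.

```python
import re
from pathlib import Path
from typing import Optional

# A PDF normally starts with a header such as "%PDF-1.7".
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def declared_pdf_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if absent."""
    # Only the start of the file is inspected; 1024 bytes is a tolerant window.
    match = HEADER_RE.search(data[:1024])
    return match.group(1).decode("ascii") if match else None


def report_versions(folder: Path) -> dict[str, Optional[str]]:
    """Map each PDF file name in folder to its declared header version."""
    result = {}
    for path in sorted(folder.glob("*.pdf")):
        with path.open("rb") as fh:
            result[path.name] = declared_pdf_version(fh.read(1024))
    return result
```

Running `report_versions` over a folder holding the working and non-working copies would show whether the declared version differs between them.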

P.S. I originally reported this in the pikepdf issue tracker, but I was told that pikepdf is not the cause and that this is more likely an OCRmyPDF issue.

Steps to reproduce

No response

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

ocrmypdf 16.4.3

Relevant log output

No response

jbarlow83 commented 2 days ago

Please provide the input file as it was before processing with ocrmypdf.

sklart commented 2 days ago

From cloud storage, I was able to download the versions of the file saved at different stages of its processing. The strange behavior starts with the version dated 2024.08.07 10.31.21.

Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (final ver. 2024.09.30 13.56.21).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 8.51.55).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 9.47.03).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 9.51.52).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 10.02.30).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 10.31.21).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 10.54.37).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 14.11.12).pdf
Сертификат качества №3490 (опора 1У110-5+10, 2 шт.) (ver. 2024.08.07 14.27.50).pdf