ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: OCRmyPDF not adding any text to document v 1.4 #1251

Closed maxi07 closed 9 months ago

maxi07 commented 9 months ago

Describe the bug

My scanner creates files with PDF version 1.4. With OCRmyPDF version 13.0.4 the file is processed without problems. With the latest version, the text is missing from the output file. It is the same file as the one in the test directory. If I use the file from the test directory directly, OCR works. I printed that file and scanned it; in this case OCR runs without any errors and exits with code 0, but there is no text to be found in the document. So I suspect it might have trouble with the PDF version?

output of pdfinfo:

CreationDate:    Sun Feb 11 23:07:12 2024 CET
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           1
Encrypted:       no
Page size:       841.68 x 595.2 pts (A4)
Page rot:        270
File size:       612818 bytes
Optimized:       no
PDF version:     1.4
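
To compare a working file with a failing one, it can help to turn the `pdfinfo` output into a dictionary and look at fields such as `PDF version` and `Page rot` side by side. The following is a minimal sketch; the function name `parse_pdfinfo` is hypothetical, and the sample text is a subset of the output above.

```python
def parse_pdfinfo(text: str) -> dict[str, str]:
    """Split each 'Key: value' line of pdfinfo output at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


# Subset of the pdfinfo output quoted in this report
sample = """\
Pages:           1
Page size:       841.68 x 595.2 pts (A4)
Page rot:        270
PDF version:     1.4
"""

info = parse_pdfinfo(sample)
print(info["PDF version"], info["Page rot"])
```

Splitting at the first colon only keeps values that contain colons themselves, such as the `CreationDate` timestamp, in one piece.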

Steps to reproduce

1. Run `ocrmypdf ohnetext.pdf mittext.pdf -v`
2. Open the output file; it contains no text.
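
Step 2 can also be checked without opening a viewer. This sketch assumes poppler's `pdftotext` (from the same toolset as `pdfinfo`) is on the PATH; the helper names `extract_text` and `has_text` are hypothetical.

```python
import subprocess


def has_text(extracted: str) -> bool:
    """Return True if the extracted text has any non-whitespace character."""
    # pdftotext emits form feeds between pages; strip() treats them as whitespace
    return bool(extracted.strip())


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return what it writes to stdout ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage (needs pdftotext installed and the file present):
#   has_text(extract_text("mittext.pdf"))
```

A `False` result for `mittext.pdf` would match the behaviour described here: exit code 0 but no searchable text.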

Files

ohnetext.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.0.4

Relevant log output

ocrmypdf 16.0.4                                                                                                                               __main__.py:59
Running: ['tesseract', '--version']                                                                                                          __init__.py:134
Found tesseract 4.1.1                                                                                                                        __init__.py:343
Running: ['tesseract', '--version']                                                                                                          __init__.py:134
Running: ['gs', '--version']                                                                                                                 __init__.py:134
Found gs 9.55.0                                                                                                                              __init__.py:343
Running: ['gs', '--version']                                                                                                                 __init__.py:134
Running: ['tesseract', '--list-langs']                                                                                                       __init__.py:134
stdout/stderr = List of available languages (3):                                                                                              __init__.py:74
deu
eng
osd

No language specified; assuming --language eng                                                                                             _validation.py:61
pikepdf mmap enabled                                                                                                                          helpers.py:325
os.symlink(ohnetext.pdf, /tmp/ocrmypdf.io.yuxx2ef8/origin)                                                                                    helpers.py:178
os.symlink(/tmp/ocrmypdf.io.yuxx2ef8/origin, /tmp/ocrmypdf.io.yuxx2ef8/origin.pdf)                                                            helpers.py:178
Gathering info with 1 thread workers                                                                                                             info.py:773
pikepdf mmap enabled                                                                                                                          helpers.py:325
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                                                   tesseract_ocr.py:179
pikepdf mmap enabled                                                                                                                          helpers.py:325
    1 Rasterize with png16m, rotation 0                                                                                                     _pipeline.py:527
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1',           __init__.py:134
'-dLastPage=1', '-r300.000000x300.000000', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f',
'/tmp/ocrmypdf.io.yuxx2ef8/origin.pdf']
    1 Rotating output by 0                                                                                                                ghostscript.py:149
    1 resolution (299.9994, 299.9994)                                                                                                       _pipeline.py:606
    1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yuxx2ef8/000001_ocr.png',                                __init__.py:134
'/tmp/ocrmypdf.io.yuxx2ef8/000001_ocr_tess', 'pdf', 'txt']
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 270) -> 90                                                      _graft.py:140
    1 Grafting                                                                                                                                 _graft.py:251
    1 Grafting with ctm pikepdf.Matrix(-1.83697e-16, -1, 1, -1.83697e-16, -5.68434e-14, -841.68)                                               _graft.py:295
    1 Page rotation: (content, auto) -> page = (270, 0) -> 270                                                                                 _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                 ocr.py:147
os.symlink(/tmp/ocrmypdf.io.yuxx2ef8/graft_layers.pdf, /tmp/ocrmypdf.io.yuxx2ef8/fix_docinfo.pdf)                                             helpers.py:178
Running: ['gs', '--version']                                                                                                                 __init__.py:134
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',               __init__.py:134
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true',
'-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yuxx2ef8/fix_docinfo.pdf',
'/tmp/ocrmypdf.io.yuxx2ef8/pdfa.ps']
GPL Ghostscript 9.55.0 (2021-09-27)                                                                                                          __init__.py:109
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.                                                                              __init__.py:109
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                   __init__.py:109
see the file COPYING for details.                                                                                                            __init__.py:109
Processing pages 1 through 1.                                                                                                                __init__.py:109
Page 1                                                                                                                                       __init__.py:109
Running: ['tesseract', '--version']                                                                                                          __init__.py:134
xref 18: treating as an optimization candidate                                                                                               optimize.py:273
XrefExt(xref=18, ext='.png')                                                                                                                 optimize.py:338
Optimizable images: JPEGs: 0 PNGs: 1                                                                                                         optimize.py:343
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 18: treating as an optimization candidate                                                                                               optimize.py:273
xref 18: marking this JPEG as deflatable                                                                                                     optimize.py:538
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 18: treating as an optimization candidate                                                                                               optimize.py:273
xref 18: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                                                      optimize.py:98
Optimizable images: JBIG2 groups: 0                                                                                                          optimize.py:354
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.yuxx2ef8/optimize.opt.pdf, /tmp/ocrmypdf.io.yuxx2ef8/optimize.pdf)                                                helpers.py:178
Running: ['jbig2', '--version']                                                                                                              __init__.py:134
Running: ['pngquant', '--version']                                                                                                           __init__.py:134
Image optimization ratio: 1.17 savings: 14.3%                                                                                               _pipeline.py:915
Total file size ratio: 1.15 savings: 12.9%                                                                                                  _pipeline.py:918
/tmp/ocrmypdf.io.yuxx2ef8/optimize.pdf -> mittext.pdf                                                                                       _pipeline.py:990
Output file is a PDF/A-2B (as expected)
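
The rotation lines from `_graft.py` in the log above can be reproduced with a little arithmetic. This is only a sketch: the two modulo formulas are an assumption chosen because they match the logged values, not code taken from OCRmyPDF, and `rotation_matrix` is a hypothetical helper using the PDF convention for a counterclockwise rotation.

```python
import math


def rotation_matrix(degrees: float) -> tuple[float, float, float, float]:
    """Return (a, b, c, d) of a PDF matrix for a counterclockwise rotation."""
    rad = math.radians(degrees)
    return (math.cos(rad), math.sin(rad), -math.sin(rad), math.cos(rad))


# Values reported in the log: (text, autorotate, content) = (0, 0, 270)
text_rot, autorotate, content_rot = 0, 0, 270

# Assumed formulas that agree with the logged results
misalignment = (text_rot - autorotate - content_rot) % 360
page_rot = (content_rot - autorotate) % 360
print(misalignment, page_rot)  # 90 270

# First four entries of the logged ctm, for a 270 degree rotation
a, b, c, d = rotation_matrix(270)
print(a, b, c, d)
```

The `-1.83697e-16` entries in the logged matrix are floating-point noise from `cos(270°)` and are effectively zero, as is the `-5.68434e-14` translation term. The `-841.68` translation matches the longer page dimension reported by `pdfinfo`.
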

maxi07 commented 9 months ago

This seems to be resolved by v16.1.1, where Python 3.10 support was fixed. Thanks!