ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: `lots of diacritics - possibly poor OCR` but using standalone tesseract works perfectly #1335

Closed KAGEYAM4 closed 5 months ago

KAGEYAM4 commented 5 months ago

Describe the bug

I got the warning `lots of diacritics - possibly poor OCR`. I opened the output PDF and tried selecting text, and some of the text could not be selected. I then ran tesseract directly on those regions with `grimblast save area - | tesseract - - | wl-copy ; notify-send "$(wl-paste)"`, and tesseract recognized them successfully. Why does standalone tesseract work when ocrmypdf does not?

Steps to reproduce

[phoenix@ArchLinux Downloads]$ ocrmypdf test.pdf  output.pdf
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1  lots of diacritics - possibly poor OCR                                                                                                              tesseract.py:241
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                               
                                                                                 ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.06 savings: 5.8%                                                                                                               _pipeline.py:989
Total file size ratio: 0.98 savings: -2.0%                                                                                                                 _pipeline.py:992
Output file is a PDF/A-2B (as expected) 

Files

The original PDF has 250 pages, and I got this warning on every page. I extracted a single page from the PDF and ran ocrmypdf on it in order to reduce the size of the file I would have to upload here.

test.pdf output.pdf

How did you download and install the software?

Linux package manager - AUR

OCRmyPDF version

16.3.1

Relevant log output

[phoenix@ArchLinux Downloads]$ ocrmypdf -v1 test.pdf  output.pdf |& wl-copy
ocrmypdf 16.3.1
Running: ['tesseract', '--version']
Found tesseract 5.4.1
Running: ['tesseract', '--version']
Running: ['gs', '--version']
Found gs 10.3.1
Running: ['gs', '--version']
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages in "/usr/share/tessdata/" (2):
eng
osd

pikepdf mmap enabled
os.symlink(test.pdf, /tmp/ocrmypdf.io.ripciktr/origin)
os.symlink(/tmp/ocrmypdf.io.ripciktr/origin, /tmp/ocrmypdf.io.ripciktr/origin.pdf)
Gathering info with 1 thread workers
pikepdf mmap enabled

Using Tesseract OpenMP thread limit 3
pikepdf mmap enabled
    1 Rasterize with pnggray, rotation 0
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r599.999985x599.999985', '-dPDFSTOPONERROR', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.ripciktr/origin.pdf']
    1 Rotating output by 0
    1 resolution (599.9988, 599.9988)
    1 Running: ['tesseract', '-l', 'eng', '/tmp/ocrmypdf.io.ripciktr/000001_ocr.png', '/tmp/ocrmypdf.io.ripciktr/000001_ocr_hocr', 'hocr', 'txt']
    1 [tesseract] lots of diacritics - possibly poor OCR
    1 pikepdf.Matrix(0.12, 0, 0, -0.12, 0, 167.52)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 158, 173)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 158, 247)
    1 eng
    1 pikepdf.Matrix(0.999988, -0.00499994, 0.00499994, 0.999988, 272, 277)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 416, 296)
    1 eng
    1 pikepdf.Matrix(0.999988, -0.00499994, 0.00499994, 0.999988, 159, 332)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 222, 392)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 304, 419)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 769, 444)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 222, 472)
    1 eng
    1 pikepdf.Matrix(0.995885, -0.0906255, 0.0906255, 0.995885, 237, 498)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 223, 525)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 222, 551)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 340, 577)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 222, 577)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 159, 648)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 159, 708)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 192, 767)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 159, 793)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 159, 820)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 191, 852)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 224, 878)
    1 eng
    1 pikepdf.Matrix(0.999736, -0.0229939, 0.0229939, 0.999736, 258, 906)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 191, 937)
    1 eng
    1 pikepdf.Matrix(0.999968, 0.00799974, -0.00799974, 0.999968, 262, 955)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 223, 990)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 232, 1022)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 223, 1048)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 192, 1107)
    1 eng
    1 pikepdf.Matrix(1, 0, 0, 1, 224, 1133)
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
    1 Grafting
    1 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0)
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0

Postprocessing...
os.symlink(/tmp/ocrmypdf.io.ripciktr/graft_layers.pdf, /tmp/ocrmypdf.io.ripciktr/fix_docinfo.pdf)
Running: ['gs', '--version']
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.ripciktr/fix_docinfo.pdf', '/tmp/ocrmypdf.io.ripciktr/pdfa.ps']
GPL Ghostscript 10.03.1 (2024-05-02)
Copyright (C) 2024 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Running: ['tesseract', '--version']
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 18: treating as an optimization candidate
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1

xref 18: treating as an optimization candidate
xref 18: marking this JPEG as deflatable

xref 18: treating as an optimization candidate
xref 18: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization
Optimizable images: JBIG2 groups: 0

os.symlink(/tmp/ocrmypdf.io.ripciktr/optimize.opt.pdf, /tmp/ocrmypdf.io.ripciktr/optimize.pdf)
Running: ['jbig2', '--version']
Running: ['pngquant', '--version']
Image optimization ratio: 1.06 savings: 5.8%
Total file size ratio: 0.98 savings: -2.0%
/tmp/ocrmypdf.io.ripciktr/optimize.pdf -> output.pdf
Output file is a PDF/A-2B (as expected)
jbarlow83 commented 5 months ago

It appears to me that either

The input PDF sets a very small paper size, around 50x80 mm, or business card size. I imagine that if the paper size were set correctly and the images rescaled, the issue would disappear.
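
One way to check the paper size mentioned above is to read each page's MediaBox and convert it from PDF points (1/72 inch) to millimetres. The following is only a minimal sketch, not part of OCRmyPDF: the helper names `points_to_mm` and `page_sizes_mm` are made up for illustration, and the sketch assumes pikepdf is installed and that pages have no rotation or `UserUnit` scaling. The value 167.52 is taken from the first `pikepdf.Matrix` line in the log, on the assumption that it is the page height in points.

```python
PT_PER_INCH = 72.0   # PDF user space: 72 points per inch
MM_PER_INCH = 25.4


def points_to_mm(points: float) -> float:
    """Convert a length in PDF points to millimetres."""
    return points / PT_PER_INCH * MM_PER_INCH


def page_sizes_mm(pdf_path: str) -> list[tuple[float, float]]:
    """Return (width_mm, height_mm) for each page, based on its MediaBox.

    Hypothetical helper: ignores /Rotate and /UserUnit.
    pikepdf is imported lazily so points_to_mm() works without it.
    """
    import pikepdf

    sizes = []
    with pikepdf.open(pdf_path) as pdf:
        for page in pdf.pages:
            x0, y0, x1, y1 = (float(v) for v in page.mediabox)
            sizes.append((points_to_mm(x1 - x0), points_to_mm(y1 - y0)))
    return sizes


if __name__ == "__main__":
    # Assumed page height in points, read off the log's first matrix.
    print(round(points_to_mm(167.52), 1))  # 59.1
```

Calling `page_sizes_mm("test.pdf")` on the attached file should show whether the page really is only a few centimetres on each side, which is what the business-card estimate suggests.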