I got the warning 'lots of diacritics - possibly poor OCR'. I opened the output PDF and tried selecting text, but some of the text could not be selected. So I ran tesseract on those areas directly with grimblast save area - | tesseract - - | wl-copy ; notify-send "$(wl-paste)", and tesseract recognized the text successfully. Why did standalone tesseract work when ocrmypdf didn't?
The PDF was 250 pages, and I got the error on all of the pages. I extracted a single page from the PDF and ran ocrmypdf on it in order to reduce the size of the file I would have to upload here.
Two explanations are possible: either
the workflow you used for tesseract does not preserve stderr (standard error), so the message it reports about diacritics is lost (ocrmypdf intercepts this message and turns it into a warning because it is a reliable indicator that OCR may be poor), or
the workflow you used produces a different image resolution, one that does not cause tesseract to "see" diacritics in the text in this particular case.
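One way to check whether tesseract's diagnostics are being lost is to save stdout and stderr to separate files. This is only a sketch: capture_streams is a hypothetical helper name, the file names in the comment are placeholders, and whether a diacritics message appears at all depends on the image.

```shell
# Run a command, keeping its stdout and stderr in separate files.
# capture_streams is an illustrative helper, not part of tesseract or ocrmypdf.
capture_streams() {
    out_file=$1
    err_file=$2
    shift 2
    # Everything after the two file names is the command to run.
    "$@" >"$out_file" 2>"$err_file"
}

# Possible use with a screenshot saved beforehand (placeholder names):
#   capture_streams ocr.txt ocr.err tesseract page.png -
#   cat ocr.err    # whatever tesseract wrote to stderr
```

If ocr.err stays empty for the same region that triggers the warning in ocrmypdf, the second explanation (a different image resolution) becomes the more likely one.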
The input PDF sets a very small paper size, around 50x80 mm, roughly business card size. I imagine that if the paper size were set correctly and the images rescaled, the issue would disappear.
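To illustrate why the declared paper size matters: the resolution implied for a full-page image is its pixel count divided by the page dimension in inches. The sketch below assumes a hypothetical scan 2480 pixels wide, since the real pixel dimensions of test.pdf are not stated here; implied_dpi is an illustrative helper name.

```shell
# Resolution implied when an image spans a page dimension of the given size.
# Arguments: size in pixels, page dimension in millimetres.
# implied_dpi is an illustrative helper, not an ocrmypdf or tesseract command.
implied_dpi() {
    awk -v px="$1" -v mm="$2" 'BEGIN { printf "%.0f\n", px / (mm / 25.4) }'
}

# The pixel width is an assumption chosen for illustration only.
implied_dpi 2480 210   # page declared 210 mm wide (A4): about 300 DPI
implied_dpi 2480 50    # same pixels on a page declared 50 mm wide: about 1260 DPI
```

The same pixels on a page declared four times too small imply a resolution about four times higher, which is consistent with the suggestion that correcting the paper size and rescaling the images could change what tesseract reports.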
Describe the bug
I got the warning 'lots of diacritics - possibly poor OCR'. I opened the output PDF and tried selecting text, but some of the text could not be selected. So I ran tesseract on those areas directly:
grimblast save area - | tesseract - - | wl-copy ; notify-send "$(wl-paste)"
Tesseract recognized the text successfully. Why did standalone tesseract work when ocrmypdf didn't?
Steps to reproduce
Files
The PDF was 250 pages, and I got the error on all of the pages. I extracted a single page from the PDF and ran ocrmypdf on it in order to reduce the size of the file I would have to upload here.
test.pdf output.pdf
How did you download and install the software?
Linux package manager - AUR
OCRmyPDF version
16.3.1
Relevant log output