Closed user1823 closed 2 months ago
Probably corrupt font, but will need test file.
If I rewrite this file using Ghostscript (with the below command) and then use ocrmypdf, the issue disappears.

```
gswin64.exe -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=gs.pdf in.pdf
```
But still, the quality of the OCR is very poor. OCRmyPDF barely changes any of the original (incorrect) text when using the `--redo-ocr` option.
I see this warning/advice in the terminal:

```
1 some text on this page cannot be mapped to characters: consider using --force-ocr instead
```
Now, if I use `--force-ocr`, the quality of the OCR is drastically better, but the file size increases by 15%. So, is using `--force-ocr` the only way to OCR this file? Or is there any other hack available that I can use or you can add to OCRmyPDF?
Or is the issue (of `--redo-ocr` not being useful) caused by Ghostscript? Is the OCRed text accurate when the current (unreleased) version of OCRmyPDF is used on the original file?
The issue was with how pdfminer interpreted the Unicode mapping data. If Ghostscript rewrote it, that could have worked around the issue; even a one-byte adjustment could have been a workaround.
`--redo-ocr` has some limitations: there's no standard way of encoding OCR or marking text as OCR, so it can't detect all cases.
`--force-ocr` is the best option for this file.
Regarding this:

```
1 some text on this page cannot be mapped to characters: consider using --force-ocr instead
```
That means the mapping to Unicode is incomplete. This can cause characters to appear correct when selected, but copy-paste as gibberish; the behavior will also vary between PDF viewers, since some apply heuristics to detect the text encoding. That's why it's best to throw out everything and force OCR for this file.
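To make the "incomplete mapping" concrete: a PDF font's `/ToUnicode` CMap maps glyph codes to Unicode via `bfchar`/`bfrange` entries, and any glyph code the CMap omits has no Unicode value, so extraction falls back to a replacement character. Here is a toy sketch of that failure mode (the CMap fragment and helper functions are hypothetical illustrations, not OCRmyPDF or pdfminer code):

```python
import re

# A fragment of a hypothetical /ToUnicode CMap: glyph code -> code point.
# The entry for glyph <0043> is deliberately missing.
CMAP_TEXT = """
beginbfchar
<0041> <0048>
<0042> <0065>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Parse bfchar pairs into {glyph_code: unicode_char}."""
    mapping = {}
    for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>", cmap_text):
        mapping[int(src, 16)] = chr(int(dst, 16))
    return mapping

def extract(glyph_codes, mapping):
    """Map glyph codes to text; unmapped codes become U+FFFD ("gibberish")."""
    return "".join(mapping.get(code, "\ufffd") for code in glyph_codes)

mapping = parse_bfchar(CMAP_TEXT)
print(extract([0x41, 0x42, 0x43], mapping))  # "He" followed by U+FFFD
```

The rendered glyphs still look fine on screen, because rendering uses the font's glyph outlines, not the CMap; only selection, search, and copy-paste go through the broken mapping.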
> this can cause characters to appear correctly when selected, but they will copy-paste as gibberish
Is it not possible for OCRmyPDF to correct the mapping of the characters based on the characters detected by OCR?
To clarify, my question is not whether OCRmyPDF is currently able to correct the mapping (which I assume it can't). My question is whether OCRmyPDF can be modified to be able to correct the mapping.
The main reasons why I don't want to use `--force-ocr` include:

If there is a way around it, I would really like to avoid using `--force-ocr`.
> Is it not possible for OCRmyPDF to correct the mapping of the characters based on the characters detected by OCR?
Possible but hard. That's pretty major surgery, and the results from doing something like `--force-ocr` are often better. Ghostscript recently added a mode that attempts to fix broken font mappings (whether the font is OCR-derived or of some other origin).
You can avoid lossy recompression using `--output-type pdf` and `--optimize 1`. The images do get rendered, but at a higher DPI than their source, so this is safe in almost all cases.
> Possible but hard. That's pretty major surgery
I would really appreciate it if such a feature were eventually added to OCRmyPDF (since you said it's hard, I don't expect it anytime soon).
> Ghostscript recently added a mode that attempts to fix broken font mappings
Can you please tell me how to activate that mode?
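For reference, the mode in question appears to be pdfwrite's `UseOCR` control, added in Ghostscript 9.53+ and only available when Ghostscript is built with the Tesseract OCR engine; verify the exact values against your Ghostscript version's documentation before relying on this:

```shell
# Assumption: a Tesseract-enabled Ghostscript build (gs 9.53+).
# UseOCR=AsNeeded asks pdfwrite to OCR text only where the existing
# Unicode mapping is missing or unreliable (other values: Never, Always).
gs -sDEVICE=pdfwrite -dBATCH -dNOPAUSE \
   -sUseOCR=AsNeeded -sOutputFile=fixed.pdf in.pdf
```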
> You can avoid lossy recompression using `--output-type pdf` and `--optimize 1`.
Isn't `--optimize 1` the default?
**Describe the bug**

OCR failed to complete.

**Steps to reproduce**

**Files**

Let me know if you need the file (if the issue is not clear from the error message).

**How did you download and install the software?**

PyPI (pip, poetry, pipx, etc.)

**OCRmyPDF version**

16.4.2

**Relevant log output**