ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

"--force-ocr" switch increases size of pdf by factor 25 #961

Open wildgruber opened 2 years ago

wildgruber commented 2 years ago

Hello,

System: Arch Linux, OCRmyPDF v13.4.2

We use ocrmypdf/tesseract to perform OCR on scanned books and printed historical material (never on freshly generated, clean PDFs), and often try to improve the OCR already present by running a new character recognition pass on the file via "--force-ocr".

This almost always gives recognition results orders of magnitude better than the ones already present (especially with more complicated scripts like Fraktur).

However: the resulting PDF, although it has perfect new OCR, has very often grown enormously in size, up to 25x bigger.

Why is that and, naively spoken: is it not possible to keep the existing page images and just swap the underlying text?

Thanks and thanks of course for this terrific program that so very much facilitated the use of tesseract!

Gerald.

jbarlow83 commented 2 years ago

> Why is that and, naively spoken: is it not possible to keep the existing page images and just swap the underlying text?

By using --force-ocr, you are specifically asking for the PDF to be rasterized to images and all text thrown away. In other modes, the underlying page images are preserved. In some more extreme cases, such as when there is embedded but corrupt text, --force-ocr is the only option for reconstructing a readable PDF.

--redo-ocr may give better results but cannot recognize all types of OCR text. There is no standard way to embed OCR text in a PDF or to distinguish regular printed text from OCR, and it is not trivial to determine whether a command that prints text will produce output visible to the user. If this feature does not work for you, I may be able to add detection for other types of OCR if you provide an example PDF.

Alternatively, optimization settings like --optimize 3 may reduce file size.
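The two suggestions above (--redo-ocr and --optimize 3) can be tried from a small script. This is only a minimal sketch: the helper names `build_ocrmypdf_cmd`, `size_ratio`, and `run_and_report` and the file names are hypothetical, it assumes the `ocrmypdf` executable is on the PATH, and it uses only the flags mentioned in this thread.

```python
import subprocess
from pathlib import Path

# Flags taken from this thread; the mode names "redo" and "force" are
# hypothetical labels used only by this sketch.
MODE_FLAGS = {"redo": "--redo-ocr", "force": "--force-ocr"}


def build_ocrmypdf_cmd(src, dst, mode="redo", optimize=3):
    """Build the argument list for one ocrmypdf run (nothing is executed)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be an integer from 0 to 3")
    return ["ocrmypdf", MODE_FLAGS[mode], "--optimize", str(optimize),
            str(src), str(dst)]


def size_ratio(src, dst):
    """Return output size divided by input size (25.0 means 25x bigger)."""
    return Path(dst).stat().st_size / Path(src).stat().st_size


def run_and_report(src, dst, mode="redo", optimize=3):
    """Run ocrmypdf and print how much the file grew or shrank."""
    subprocess.run(build_ocrmypdf_cmd(src, dst, mode, optimize), check=True)
    print(f"{mode}: output is {size_ratio(src, dst):.2f}x the input size")
```

Calling `run_and_report("scan.pdf", "scan_redo.pdf", mode="redo")` and then `run_and_report("scan.pdf", "scan_force.pdf", mode="force")` on the same input would show the size difference between the two modes for that file.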

mjg commented 2 years ago

It seems that in this case, feeding "rasterized images" helps tesseract do its job (tesseract works better at some resolutions than others, even if they come from down- or up-scaling the native image resolution), whereas keeping the original PDF (images) as a base for the OCR'ed text layers is what is desired. Is there a combination of command-line switches which does that? [If you do that on a mixed-type text/image PDF, it will lead to some text info being duplicated, I know.]

M3ssman commented 1 year ago

Encountered a similar issue with "born digitals" that include rather few images and a text layer with many uncopy-able control characters. I need to redo only the text. With --force-ocr the file size increases from about 3 MB to nearly 30 MB, but with --redo-ocr the results remain rather poor.

In such cases, isn't it possible to throw away the intermediate rasterized images and not include them in the final PDF, since they were not part of the original data?

jbarlow83 commented 12 months ago

Improved in v15, since ocrmypdf is more careful about how it selects the raster DPI.
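As a rough illustration of why the raster DPI matters for file size (this is not how OCRmyPDF selects its DPI, only generic arithmetic): the pixel count of a rasterized page grows with the square of the DPI. The function names, the US Letter page size, and the 3 bytes per pixel below are assumptions made for the example, and real PDFs store compressed images, so actual sizes will differ.

```python
def raster_pixels(width_in, height_in, dpi):
    """Pixel count of a page of the given size (in inches) at a given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Raw size of the raster before any compression (3 bytes = 8-bit RGB)."""
    return raster_pixels(width_in, height_in, dpi) * bytes_per_pixel


# Assumed example page: US Letter, 8.5 x 11 inches.
low = uncompressed_bytes(8.5, 11, 300)
high = uncompressed_bytes(8.5, 11, 600)
print(f"300 dpi: {low / 1e6:.1f} MB raw, 600 dpi: {high / 1e6:.1f} MB raw, "
      f"ratio {high / low:.0f}x")
```

Doubling the DPI quadruples the raw pixel data, so a raster DPI chosen higher than the page needs can inflate the output considerably.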