ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Feature]: convert grayscale PDF to jbig monochrome while doing OCR #1246

Closed callegar closed 7 months ago

callegar commented 7 months ago

Describe the proposed feature

OCRmyPDF seems to be quite good at producing a B/W image for the OCR engine. It would be great to reuse this machinery to produce extra-small PDFs by converting grayscale pages to B/W, which can then be compressed with JBIG2.

Rationale: having small OCRed PDFs. Many scanners give you poor quality if you scan directly to "lineart" because they use a global threshold. With them, scanning to grayscale is the only option. Converting the scan to B/W "by hand" afterwards can give good results, but it is a delicate process. Some tools (e.g. the textcleaner ImageMagick scripts) can help, but things remain laborious. It would be great to have ocrmypdf capable of doing this automatically.
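To illustrate why a global threshold can hurt, here is a minimal pure-Python sketch comparing a fixed cutoff with a simple local-mean threshold on a synthetic page whose background darkens from left to right. The function names, the cutoff of 128, the window radius, and the offset are all assumptions chosen for the illustration; this is not the algorithm that any scanner, OCRmyPDF, or its OCR engine actually uses.

```python
def make_page(width=12, height=5, ink_columns=(2, 9)):
    """Synthetic grayscale page (0 = black, 255 = white), hypothetical data.

    The background fades from 240 on the left to 108 on the right,
    imitating uneven illumination; ink is 100 levels darker than the
    local background.
    """
    page = []
    for _ in range(height):
        row = []
        for x in range(width):
            background = 240 - 12 * x
            row.append(background - 100 if x in ink_columns else background)
        page.append(row)
    return page


def global_threshold(gray, cutoff=128):
    """Mark a pixel as ink (1) when it is darker than one fixed cutoff."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def local_mean_threshold(gray, radius=2, offset=10):
    """Mark a pixel as ink (1) when it is clearly darker than the mean
    of its neighbourhood (a square window clipped at the page edges)."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            mean = sum(window) / len(window)
            row.append(1 if gray[y][x] < mean - offset else 0)
        out.append(row)
    return out


def ink_columns_of(bitmap):
    """Columns flagged as ink in the first row of a 1-bit result."""
    return [x for x, bit in enumerate(bitmap[0]) if bit]


page = make_page()
print("global:", ink_columns_of(global_threshold(page)))
print("local: ", ink_columns_of(local_mean_threshold(page)))
```

On this synthetic page the fixed cutoff also flags the darkest background columns as ink, while the local rule keeps only the two strokes. Real scans are noisier, so the window size and offset would need tuning.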

jbarlow83 commented 7 months ago

Very aggressive pngquant settings might do the job, because it can quantize all the way down to 1bpp, which triggers JBIG2 conversion. Adjust the optimization threshold and use -O3.
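As a starting point for experimenting with this suggestion, the sketch below assembles an `ocrmypdf` command line with `--optimize 3` and a low `--png-quality`. The helper name, the file names, and the quality value of 5 are assumptions for illustration; whether a given setting actually quantizes a page down to 1bpp depends on the input, and JBIG2 output also requires a JBIG2 encoder to be installed.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, png_quality=5):
    """Return an argv list for an aggressive optimization run.

    png_quality=5 is a guess at "very aggressive", not a recommended
    or verified value; adjust it and inspect the output.
    """
    return [
        "ocrmypdf",
        "--optimize", "3",                  # same as -O3
        "--png-quality", str(png_quality),  # lower means smaller output
        input_pdf,
        output_pdf,
    ]


command = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Building the argument list separately from running it makes it easy to try several quality values and compare the resulting file sizes.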

This could also be done with a plugin that simply sets the "image for display" equal to the image for OCR and uses force-ocr mode; that might work for you. Alternatively, you could replace the optimize plugin with your own to have it act more aggressively.

Generally speaking, the next step for optimization would be a mixed raster conversion. (Segment the image, determine what sections can be expressed in a lower colorspace.) As it happens, archive.org has a tool for this, but it's not license compatible (I think it's AGPL), and I'm a bit reluctant to rewrite something that already exists in open source. There was an effort to see if we could get the license changed.

I can also imagine some clever alternate approaches using ML algorithms to explore optimization of PDF pages across all algorithms and multiple resolutions for a minimum bit budget/loss ratio.

Anyway, too many options, too little time, so I've outlined what I think can be achieved using existing tools.