ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Please allow optimization to separate pages into layers #557

Closed rbrito closed 4 years ago

rbrito commented 4 years ago

Is your feature request related to a problem? Please describe.

This is perhaps a question and, possibly, a feature request.

It is not clear whether OCRmyPDF can take advantage of the layer separation in the mixed-mode files that scantailor/scantailor-advanced creates, so that the majority of pages are encoded in B/W instead of with DEFLATE (or, perhaps, JPEG).

Describe the solution you'd like

During optimization, separate each page into layers to obtain better compression ratios while keeping the appearance of the PDF unchanged.

Additional context

DjVu encoders are able to perform this, as is a Ruby program called pdfbeads. It would be superb to have OCRmyPDF do this also.

jbarlow83 commented 4 years ago

In some cases, OCRmyPDF will decide it can losslessly insert OCR and preserve how the page was originally organized. In other modes, we rasterize the page to an image at the widest common colorspace needed for that page (potentially converting pages with a small amount of color to all color).

In both cases we do image-level optimization, which can include color quantization all the way down to 1-bit images, but only when a whole image can reasonably be described with that colorspace. This happens to work in both of the above cases. Pages that were all black and white will be kept that way, so in the case of Scantailor, if you convert most pages to B/W and then use something like img2pdf to create PDFs with each image in the correct colorspace, you'll get a result that is fairly optimal for most pages.
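To make the "whole image" condition concrete, here is a minimal sketch of that kind of check. It is not OCRmyPDF's actual code: the function name `narrowest_mode`, both tolerance values, and the plain list-of-RGB-tuples input are assumptions chosen for illustration. It returns the narrowest mode that can describe every pixel of the image.

```python
def narrowest_mode(pixels, gray_tolerance=8, bilevel_tolerance=32):
    """Return "1", "L" or "RGB" for a list of (r, g, b) tuples.

    Hypothetical helper: thresholds are illustrative, not taken from OCRmyPDF.
    """
    # A pixel counts as gray when its three channels are nearly equal.
    if not all(max(p) - min(p) <= gray_tolerance for p in pixels):
        return "RGB"

    # A gray pixel counts as bilevel when it is close to pure black or white.
    def distance_to_extreme(p):
        level = sum(p) // 3
        return min(level, 255 - level)

    if all(distance_to_extreme(p) <= bilevel_tolerance for p in pixels):
        return "1"
    return "L"


if __name__ == "__main__":
    print(narrowest_mode([(0, 0, 0), (255, 255, 255)]))   # black and white only
    print(narrowest_mode([(0, 0, 0), (128, 128, 128)]))   # contains mid-gray
    print(narrowest_mode([(0, 0, 0), (255, 0, 0)]))       # contains red
```

A single colored pixel is enough to push the whole image to "RGB", which is why a page with a small color element ends up stored entirely in color.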

Segmenting the page into color regions would only work for the lossy case where we rasterize the page to a full-page image. Mainly it's a matter of finding the time to implement this feature. It would bring improvements, especially for documents with, e.g., a small amount of color on each page and otherwise black and white text, which are currently forced into color with a disappointing increase in file size.
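As a rough illustration of what segmenting a rasterized page could look like (a sketch only, since this feature is not implemented), the page can be cut into fixed-size tiles and only the tiles that contain color kept as color regions. The name `color_tiles`, the tile size, and the tolerance are all hypothetical, and the page is assumed to be a non-empty list of rows of (r, g, b) tuples.

```python
def color_tiles(page, tile=64, tolerance=8):
    """Return bounding boxes (left, top, right, bottom) of tiles containing color.

    Hypothetical sketch: everything outside the returned boxes could be
    encoded as a black and white layer.
    """
    height = len(page)
    width = len(page[0])
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            # A tile needs color if any pixel has clearly unequal channels.
            has_color = any(
                max(page[y][x]) - min(page[y][x]) > tolerance
                for y in range(top, bottom)
                for x in range(left, right)
            )
            if has_color:
                boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    white = (255, 255, 255)
    page = [[white] * 4 for _ in range(4)]
    page[3][3] = (255, 0, 0)  # one red pixel in the bottom-right corner
    print(color_tiles(page, tile=2))
```

A real implementation would need smarter region detection than a fixed grid, but the grid already shows the payoff: one small colored tile instead of a full-color page.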

The lossless case is too complex for segmentation. Unfortunately there are no real layers in PDF. The content stream just describes a sequence of drawing operations which can be modified by masks and transparency, which makes it quite difficult to reason about the final appearance of a page short of implementing full rasterization.
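To illustrate the point about content streams, the sketch below splits a small hand-written stream into its drawing operations. The sample stream and the helper `split_operations` are made up for illustration, and the tokenizer is deliberately naive: it assumes whitespace-separated tokens and does not handle strings with spaces, arrays, dictionaries, or inline images, so it is not a real PDF parser.

```python
def is_operand(token):
    """Treat names (/Im0), simple strings ((Hi)) and numbers as operands."""
    if token.startswith("/") or token.startswith("("):
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def split_operations(stream):
    """Return a list of (operator, operands) pairs in drawing order."""
    operations = []
    operands = []
    for token in stream.split():
        if is_operand(token):
            operands.append(token)
        else:
            operations.append((token, operands))
            operands = []
    return operations


if __name__ == "__main__":
    # Save state, apply a graphics state (which may carry a soft mask),
    # scale the coordinate system, paint an image, restore state.
    sample = "q /GS0 gs 612 0 0 792 0 0 cm /Im0 Do Q"
    for operator, operands in split_operations(sample):
        print(operator, operands)
```

Nothing in the resulting list says "layer". What `/Im0` ends up looking like depends on the `gs` and `cm` operations before it, which is why the final appearance is hard to determine without rasterizing.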

rbrito commented 4 years ago

Okay, I'm closing this issue... I guess that it is more appropriate for img2pdf...