tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Pdf-Output with jbig2enc? #726

Closed boredland closed 7 years ago

boredland commented 7 years ago

While looking for a place to introduce JBIG2 compression into my fairly typical workflow, I stumbled upon Tesseract actually generating sandwich PDFs, which is nice. Are there any plans to offer an alternative to the (I assume) Flate compression currently used?

jbreiden commented 7 years ago

No current plans with respect to Tesseract. I think the three best approaches would be:

  1. Run Tesseract output through a PDF->PDF compressor to produce a JBIG2 PDF. These exist commercially; I am not aware of an open source implementation.

  2. Feed JBIG2 to Tesseract as an input format. This would require teaching Leptonica to decompress JBIG2, and teaching Tesseract to copy the input JBIG2 over to the output PDF. Downsides include the complexity of dealing with the multipage aspect, and the inability to extend this approach to mixed raster content.

  3. Start with a JBIG2 PDF without a text layer (somehow). Rasterize it to images. Ask Tesseract to produce invisible-text-only PDF output (this is already supported). Merge the two PDFs together (open source tools, including pdftk, can already do this).
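
As a rough illustration of approach 3, the sketch below builds the three command lines involved. It is not a tested pipeline. It assumes `pdftoppm` (from poppler-utils) for rasterizing, a Tesseract build that supports the `textonly_pdf` config variable, and pdftk's `multibackground` operation for the merge. The function names, file names and the 300 dpi default are hypothetical choices made for this example, and the page sizes of the two PDFs must match for the text to line up.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(image_pdf, prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page>.png
    return ["pdftoppm", "-r", str(dpi), "-png", image_pdf, prefix]


def text_only_cmd(page_list, outbase):
    # page_list is a text file with one image path per line;
    # textonly_pdf=1 asks Tesseract for the invisible text layer only
    return ["tesseract", page_list, outbase, "-c", "textonly_pdf=1", "pdf"]


def merge_cmd(text_pdf, image_pdf, out_pdf):
    # multibackground puts page N of image_pdf behind page N of text_pdf
    return ["pdftk", text_pdf, "multibackground", image_pdf, "output", out_pdf]


def add_text_layer(image_pdf, out_pdf, workdir="work"):
    """Run the three steps; requires the external tools to be installed."""
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    subprocess.run(rasterize_cmd(image_pdf, str(work / "page")), check=True)
    # Assumes pdftoppm zero-pads page numbers so that a plain sort keeps order
    pages = sorted(work.glob("page-*.png"))
    page_list = work / "pages.txt"
    page_list.write_text("\n".join(str(p) for p in pages) + "\n")
    subprocess.run(text_only_cmd(str(page_list), str(work / "text")), check=True)
    subprocess.run(merge_cmd(str(work / "text.pdf"), image_pdf, out_pdf), check=True)
```

Calling `add_text_layer("scan_jbig2.pdf", "scan_searchable.pdf")` would leave the original JBIG2 image streams untouched, since pdftk only overlays the text-only pages on them.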

> flate-compression currently used

Tesseract will carry forward CCITT Group 4 compression if handed TIFF G4 input.
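
To take advantage of this, it may help to confirm that an input TIFF really is Group 4 compressed before handing it to Tesseract. The following is a minimal sketch using only the standard library. It reads the Compression tag (259) from the first image directory of a classic TIFF, where the value 4 means CCITT Group 4. The function name is hypothetical, and the sketch does not handle BigTIFF or inspect later pages of a multipage file.

```python
import struct

COMPRESSION_TAG = 259  # TIFF "Compression" tag
CCITT_GROUP4 = 4       # value meaning CCITT T.6 (Group 4) compression


def tiff_compression(data: bytes):
    """Return the Compression value of the first IFD, or None if the tag is absent."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF")
    order = data[:2]
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _type, _n = struct.unpack_from(endian + "HHI", data, entry)
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return None


def is_group4(path):
    with open(path, "rb") as f:
        return tiff_compression(f.read()) == CCITT_GROUP4
```

If `is_group4("scan.tif")` is true, running `tesseract scan.tif out pdf` should, per the comment above, keep the G4 data in the output PDF.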

amitdo commented 7 years ago

> Are there any plans to offer alternative to the, i assume, flate-compression currently used?

https://github.com/tesseract-ocr/tesseract/blob/ca16a08c10/api/pdfrenderer.cpp#L729