Closed boredland closed 7 years ago
No current plans with respect to Tesseract. I think the two best approaches would be:
Run Tesseract output through a PDF->PDF compressor to produce a JBIG2 PDF. These exist commercially; I am not aware of an open source implementation.
Feed JBIG2 to Tesseract as an input format. This would require teaching Leptonica to decompress JBIG2 and teaching Tesseract to copy the input JBIG2 over to the output PDF. Downsides include the complexity of dealing with the multipage aspect and the inability to extend this approach to mixed raster content.
Start with a JBIG2 PDF without a text layer (somehow). Rasterize it to images. Ask Tesseract to produce invisible-text-only PDF output (this is already supported). Merge the two PDFs together (open source tools, including pdftk, can already do this).
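For what it's worth, here is a rough sketch of that last rasterize / OCR / merge workflow for a single-page PDF. It only builds the command lines; it does not run anything. The tool choices are my assumptions, not something confirmed in this thread: `pdftoppm` (Poppler) for rasterizing, Tesseract's `textonly_pdf` config variable for the invisible-text-only output, and pdftk's `multistamp` for the merge. The function name `sandwich_commands` and the file names are made up for illustration.

```python
def sandwich_commands(image_pdf, merged_pdf, dpi=300, stem="page"):
    """Build (but do not run) the commands for a single-page, image-only PDF.

    Assumptions: pdftoppm, tesseract and pdftk are installed, and the
    Tesseract build supports the textonly_pdf config variable.
    """
    return [
        # 1. Rasterize the page to <stem>.png at a known resolution.
        ["pdftoppm", "-r", str(dpi), "-png", "-singlefile", image_pdf, stem],
        # 2. OCR the raster; textonly_pdf=1 asks for <stem>.pdf containing
        #    only the invisible text layer, with no image.
        ["tesseract", f"{stem}.png", stem, "-c", "textonly_pdf=1", "pdf"],
        # 3. Stamp the text-only page onto the original JBIG2 page.
        ["pdftk", image_pdf, "multistamp", f"{stem}.pdf", "output", merged_pdf],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order, e.g. for `sandwich_commands("scan_jbig2.pdf", "scan_searchable.pdf")`. Multipage input would need per-page handling, which this sketch deliberately leaves out.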
flate-compression currently used
Tesseract will carry forward CCITT Group 4 compression if handed TIFF G4 input.
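If you want to check that an input file really is TIFF G4 before relying on this, something like the sketch below should do. It reads the Compression tag (259) from the first image directory of a classic TIFF and compares it to 4, the value the TIFF specification assigns to CCITT Group 4. The helper names are hypothetical, and it only looks at the first page and does not handle BigTIFF.

```python
import struct

COMPRESSION_TAG = 259   # TIFF "Compression" tag
CCITT_GROUP4 = 4        # Compression value for CCITT T.6 (Group 4)


def tiff_compression(data: bytes):
    """Return the Compression value from the first IFD, or None if absent."""
    if data[:2] == b"II":
        endian = "<"    # little-endian file
    elif data[:2] == b"MM":
        endian = ">"    # big-endian file
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i          # each IFD entry is 12 bytes
        tag, _type, _n = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT value is stored left-justified in the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return None


def is_group4(data: bytes) -> bool:
    """True if the first page of the TIFF is CCITT Group 4 compressed."""
    return tiff_compression(data) == CCITT_GROUP4
```

Usage would be along the lines of `is_group4(open("page.tif", "rb").read())`, where `page.tif` is a placeholder name.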
Are there any plans to offer an alternative to the, I assume, Flate compression currently used?
https://github.com/tesseract-ocr/tesseract/blob/ca16a08c10/api/pdfrenderer.cpp#L729
Searching for a place to introduce JBIG2 compression into my quite typical workflow, I stumbled upon Tesseract actually generating sandwiches, which is nice. Are there any plans to offer an alternative to the, I assume, Flate compression currently used?