ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Support for JPEG2000, jp2 output #445

aalmir closed this issue 1 year ago

aalmir commented 4 years ago

Describe the issue: Output of PDF/A-2b with JPEG2000 (jp2) images. If the input file is a PDF/A-2b with jp2 images, the output should also contain jp2 images.

jbarlow83 commented 4 years ago

Ghostscript does not generate PDF/A with JPEG2000. It converts all of them to JPEG.
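One rough way to see this on your own files is to count the image filter names in the PDF before and after conversion. The sketch below is only a byte-level heuristic, not a PDF parser: it assumes the file is unencrypted and that the filter names appear literally as /JPXDecode (JPEG2000) and /DCTDecode (JPEG). The function name `count_filter_names` is made up for this example.

```python
import re
from collections import Counter

# PDF stream filter names for the two codecs discussed in this thread.
FILTERS = {
    b"JPXDecode": "JPEG2000",
    b"DCTDecode": "JPEG",
}


def count_filter_names(pdf_bytes: bytes) -> Counter:
    """Count literal occurrences of each filter name in the raw PDF bytes.

    Heuristic only: it does not parse the PDF, so it can miscount if a
    name is written with #-escapes or appears outside an image dictionary.
    """
    counts = Counter()
    for name, label in FILTERS.items():
        # Match the name token (e.g. "/JPXDecode") and make sure it is not
        # just the prefix of a longer name.
        pattern = rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[label] = len(re.findall(pattern, pdf_bytes))
    return counts
```

For a real file, something like `count_filter_names(open("output.pdf", "rb").read())` (placeholder file name) would be expected to report zero JPEG2000 entries after Ghostscript's PDF/A conversion, if the behavior described above holds.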

It would be possible to replace the images after PDF/A conversion without losing PDF/A compliance in some cases. It depends on the complexity of the input JPEG2000 images.

There are problems with JPEG2000, however:

There are still some applications where JPEG2000 is superior. What do you need JPEG2000 for?

aalmir commented 4 years ago

JPEG2000 has lossless compression and is much smaller than JPEG. We use JPEG2000 for long-term archiving of our images.

But rendering is much faster with JPEG PDFs; some pages with JPEG2000 are up to 5 times slower to render.

Maybe JPEGXL will make everything better...

jbarlow83 commented 4 years ago

JPEGXL sounds promising but I don't think it's in the PDF 2.0 spec at all, so we're many years from it being in use in PDF.

I personally think JPEG is a better long-term archiving format despite its limitations. JPEG will never go away since so much media exists in it, but JPEG2000 has been superseded.

For now, you can use --pdfa-image-compression=lossless to make a PDF/A with all lossless images, although they will be Flate encoded.
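As a minimal sketch of that workaround, the snippet below builds and runs the command from Python. The file names are placeholders and `build_command` / `run_lossless_pdfa` are names invented for this example. It assumes the `ocrmypdf` executable is on PATH. The `--output-type pdfa-2` argument is included only to make the PDF/A-2 target explicit.

```python
import shutil
import subprocess


def build_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Return the ocrmypdf command line for a lossless PDF/A-2 output."""
    return [
        "ocrmypdf",
        "--output-type", "pdfa-2",
        # Keep images lossless during PDF/A conversion (Flate, not JPEG2000).
        "--pdfa-image-compression=lossless",
        input_pdf,
        output_pdf,
    ]


def run_lossless_pdfa(input_pdf: str, output_pdf: str) -> None:
    """Run ocrmypdf, failing early if the executable is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_command(input_pdf, output_pdf), check=True)
```

Calling `run_lossless_pdfa("scan.pdf", "scan_pdfa.pdf")` would then produce a PDF/A whose images are losslessly compressed, at the cost of larger files than the JPEG2000 originals.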

It would be possible for the optimizer to put JPEG2000 images back into the file after PDF/A conversion, for straightforward images and color spaces, or to try using JPEG2000 encoding itself.

IcedQuinn commented 1 year ago

> Maybe JPEGXL will make everything better...

If only Google hadn't up and killed it :pain: