Support for JPEG2000, jp2 output

ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

http://ocrmypdf.readthedocs.io/

Mozilla Public License 2.0


Support for JPEG2000, jp2 output #445

Closed: aalmir closed this issue 1 year ago

aalmir commented 4 years ago

Describe the issue: Output PDF/A-2b with JPEG2000 (jp2) images. If the input file is a PDF/A-2b with jp2 images, the output should also contain jp2 images.

jbarlow83 commented 4 years ago

Ghostscript does not generate PDF/A with JPEG2000. It converts all of them to JPEG.
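To check what happened to the images in a converted file, a rough heuristic is to count the image filter names in the raw PDF bytes: `/JPXDecode` marks JPEG2000 and `/DCTDecode` marks JPEG. The sketch below is not part of OCRmyPDF; the function name `count_image_filters` is made up for illustration, and the scan only sees dictionaries stored uncompressed, so files that keep objects inside compressed object streams may undercount.

```python
import re
from collections import Counter

# PDF filter names: JPXDecode = JPEG2000, DCTDecode = baseline JPEG.
_FILTER_RE = re.compile(rb"/(JPXDecode|DCTDecode)\b")


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count JPXDecode/DCTDecode names in raw PDF bytes (heuristic only).

    Assumption: stream dictionaries are stored uncompressed. Dictionaries
    packed into compressed object streams are invisible to this scan.
    """
    return Counter(
        match.group(1).decode("ascii")
        for match in _FILTER_RE.finditer(pdf_bytes)
    )
```

Running it on `Path("input.pdf").read_bytes()` and on the converted output should show the `JPXDecode` count dropping and the `DCTDecode` count rising if the images were re-encoded as JPEG. A PDF library such as pikepdf would give a more reliable answer.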

It would be possible to replace the images after PDF/A conversion without losing PDF/A compliance in some cases. It depends on the complexity of the input JPEG2000 images.

There are problems with JPEG2000, however:

- lingering patent concerns, but relatively low risk now
- JPEG encoders have surpassed JPEG2000 in perceptual quality for the same bit budget
- the high complexity of the standard makes it difficult to check conformance
- decoding is not as optimized

There are still some applications where JPEG2000 is superior. What do you need JPEG2000 for?

aalmir commented 4 years ago

JPEG2000 has lossless compression and is much smaller than JPEG. We use JPEG2000 for long-term archiving of our images.

But rendering is much faster with JPEG PDFs; some pages with JPEG2000 are up to 5 times slower to render.

Maybe JPEGXL will make everything better...

jbarlow83 commented 4 years ago

JPEGXL sounds promising but I don't think it's in the PDF 2.0 spec at all, so we're many years from it being in use in PDF.

I personally think JPEG is a better long-term archiving format despite its limitations. JPEG will never go away since so much media exists in it, but JPEG2000 has been superseded.

For now, you can use --pdfa-image-compression=lossless to make a PDF/A with all lossless images, although they will be Flate-encoded rather than JPEG2000.
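For scripting, the lossless option could be wrapped roughly like this. It is a minimal sketch: the helper names `build_lossless_pdfa_command` and `run_lossless_pdfa` are hypothetical, and it assumes `ocrmypdf` is installed and on `PATH`. Only the `--pdfa-image-compression=lossless` flag comes from the comment above; `--output-type pdfa` is added to make the PDF/A target explicit.

```python
import shutil
import subprocess
from typing import List


def build_lossless_pdfa_command(input_pdf: str, output_pdf: str) -> List[str]:
    """Build an ocrmypdf command line requesting lossless PDF/A images."""
    return [
        "ocrmypdf",
        "--output-type", "pdfa",
        "--pdfa-image-compression=lossless",
        input_pdf,
        output_pdf,
    ]


def run_lossless_pdfa(input_pdf: str, output_pdf: str) -> int:
    """Run the command and return its exit code (hypothetical helper)."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf was not found on PATH")
    command = build_lossless_pdfa_command(input_pdf, output_pdf)
    return subprocess.run(command).returncode
```

Calling `run_lossless_pdfa("scan.pdf", "scan_pdfa.pdf")` would run the conversion; the resulting images are lossless but not JPEG2000.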

It would be possible for the optimizer to put JPEG2000 images back into the file after PDF/A conversion, for straightforward images and color spaces. Or to try using JPEG2000 encoding.

IcedQuinn commented 1 year ago

> Maybe JPEGXL will make everything better...

If only Google hadn't up and killed it :pain: