internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.
https://archive-pdf-tools.readthedocs.io/en/latest/
GNU Affero General Public License v3.0

A certain PDF from Archive.org does not display all of its contents on Mac OS #66

Closed: EngineersNeedArt closed this issue 1 year ago

EngineersNeedArt commented 1 year ago

This PDF:

https://archive.org/download/htewypc/EntertainPocketCalculator.pdf

Tried Preview, Acrobat Reader — all on Mac OS Monterey (an M1 MacBook Pro FWIW).

MerlijnWajer commented 1 year ago

Thank you for the report.

I looked at this briefly today with a colleague and we didn't quite figure out why Preview doesn't like it. We also found that iOS PDF rendering has the same issues with this PDF.

I'll try to create a few different versions of the PDF and see if they show the same problems (using CCITT instead of JBIG2 for example). All the PDFs we generate are validated against PDF/A using veraPDF, and I don't think the JBIG2 or JPEG2000 bitstreams are wrong, so I'm a bit puzzled. I'll try to get back to you soon.

MerlijnWajer commented 1 year ago

@jrochkind and @EngineersNeedArt - I've created several PDFs from the item in question. I was hoping one of you could open them in your PDF viewers and let me know which ones work and which ones don't. They might not all look the same, but that doesn't matter for the purpose of this test (I hope). Testing the files would be very valuable for the blind debugging that Apple Preview requires. The files in question are:

  1. normal-jbig2.pdf
  2. normal-ccitt.pdf
  3. grok-jbig2.pdf
  4. grok-ccitt.pdf
  5. jpeg-ccitt.pdf

You can find them here: https://archive.org/~merlijn/preview-debugging/

Thanks in advance.

EngineersNeedArt commented 1 year ago

The only one that displays the missing content is "jpeg-ccitt.pdf".

The pages (save the first page) however need a white background drawn first to look correct.

[Screenshot: Screen Shot 2023-05-16 at 9 24 05 AM]

EngineersNeedArt commented 1 year ago

So it looks like there is an issue with JP2 on Mac OS. I downloaded the "SINGLE PAGE PROCESSED JP2 ZIP" for the same book and only two of the images opened successfully in Apple's Preview. The rest were met with an error dialog suggesting that the file was damaged.

Console shows logs like:

"PVImageContainer initWithURL:file:///Users/calhoun/Downloads/EntertainPocketCalculator_jp2/EntertainPocketCalculator_0199.jp2 failed, error = Error Domain=NSCocoaErrorDomain Code=259 "The file “EntertainPocketCalculator_0199.jp2” could not be opened." UserInfo={NSURL=file:///Users/calhoun/Downloads/EntertainPocketCalculator_jp2/EntertainPocketCalculator_0199.jp2, NSLocalizedDescription=The file “EntertainPocketCalculator_0199.jp2” could not be opened., NSLocalizedRecoverySuggestion=It may be damaged or use a file format that Preview doesn’t recognize.}"

"initialize:1291: *** invalid JPEG2000 file"

MerlijnWajer commented 1 year ago

Thanks for testing. This does suggest that the problem lies in the use of JPEG2000 images. To clarify, the normal and grok files both use JPEG2000 for the foreground and background images, while the jbig2 and ccitt part of the file name only refers to the encoding of the 1-bit mask.
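To make the layer roles concrete, here is a minimal sketch of how a mask selects between a foreground and a background layer. It is not code from archive-pdf-tools; the function name `compose_mrc`, the list-of-lists image representation, and the mask polarity (1 selects the foreground) are all assumptions made for illustration.

```python
def compose_mrc(mask, foreground, background):
    """Compose one page from three equal-sized 2D lists.

    Assumption for this sketch: a mask value of 1 selects the foreground
    pixel, and 0 lets the background pixel show through.
    """
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# A mostly black foreground and a mostly white background, similar in
# spirit to the two JPEG2000 layers described for this book.
mask = [[0, 1, 0], [1, 1, 0]]
foreground = [[10, 10, 10], [10, 10, 10]]
background = [[250, 250, 250], [250, 250, 250]]
page = compose_mrc(mask, foreground, background)
```

In this model, a viewer that fails to decode the foreground or background layer has nothing to paint, even if the mask itself decodes fine.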

I am aware of the visual artifacts in the JPEG version. I only added JPEG support recently to experiment with it, and it is not usable in any production setting currently.

I guess next up would be for me to extract the JPEG2000 images from the PDFs, and confirm with you (or a colleague) that Preview also cannot open some of the JPEG2000 images that are embedded in the PDF.

Could you share with me how you get to these console logs, so that I can also do some digging?

MerlijnWajer commented 1 year ago

Regarding this part of your comment:

> So it looks like there is an issue with JP2 on Mac OS. I downloaded the "SINGLE PAGE PROCESSED JP2 ZIP" for the same book and only two of the images opened successfully in Apple's Preview. The rest were met with an error dialog suggesting that the file was damaged.

The "SINGLE PAGE PROCESSED JP2 ZIP" files aren't what is in the PDFs - those are further compressed JPEG2000s, but it's good to know that even the "SINGLE PAGE PROCESSED JP2 ZIP" files do not work for you.

EngineersNeedArt commented 1 year ago

I open the app "Console" on the Mac. Filter by "Preview".

[Screenshot: Screen Shot 2023-05-16 at 9 41 20 AM]

MerlijnWajer commented 1 year ago

Could you confirm which page is the first one that doesn't render correctly? Someone wrote on HN that pages 2-10 do not render, so perhaps we can try with page 5.

I ran this on my laptop:

pdfimages normal-jbig2.pdf -all -f 5 -l 5 jp2test

This produces jp2test-000.jp2, jp2test-001.jp2 and jp2test-002.jb2e.
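For anyone who wants to repeat this extraction for several pages, the command above could be scripted roughly as follows. This is only a sketch: it assumes poppler's `pdfimages` is on the PATH, and the helper names `pdfimages_command` and `extract_page_images` are made up for this example.

```python
import subprocess


def pdfimages_command(pdf_path, page, output_root):
    """Build the argument list for extracting one page's images as-is.

    -all keeps each image in its embedded format; -f and -l set the
    first and last page to process.
    """
    return [
        "pdfimages", "-all",
        "-f", str(page), "-l", str(page),
        pdf_path, output_root,
    ]


def extract_page_images(pdf_path, page, output_root):
    """Run pdfimages; raises CalledProcessError if the tool fails."""
    subprocess.run(pdfimages_command(pdf_path, page, output_root), check=True)
```

Calling `extract_page_images("normal-jbig2.pdf", 5, "jp2test")` is meant to correspond to the command shown above.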

I have uploaded these files and the files converted to PNG (so that you can visually compare) to this URL: https://archive.org/~merlijn/preview-debugging/page-5/

The JPEG2000 images will just be a mostly white image and a mostly black image (due to the nature of the content). The JBIG2 image (.jb2e) will almost certainly not open in Preview regardless: it is an "embedded" JBIG2 and lacks the required headers. For the purpose of this test you can disregard the JBIG2 entirely. I've only added it (also in PNG form) so that you can see what the mask looks like.
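A quick way to see which of the extracted files carry a standalone file header is to check their first bytes. The sketch below is an illustration, not part of any tool mentioned here; the function name `sniff` and the returned labels are invented for the example.

```python
# Leading bytes of the formats discussed here.
JP2_SIGNATURE = bytes.fromhex("0000000c6a5020200d0a870a")  # JP2 signature box
J2K_CODESTREAM = bytes.fromhex("ff4fff51")                  # raw codestream (SOC + SIZ)
JBIG2_FILE_HEADER = bytes.fromhex("974a42320d0a1a0a")       # standalone JBIG2 file


def sniff(data: bytes) -> str:
    """Return a rough label for the container format of `data`."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2"
    if data.startswith(J2K_CODESTREAM):
        return "j2k-codestream"
    if data.startswith(JBIG2_FILE_HEADER):
        return "jbig2-file"
    # An embedded JBIG2 stream taken from a PDF has no file header,
    # so it ends up here.
    return "unknown"
```

Usage would be along the lines of `sniff(open("jp2test-002.jb2e", "rb").read(16))`, which should come back as "unknown" if the stream really is headerless.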

So please check if you can open these JPEG2000 images at all in Preview, and if not, I'll try to figure out why not.

jrochkind commented 1 year ago

Interesting, I had heard from some colleagues that "MacOS [Preview and Safari] have trouble displaying some jpeg2000", but hadn't reproduced myself and hadn't been sure how widespread this was.

I wonder if we can figure out what aspects of the jpeg2000 are triggering a problem. My MacOS Preview has no problem displaying any of the jpeg2000 images I have created myself so far.

I extracted the images from https://archive.org/download/htewypc/EntertainPocketCalculator.pdf with the pdfimages cli util that comes with poppler. And verified that indeed a number of the images would not even open in my MacOS Preview, with a similar error message to what @EngineersNeedArt reports.

I wanted to see if Chrome could display them -- but it looks like Chrome can't actually display raw .jp2 at all, even though it can display PDFs with embedded jp2, including the embedded images! Chrome seems to have no trouble displaying all pages in the original PDF.

EngineersNeedArt commented 1 year ago

jp2test-000.jp2 16-May-2023 14:54 631

Does not load/display in Safari. Downloaded, does not display in Preview: same, "File may be damaged" message.

jp2test-000.jp2.png 16-May-2023 14:53 11615

(Displays only white rectangle.)

jp2test-001.jp2 16-May-2023 14:54 483

Does not load/display in Safari. Downloaded, does not display in Preview: same, "File may be damaged" message.

jp2test-001.jp2.png 16-May-2023 14:53 2294

(Displays only black rectangle.)

jp2test-002.jb2e 16-May-2023 14:54 9389

Not recognized as an image file at all. Forcing Preview to open it anyway results in nothing being opened/displayed.

jp2test-002.jb2e.png

The only one that displays correctly.

MerlijnWajer commented 1 year ago

Just to be clear, seeing just a white and a black rectangle in this case is entirely expected. Thanks for confirming that the JPEG2000s themselves, outside of the PDF, do not open in Preview. That seems to be in line with what @jrochkind just posted.

EngineersNeedArt commented 1 year ago

> Chrome seems to have no trouble displaying all pages in the original PDF.

Perhaps they have their own "JPEG2KLib" and are not using the OS's.

ImageIO on the Mac is typically using the open image libraries (like jpeglib, pnglib, etc.) so I am not sure why Macs would exhibit this problem.

jrochkind commented 1 year ago

Just a random hypothesis that may have nothing to do with anything: as part of exploring this domain I have been looking into embedded ICC color profiles, and discovering that tools disagree about whether, and which kinds of, embedded ICC color profiles jp2k supports.

I am curious to see if/what kind of color profile may be embedded in these images. But I haven't yet figured out how to check that. This tool may do so, but does not offer packages for mac!

This is probably a red herring though.

EngineersNeedArt commented 1 year ago

> I am curious to see if/what kind of color profile may be embedded in these images. But I haven't yet figured out how to check that.

Try exiftool: https://exiftool.org

"ExifTool lets you examine ICC profiles, regardless of whether they are embedded in an image or as stand-alone ".icc" files. It also lets you extract ICC profiles from images and embed them into images.

Extract: exiftool -icc_profile -b -w icc photo.jpg
Embed: exiftool "-icc_profile<=profile.icc" photo.jpg
Examine directly in an image: exiftool -icc_profile:* photo.jpg
Examine an ".icc" file: exiftool profile.icc"

MerlijnWajer commented 1 year ago

I also found some hits online about EXIF data, or ICC profiles in the EXIF data, causing trouble, but the input images do not seem to contain either an ICC profile or EXIF data (see EntertainPocketCalculator_0005.jp2 for example). The images generated in the PDF probably should not contain any EXIF data either (although the embedded ones seem to have a slightly odd dpi value).

Does the original EntertainPocketCalculator_0005.jp2 image open in Preview? I think we could learn a lot if we can find some of the more 'normal' images that do not work in Preview as well. I believe @EngineersNeedArt said that only two of the "SINGLE PAGE PROCESSED JP2 ZIP" opened OK - which ones did open OK?

EngineersNeedArt commented 1 year ago

Decided to look at the XMP metadata using exiftool for the only known good JP2K image in the bunch (the cover image: EntertainPocketCalculator_0000.jp2) and page 5 (known bad image: EntertainPocketCalculator_0005.jp2).

calhoun@Johns-M1 GliderVintage % exiftool -xmp:all /Users/calhoun/Downloads/EntertainPocketCalculator_jp2/EntertainPocketCalculator_0000.jp2 
XMP Toolkit                     : Image::ExifTool 11.88
Bits Per Sample                 : 8
Photometric Interpretation      : RGB Palette
Planar Configuration            : Chunky
Samples Per Pixel               : 1

calhoun@Johns-M1 GliderVintage % exiftool -xmp:all /Users/calhoun/Downloads/EntertainPocketCalculator_jp2/EntertainPocketCalculator_0005.jp2 
XMP Toolkit                     : Image::ExifTool 11.88
Bits Per Sample                 : 1
Photometric Interpretation      : WhiteIsZero
Planar Configuration            : Chunky
Samples Per Pixel               : 1

My small sample-size test suggests 1-bit-per-sample images may be a problem?
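The exiftool output above comes from XMP metadata. As a cross-check, the bit depth can also be read from the JP2 container itself, where the image header (`ihdr`) box inside the `jp2h` box stores it. The following is a minimal sketch under that assumption; the function names `iter_boxes` and `jp2_bit_depth` are invented here, and the parser skips anything it does not understand rather than validating the file.

```python
import struct


def iter_boxes(data, start=0, end=None):
    """Yield (type, payload_start, payload_end) for boxes in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        length, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if length == 1:  # a 64-bit extended length follows the type
            if pos + 16 > end:
                return
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif length == 0:  # box runs to the end of the enclosing data
            length = end - pos
        if length < header or pos + length > end:
            return  # malformed or truncated; stop rather than guess
        yield box_type, pos + header, pos + length
        pos += length


def jp2_bit_depth(data):
    """Return bits per component from the ihdr box, or None if unavailable."""
    for box_type, lo, hi in iter_boxes(data):
        if box_type != b"jp2h":
            continue
        for inner_type, ilo, ihi in iter_boxes(data, lo, hi):
            if inner_type == b"ihdr" and ihi - ilo >= 14:
                _height, _width, _components, bpc = struct.unpack_from(
                    ">IIHB", data, ilo
                )
                if bpc == 255:  # depth varies per component (see bpcc box)
                    return None
                return (bpc & 0x7F) + 1  # low 7 bits hold depth minus one
    return None
```

If the 1-bit hypothesis holds, running `jp2_bit_depth` over the whole zip should return 1 for the pages that fail and 8 for the two that open.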

MerlijnWajer commented 1 year ago

If that is the case, that would be quite disastrous, but I am inclined to agree that this currently seems like a good assumption.

EngineersNeedArt commented 1 year ago

No way to generate 8-bit JPEG2K files? Or does the file size become needlessly too large?

EngineersNeedArt commented 1 year ago

> Does the original EntertainPocketCalculator_0005.jp2 image open in Preview?

No, only the first image (EntertainPocketCalculator_0000.jp2) and the odd one (EntertainPocketCalculator_0210.jp2).

The latter is also an RGB image:

calhoun@Johns-M1 GliderVintage % exiftool -xmp:all /Users/calhoun/Downloads/EntertainPocketCalculator_jp2/EntertainPocketCalculator_0210.jp2 
XMP Toolkit                     : Image::ExifTool 11.88
Bits Per Sample                 : 8
Photometric Interpretation      : RGB Palette
Planar Configuration            : Chunky
Samples Per Pixel               : 1

(edit to fix the file name for the first image)

MerlijnWajer commented 1 year ago

> No way to generate 8-bit JPEG2K files? Or does the file size become needlessly too large?

I will have to investigate.

In general, for this specific PDF, there is a much better way to handle most of these pages: skip MRC entirely and encode each page as a single JBIG2 image, since we can realistically omit the MRC part altogether for monotone images. I just have not gotten round to implementing that (it needs a heuristic based on the image). I will look later today to see what I can do about 8-bit JPEG2000 images from monotone input, but it feels like a bit of a kludge regardless.
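One possible shape for such a heuristic, purely as an illustration: treat a page as monotone when almost none of its grayscale samples fall between near-black and near-white. This is not what archive-pdf-tools does; the function name `looks_bitonal` and both thresholds are arbitrary choices for this sketch.

```python
def looks_bitonal(pixels, tolerance=16, max_midtone_fraction=0.01):
    """Guess whether an 8-bit grayscale image is effectively black and white.

    `pixels` is a 2D list of values in 0..255. A sample counts as a
    midtone if it is more than `tolerance` away from both 0 and 255.
    Both thresholds are hypothetical defaults, not tuned values.
    """
    total = 0
    midtones = 0
    for row in pixels:
        for value in row:
            total += 1
            if tolerance < value < 255 - tolerance:
                midtones += 1
    return total > 0 and midtones / total <= max_midtone_fraction
```

A page that passes this check could go straight to a single JBIG2 image, while anything else keeps the MRC path.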

Once we've confirmed this really is the problem, I wonder if there is a way to get Apple to fix Preview.

EngineersNeedArt commented 1 year ago

Yeah, Apple should have a Radar filed with a few JPEG2000 images attached that show the issue.

jrochkind commented 1 year ago

I only just realized from my email (which shows full names, where the web UI shows only handles) that @EngineersNeedArt is the famous John Calhoun, retired from Apple. Love your work John, thanks for the assistance!

EngineersNeedArt commented 1 year ago

Ha ha, no problem. And I'll reach out to some of the engineers I know within Apple about the JPEG2000 bug.

MerlijnWajer commented 1 year ago

Oh, if you could reach out that would be great!

If I remember correctly, you also wrote on HN that Adobe had similar problems with some of the pages ("Weirdly, Adobe Reader didn't like it either...."). If you're up for it, I'd like to try to figure out if the problem with Adobe is of a similar nature...

If it is, that will probably sway archive.org to change the software, either to always produce 8-bit images in the PDFs or to skip MRC altogether when the input images are 1-bit (I would just need to write that code).
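For reference, the first option amounts to widening each 1-bit sample to a full byte before encoding. A minimal sketch of that step is below; it assumes rows are packed most significant bit first, and it defaults to the WhiteIsZero polarity that exiftool reported for page 5. The function name `expand_1bit_row` is made up for this example.

```python
def expand_1bit_row(packed: bytes, width: int, white_is_zero: bool = True) -> bytes:
    """Expand one packed 1-bit row into 8-bit grayscale samples.

    Assumes bits are packed MSB first. With white_is_zero=True a 0 bit
    becomes 255 (white) and a 1 bit becomes 0 (black).
    """
    out = bytearray(width)
    for x in range(width):
        bit = (packed[x // 8] >> (7 - x % 8)) & 1
        is_white = (bit == 0) if white_is_zero else (bit == 1)
        out[x] = 255 if is_white else 0
    return bytes(out)
```

The cost is that the encoder then sees eight times as many bits per sample, which is where the file-size concern raised earlier in the thread comes from.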

EngineersNeedArt commented 1 year ago

I would suspect Adobe on Mac OS is using ImageIO for its JPEG2K rendering.

EngineersNeedArt commented 1 year ago

I filed a report with Apple and they are looking into the problem with JPEG2000s on their end.