mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Page in PDF not loading completely in Firefox, missing image in Chromium #8076

Closed pcworld closed 1 year ago

pcworld commented 7 years ago

Link to PDF file: https://www.personalausweisportal.de/SharedDocs/Downloads/DE/Flyer-und-Broschueren/eID_Broschuere.pdf?__blob=publicationFile&v=1 (sha512sum 62e4a6a96257f219d1f1fc1c644eea86b17c056486a91a159e4cc0085347b3bc3c3d1a4e06eae204e1dce92f507afa5d55c21604584ffa852efb7fb303f9f846)

Configuration:

Steps to reproduce the problem:

  1. Open the PDF in pdf.js and scroll to the last page.

What is the expected behavior? The last page renders as in the following screenshot of the page in Evince: [screenshot: Last page in Evince]

What went wrong?

In Firefox: The page doesn't finish loading (the loading animation doesn't disappear) and is only partially rendered: [screenshot: Last page in pdf.js in Firefox]

In Chromium: The loading animation disappears, but the QR code is blank (white): [screenshot: Last page in pdf.js in Chromium]

The PDF appears to have been created by "Adobe InDesign CS5.5 (7.5.3)".

yurydelendik commented 7 years ago

Hmm, that QR code's size is 16667x16667; it's like an 8000 dpi image.

yurydelendik commented 7 years ago

@pcworld can you check whether you can reduce the size of the image to an acceptable dpi, perhaps 300 dpi? I'm sure other PDF readers and network providers (well, it's only 35k in the PDF) will say thanks :)

pcworld commented 7 years ago

@yurydelendik I am not the creator of this PDF. While the resolution of the image is indeed ridiculous, Evince somehow renders it in only a few seconds.

yurydelendik commented 7 years ago

Yeah, there is probably an optimization we are missing that renders such images at a lower resolution. I'm marking the issue with the performance tag.

Rob--W commented 7 years ago

I have extracted the relevant page: eID_Broschuere-page16.pdf

The culprit is the image mask of size 16667x16667. We should scale down the image.

16 0 obj
<< /BitsPerComponent 1 /DecodeParms << /Columns 16667 /K -1 >> /Filter /CCITTFaxDecode /Height 16667 /ImageMask true /Subtype /Image /Type /XObject /Width 16667 /Length 38972 >>
stream

Implementation details:

To scale down the image, it is probably best to detect large images, slice each one into pieces, transform (i.e. scale) the pieces individually on a canvas, and then paint all the pieces together. An alternative approach is to manually transform the image, i.e. interpreting the pixels of the image yourself and interpolating the pixel values while scaling. The advantage of the latter is that its runtime performance is likely better for arbitrarily large images, and that the logic can be shared by our canvas and SVG backends (since this would then be more of a math problem than a rendering task).
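As a rough illustration of the slicing step in the first approach, the sketch below only computes the tile rectangles for a large image. The function name `planTiles` and the tile size of 4096 are assumptions made for this example; neither comes from PDF.js, and the actual scaling and painting of each tile on a canvas is left out.

```javascript
// Hypothetical helper, not part of PDF.js: split a large image into
// rectangles that are each at most maxTileSize x maxTileSize pixels.
function planTiles(width, height, maxTileSize) {
  const tiles = [];
  for (let y = 0; y < height; y += maxTileSize) {
    // The last row of tiles may be shorter than maxTileSize.
    const h = Math.min(maxTileSize, height - y);
    for (let x = 0; x < width; x += maxTileSize) {
      // The last column of tiles may be narrower than maxTileSize.
      const w = Math.min(maxTileSize, width - x);
      tiles.push({ x, y, w, h });
    }
  }
  return tiles;
}

// The image mask from this issue, with an assumed tile size of 4096.
const maskTiles = planTiles(16667, 16667, 4096);
console.log(maskTiles.length); // 25 tiles (5 per axis)
```

Each rectangle could then be decoded, scaled and painted on its own, so that no single canvas has to hold the full 16667x16667 image.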

Debugging tips:

If you are going to debug this issue with a debugger, consider adding #disableWorker=true to the URL. Otherwise you have to account for the fact that the logic of src/core runs in a Web Worker, while the canvas logic runs on the main thread.

apoorv-mishra commented 7 years ago

Hi, I would like to work on this issue, but I have a couple of questions regarding @Rob--W's comment above:

To scale down the image, it is probably best to detect large images...

  1. What's the benchmark for large images? It would be helpful if you can specify the resolution (in w x h) which may be labelled as large.

An alternative approach is to manually transform the image, i.e. interpreting the pixels of the image yourself and interpolating the pixel values while scaling.

  2. What do you mean by interpreting the pixels and interpolating the pixel values? Are you referring to bilinear interpolation? Would the implementation be similar to https://github.com/mozilla/pdf.js/blob/master/src/core/colorspace.js#L34-L56?

Also, if possible, please guide me with providing appropriate resources which I can study from and go about implementing the solution.

Rob--W commented 7 years ago

Hi @apoorv-mishra

To scale down the image, it is probably best to detect large images...

  1. What's the benchmark for large images? It would be helpful if you can specify the resolution (in w x h) which may be labelled as large.

I don't have a specific value in mind, but I was thinking of big images whose width/height are significantly larger than the actually painted image, to the extent that the native image decoder would be unable to handle them or would require an excessive amount of memory. You can start with a hard-coded value (found experimentally by trying to paint images of that size, and/or by looking at other parts of PDF.js where a maximum image/canvas size is enforced). If needed, we can make the logic more complex later (e.g. deciding whether an image is too large based on the actual size of the painted image in the rendered PDF). But for now, let's keep it simple and choose a reasonable hard-coded threshold.
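To make the "hard-coded threshold" idea concrete, here is a minimal sketch. The constant `MAX_IMAGE_PIXELS`, its value, and both function names are assumptions picked for illustration only; a real threshold would have to be found experimentally as described above.

```javascript
// Assumed threshold for illustration only (not a PDF.js constant).
const MAX_IMAGE_PIXELS = 4096 * 4096;

// Bytes needed to hold the decoded image as 8-bit RGBA.
function estimateRgbaBytes(width, height) {
  return width * height * 4;
}

// Returns 1 when the image is below the threshold, otherwise an integer
// factor by which both dimensions could be divided to get near it.
function computeDownscaleFactor(width, height, maxPixels = MAX_IMAGE_PIXELS) {
  const pixels = width * height;
  if (pixels <= maxPixels) {
    return 1;
  }
  return Math.ceil(Math.sqrt(pixels / maxPixels));
}

console.log(estimateRgbaBytes(16667, 16667)); // 1111155556 bytes, about 1.1 GB
console.log(computeDownscaleFactor(16667, 16667)); // 5
```

With these assumed numbers, the 16667x16667 mask would be reduced by a factor of 5 in each dimension, to roughly 3334x3334 pixels.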

An alternative approach is to manually transform the image, i.e. interpreting the pixels of the image yourself and interpolating the pixel values while scaling.

  2. What do you mean by interpreting the pixels and interpolating the pixel values? Are you referring to bilinear interpolation? Would the implementation be similar to https://github.com/mozilla/pdf.js/blob/master/src/core/colorspace.js#L34-L56?

By interpreting, I mean code that takes the pixel data and does something with it. By interpolating, I mean interpolation in the mathematical sense. That is, take a (large) group of pixels and calculate one pixel value that approximates the appearance of the original set of pixels. Bilinear interpolation is one of the possible ways to do it; you need to investigate the available options and see which one results in a drawing that is the best approximation of the original image. Search for "browser image resize algorithm" in your favorite search engine to learn more about how browsers scale images. The resizeRgbImage function that you linked indeed looks like a good start.
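One simple way to "take a group of pixels and calculate one pixel value" is a box filter (area averaging), sketched below. This is only one of the possible options, it is not bilinear interpolation, and it is not what resizeRgbImage does. The function name `downscaleGrayBox` and the input layout (one byte per pixel, row by row) are assumptions for this example; a 1-bit image mask would first have to be unpacked into that layout.

```javascript
// Hypothetical helper, not part of PDF.js: shrink an 8-bit grayscale image
// by an integer factor, averaging each factor x factor block of pixels.
function downscaleGrayBox(src, width, height, factor) {
  const outW = Math.ceil(width / factor);
  const outH = Math.ceil(height / factor);
  const dest = new Uint8Array(outW * outH);
  for (let oy = 0; oy < outH; oy++) {
    const y0 = oy * factor;
    const y1 = Math.min(y0 + factor, height); // clamp partial blocks at the edge
    for (let ox = 0; ox < outW; ox++) {
      const x0 = ox * factor;
      const x1 = Math.min(x0 + factor, width);
      let sum = 0;
      for (let y = y0; y < y1; y++) {
        for (let x = x0; x < x1; x++) {
          sum += src[y * width + x];
        }
      }
      // One output pixel approximates the whole block.
      dest[oy * outW + ox] = Math.round(sum / ((y1 - y0) * (x1 - x0)));
    }
  }
  return { data: dest, width: outW, height: outH };
}

// A 4x4 image made of four uniform 2x2 blocks, reduced to 2x2.
const srcPixels = new Uint8Array([
  255, 255, 0, 0,
  255, 255, 0, 0,
  0, 0, 255, 255,
  0, 0, 255, 255,
]);
const scaled = downscaleGrayBox(srcPixels, 4, 4, 2);
console.log(Array.from(scaled.data)); // [255, 0, 0, 255]
```

Because this works on plain arrays rather than on a canvas, the same logic could in principle be shared by the canvas and SVG backends.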

Tip: When you link to code online, link to a specific commit instead of a branch (like "master"), because the code in the master branch can change and the line numbers can become different. On GitHub you can get a link to a specific commit by pressing the "y" key. So https://github.com/mozilla/pdf.js/blob/master/src/core/colorspace.js#L34-L56 becomes https://github.com/mozilla/pdf.js/blob/5b5781b45d234666241bf3354c0d390315c31d1a/src/core/colorspace.js#L34-L56

Also, if possible, please guide me with providing appropriate resources which I can study from and go about implementing the solution.

The code that I linked in my previous comments is a good start. At that point, the problem space has already been reduced from "an image in a PDF file" to "an image to be displayed". You don't need specialized knowledge of PDF to implement this.

If you want to go with the first method, you need to know how to work with the canvas API - https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API. The PDF.js code base already has several examples. To implement the alternative approach, you need to know what the bytes of the image data mean. I don't know this off the top of my head, but if you step through the code with a debugger you can probably see some useful information.

Snuffleupagus commented 1 year ago

@calixteman Will this issue also be fixed by PR #16077? The link above seems to be broken, but the document is available (as issue8076.pdf) in the PDF archive I shared a while back.

calixteman commented 1 year ago

Yep, it's one of the files I used (and the PDF can be found in https://github.com/mozilla/pdf.js/issues/8076#issuecomment-314078517). I should add it to the test suite since it takes a specific path (it's a mask): https://github.com/mozilla/pdf.js/blob/d7e4be9cdbe37d4d4d9ada34208820470bcd14ed/src/core/image.js#L374