mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

PDF will not view pages because of "Bad encoding in flate stream" error #11207

Closed mdnorman closed 3 years ago

mdnorman commented 5 years ago

Attach (recommended) or Link to PDF file here:

Cannot attach or link - proprietary client docs.

Configuration:

Steps to reproduce the problem:

  1. Open the PDF
  2. View the page that has the problem. Page is blank.

What is the expected behavior? (add screenshot)

Page would be displayed (cannot attach screenshot, but it displays in Chrome PDF viewer)

What went wrong? (add screenshot)

Blank Viewer

The "Bad encoding in flate stream" error is suppressed (not rethrown) at https://github.com/mozilla/pdf.js/blob/master/src/core/worker.js#L488 and https://github.com/mozilla/pdf.js/blob/master/src/core/worker.js#L527, and it occurs at both locations.

Tracing the code, the error is happening in both FlateStream.getBits (https://github.com/mozilla/pdf.js/blob/master/src/core/stream.js#L451) and FlateStream.getCode (https://github.com/mozilla/pdf.js/blob/master/src/core/stream.js#L471)

I was able to take the original compressed block and read it using pako without errors. Pako reads the last byte as a different value than FlateStream.

It only appears to happen for dynamic blocks (with included Huffman table), not fixed blocks. That leads me to believe that the Huffman table decoding is incorrect, but I've been unable to find the issue.
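To check which kind of block a given stream starts with, the first block header can be read directly. The following is a minimal sketch, not code from PDF.js or pako: `describeFirstBlock` is a hypothetical helper name, and it assumes the bytes are a zlib-wrapped stream (2-byte header per RFC 1950, no preset dictionary) as is usual for PDF FlateDecode data. In DEFLATE (RFC 1951) the first bit of a block is BFINAL and the next two bits are BTYPE.

```javascript
// Sketch: report BFINAL/BTYPE of the first DEFLATE block in a zlib-wrapped stream.
// Assumption: `bytes` starts with a 2-byte zlib header and has no preset dictionary.
const BLOCK_TYPES = ["stored", "fixed Huffman", "dynamic Huffman", "reserved (invalid)"];

function describeFirstBlock(bytes) {
  if (bytes.length < 3) {
    throw new Error("Need at least 3 bytes: zlib header plus first block header");
  }
  const cmf = bytes[0];
  const flg = bytes[1];
  // Compression method must be 8 (deflate) and the header must be a multiple of 31.
  if ((cmf & 0x0f) !== 8 || ((cmf << 8) | flg) % 31 !== 0) {
    throw new Error("Not a valid zlib header");
  }
  if (flg & 0x20) {
    throw new Error("Preset dictionary (FDICT) is not handled by this sketch");
  }
  // DEFLATE packs bits starting at the least significant bit of each byte.
  const first = bytes[2];
  const isFinal = (first & 1) === 1;
  const btype = (first >> 1) & 3;
  return { isFinal, btype, name: BLOCK_TYPES[btype] };
}

// Example: 0x78 0x9c is a common zlib header; 0x05 encodes BFINAL=1, BTYPE=2.
console.log(describeFirstBlock(Uint8Array.from([0x78, 0x9c, 0x05])).name);
```

This only inspects the first block; a stream can contain several blocks of different types, so a failure later in the stream would need a full decoder to locate.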

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

Used both hosted viewer and downloaded the source and tried it with gulp server.

timvandermeij commented 5 years ago

Unfortunately there is nothing we can do without a PDF file for reproducing the problem. You could try to find a document that is not proprietary for sharing here, or make a reduced PDF file from the original PDF file by stripping away all objects not relevant for the issue and removing confidential data so it can be shared. In that case we can reopen this.

mdnorman commented 5 years ago

Is there a secure location that I can upload my client's PDF? They likely won't want it to be in the public domain, but if it can be put somewhere securely, we may be able to get it approved.

timvandermeij commented 5 years ago

I would really recommend reducing the PDF file, or creating a new one without confidential data, and uploading it here, because that allows any contributor to help out, which may get the issue resolved sooner. You can remove all pages that are not relevant to reproducing the issue and get rid of unimportant objects. If that is really not possible, we might be able to arrange for you to send it to us in private so we can attempt to reduce the PDF file to the bare minimum.

mdnorman commented 5 years ago

I understand. Unfortunately, I don't know how to break the file. Once I use another tool to manipulate it, the problem goes away.

I'll just have to wait for the client to approve uploading it.

mdnorman commented 5 years ago

@timvandermeij we have permission from the client to send you the documents as long as they are kept confidential.

timvandermeij commented 5 years ago

In that case, let's reopen this. I don't have much experience with reducing PDF files myself, but perhaps there is someone here willing to take care of that so we can get a good reduced test case without confidential data to work with.

/cc @Snuffleupagus @THausherr Would any of you perhaps be willing/able to reduce such a PDF file? If not, that's totally fine, it's just that I'm asking because I don't have much experience with it myself.

THausherr commented 5 years ago

Yes please send to tilman at snafu dot de. However don't expect too much from me, the whole thing could fail due to not being able to decompress that stream. My strategy (I'm sure there are others) is usually to delete objects with NOTEPAD++, open the PDF and resave it (sometimes with Apache PDFBox, sometimes with Adobe Reader). So it depends whether the software feels it must decode that stream internally or not.

mdnorman commented 5 years ago

@THausherr I have sent you an email with the documents. Thanks for taking a look!

THausherr commented 5 years ago

I had a look... I get "uncaught exception: Object" in the console. The file displays without trouble in PDFBox, and if I split it or decode it, I can display it in PDF.js. The file has several revisions.

The content streams are all cut off. The "Michael" file ends like this in the content stream of page 2:

q
  0.0 0.0 0.0 RG
  0.0 0.0 0.0 rg
  416.46375 507.1061 m
  416.46375 38.79117 l
  417.18387 38.79117 l
  417.18387 507.1061 l
  f
Q
q
  0.0 0.0 0.0 RG
  0.0 0.0 0.0 rg
  365.09555 472.0605 m
  365.09555 3 

So I guess some Flate decoders exit with an error and some don't.
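The difference between decoders can be illustrated with Node.js's built-in `zlib` module. This is only a sketch under stated assumptions: it does not use PDF.js's `FlateStream` or the files from this issue, the sample data from `makeSampleStream` is made up to resemble a content stream, and `compareDecoders` is a hypothetical helper. By default `zlib.inflateSync` rejects input that ends early, while passing `finishFlush: zlib.constants.Z_SYNC_FLUSH` makes it return whatever could be decoded.

```javascript
const zlib = require("zlib");

// Hypothetical stand-in for a PDF content stream (not data from the issue).
function makeSampleStream() {
  const lines = [];
  for (let i = 0; i < 200; i++) {
    lines.push(`${(i * 7.13).toFixed(5)} ${(i * 3.71).toFixed(5)} l`);
  }
  return Buffer.from(lines.join("\n"), "latin1");
}

// Run the same bytes through a strict and a lenient inflate call.
function compareDecoders(compressed) {
  let strictError = null;
  try {
    zlib.inflateSync(compressed);
  } catch (err) {
    strictError = err.message;
  }
  // Z_SYNC_FLUSH as finishFlush tolerates input that stops before the end marker.
  const lenient = zlib.inflateSync(compressed, {
    finishFlush: zlib.constants.Z_SYNC_FLUSH,
  });
  return { strictError, lenient };
}

const original = makeSampleStream();
const compressed = zlib.deflateSync(original);
// Simulate a cut-off stream by dropping the second half of the compressed bytes.
const truncated = compressed.subarray(0, Math.floor(compressed.length / 2));
const result = compareDecoders(truncated);

console.log("strict decoder:", result.strictError ?? "no error");
console.log("lenient decoder recovered", result.lenient.length, "of", original.length, "bytes");
```

A viewer built on the lenient behaviour would show the content that precedes the cut, while a strict one would fail on the same stream. Note that the lenient call can still throw on input that is corrupt rather than merely truncated.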

Snuffleupagus commented 4 years ago

Given that it's apparently not been possible to produce a reduced test-case for this issue (since that can be really difficult sometimes), it unfortunately doesn't seem meaningful to keep this issue open given that it contains no actionable information. /cc @timvandermeij

mdnorman commented 4 years ago

Please don't close this issue. This is still a bug, and there are at least two PDFs with the problem. The PDF is valid because the pages appear in many different PDF viewers.

I can provide the PDFs to someone else who have the ability to work on the problem.

Snuffleupagus commented 4 years ago

> Please don't close this issue.

Sorry, but it's unfortunately not reasonable to keep non-actionable issues open perpetually. (In that case, the bug tracker would quickly get overrun with issues that are impossible to ever close.)

> This is still a bug, and there are at least two PDFs with the problem.

No one is saying that there isn't a problem here, just that it's unfortunately impossible to fix it given the information currently available.

> The PDF is valid because the pages appear in many different PDF viewers.

Please note: Unfortunately, just because other PDF viewers can open the document does not at all imply that it's a valid one. Most PDF viewers, including the PDF.js project, have had to implement a lot of code to deal specifically with corrupt documents.

Furthermore, the information in https://github.com/mozilla/pdf.js/issues/11207#issuecomment-545597084 would indicate that the document is in fact corrupt.

> I can provide the PDFs to someone else who have the ability to work on the problem.

Given that this is an open source project, with most people contributing in their spare time, the only reasonable way[1] to get bugs fixed is generally by providing publicly available test-cases. (Also, keep in mind that a patch would normally be required to include tests as well.)

Edit: The "have the ability to work on the problem." part may be difficult for someone to guarantee upfront, without having seen the document in question.


[1] If you're using the PDF.js library in commercial setting, you may also consider hiring/paying someone to help you write a patch to address this issue.

THausherr commented 4 years ago

I have deleted the two confidential files.

timvandermeij commented 4 years ago

I mostly agree with https://github.com/mozilla/pdf.js/issues/11207#issuecomment-552983536 and think the only reasonable way forward is to have a public test file here so everyone can work on this. Is it really not possible to create a document with non-sensitive/dummy information, using the same PDF generator, that also shows the problem?

mdnorman commented 4 years ago

Unfortunately, we (both us and our client) are unaware of what software created the PDF.

mdnorman commented 4 years ago

Based on the PDF, the Content Creator is Paychex MMS PDF Creator v1.1.0 and the Encoding Software is Adobe Acrobat Standard DC 15.6.30503 (https://www.adobe.com/devnet-docs/acrobatetk/tools/ReleaseNotesDC/classic/dcclassic15.006august2019qfe.html), which is pretty old (2015).

peterrobinson commented 4 years ago

I have a file with this issue too. I attach it here. The pdf was created by dragging a single page from a multipage pdf. It opens in all the viewers I can find. The error comes when I attempt to use pdf.js getTextContent, so:

    var loadingTask = this.pdfjsLib.getDocument({data: text});
    loadingTask.promise.then(function(pdf) {
        pdf.getPage(1).then(function(page) {
            page.getTextContent().then(function(textContent) {
                // don't get here with error bad encoding in flate stream
            });
        });
    });

WBP1-451 -P1.pdf
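The snippet above has no rejection handlers, so a failure in any step is easy to miss. A minimal sketch of the same call sequence with errors surfaced to the caller follows. `extractPageText` and `tryExtractPageText` are hypothetical helper names, and `pdfjsLib` is passed in as a parameter so the sketch does not assume how the library was loaded. It relies only on `getDocument(...).promise`, `getPage`, and `getTextContent` as used above, plus the `items`/`str` shape of the returned text content. Whether this particular error reaches the caller as a rejection is an assumption; the first comment in this thread notes that the worker suppresses it in some code paths.

```javascript
// Sketch: the same steps as the snippet above, written with async/await.
async function extractPageText(pdfjsLib, data, pageNumber = 1) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const page = await pdf.getPage(pageNumber);
  const textContent = await page.getTextContent();
  // Items without a `str` field (if any) contribute an empty string.
  return textContent.items.map((item) => item.str ?? "").join(" ");
}

// Wrap the call so a rejected promise is reported instead of silently dropped.
async function tryExtractPageText(pdfjsLib, data, pageNumber = 1) {
  try {
    return { ok: true, text: await extractPageText(pdfjsLib, data, pageNumber) };
  } catch (err) {
    return { ok: false, message: err && err.message ? err.message : String(err) };
  }
}
```

A caller could then log `result.message` when `result.ok` is false, which makes it clearer at which point text extraction gave up.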

Snuffleupagus commented 4 years ago

Unfortunately https://github.com/mozilla/pdf.js/issues/11207#issuecomment-609021055 doesn't seem relevant to this issue, since that file (and its text-layer) renders just fine in e.g. the master version of the PDF.js library; most likely an older/out-dated PDF.js version was used to test with.

For future reference: Please note that it's always recommended to open a new issue when you encounter a PDF document that doesn't render/work correctly, since it's easy to mark it as a duplicate if that turns out to be the case. However, having potentially different documents reported in the same issue makes tracking things more difficult.

peterrobinson commented 4 years ago

The issue is not rendering: it is the error in getTextContent which is the problem. There are other problems with pdf.js. See the question at stackoverflow. Good point about the possibly out-of-date pdf.js. I'll check that.

Snuffleupagus commented 4 years ago

Please direct any additional follow-up/questions regarding https://github.com/mozilla/pdf.js/issues/11207#issuecomment-609021055 to a new issue, since it doesn't seem relevant to this one; thank you!

Snuffleupagus commented 3 years ago

Note that the comments starting at https://github.com/mozilla/pdf.js/issues/11207#issuecomment-609021055 are completely irrelevant to this issue.


Given that this issue is still effectively not actionable without a publicly available test-case, and that the points in https://github.com/mozilla/pdf.js/issues/11207#issuecomment-552983536 still apply (more than a year later), I'd suggest closing this issue as INCOMPLETE for now.