pdfcpu / pdfcpu

A PDF processor written in Go.
http://pdfcpu.io/
Apache License 2.0
6.77k stars 466 forks source link

validate: panic FindTableEntry #401

Open hochhaus opened 2 years ago

hochhaus commented 2 years ago

Following up on #399, the third class of panics was found in one document.

Attached, please find the output from the following commands:

./pdfcpu validate -v doc5.pdf &> doc5-crash-v.log
./pdfcpu validate -vv doc5.pdf &> doc5-crash-vv.log

doc5-crash-v.log doc5-crash-vv.log

These PDFs display correctly in PDF.js, pdfium and evince.

hhrutter commented 2 years ago

Hi there and thanks for considering pdfcpu!

Unfortunately all three categories of validation errors reported are nothing I can fix quickly & remotely.

We'll probably need some sort of workaround (we already have a couple of those in place) if you claim most of PDF viewers don't seem to have a problem with the files in question. If they do render in Acrobat Reader and Mac Preview I'll definitely provide a fix - that's pdfcpu's commitment.

In order to get working on this I will need to get a hand on the files spotting the problem. You can go ahead and send them to my private email which you'll find on my profile page.

hochhaus commented 2 years ago

Thanks for investigating @hhrutter!

If they do render in Acrobat Reader and Mac Preview I'll definitely provide a fix - that's pdfcpu's commitment.

At the moment I only have access to a linux machine so I only verified they were visible in PDF.js (FF viewer), pdfium (chrome viewer) and evince (gnome viewer). I'll verify Acrobat Reader and Mac Preview shortly and report my findings back.

In order to get working on this I will need to get a hand on the files spotting the problem. You can go ahead and send them to my private email which you'll find on my profile page.

Unfortunately, the files contain PII that I'm prohibited from disclosing (even only to you for debugging). Worse yet, they are user submitted so I have very limited information about how they were generated. I can tell the following from document meta data:

$ pdfinfo doc1.pdf
...
Creator:        Adobe Acrobat Pro DC 18.11.20035
...
$ pdfinfo doc3.pdf
... 
Producer:       ... using ABCpdf
...
$ pdfinfo doc5.pdf 
...
Producer:       DocHub 1.2.3
...

I know this isn't very helpful though in terms of your debugging / creating alternate files to debug from. Can you think of any other way I could get you the required information to debug? Do you know of a tool which would zero out / redact all of the text in the document such that I could submit the files to you?

hhrutter commented 2 years ago

Adobe Acrobat possibly but this is not smth that will help here because the tool will write back the xreftable and the resulting file will probably validate fine afterwards.. - Hint :) I'll delete the file once the patches are ready - that's all I can tell you.

hochhaus commented 2 years ago

Adobe Acrobat possibly but this is not smth that will help here because the tool will write back the xreftable

Yeah, I was hoping for some tool which would perform a lower level replacement of the text. For example, replace each character in all of the text strings with a space in order to keep the rest of the PDF structure in place (without a rewrite of the xreftable since all offsets would be unmodified).

and the resulting file will probably validate fine afterwards.. - Hint :)

We could use one of many tools to fix the corrupted files and then pass them to pdfcpu. Doing so means that we can't use pdfcpu without first preprocessing all of the PDF data in another tool. At that point, it makes sense to do all of the PDF manipulation in whatever other tool that is (which isn't our first preference since we prefer golang libraries where possible).

Still, we might not have a choice on this point.

I'll delete the file once the patches are ready - that's all I can tell you.

In this case it is education data that I can't disclose due to US law so I really don't have flexibility here.

Let me explore this a bit more and if I can't get you better debug info we can close all three issues since they are not actionable from your side.

hhrutter commented 2 years ago

I added a useful log message. If you rerun doc5.pdf with the latest commit we may know more what's going on here.

hochhaus commented 2 years ago

Thanks @hhrutter! Please find the updated log attached.

$ ./pdfcpu validate -vv doc5.pdf &> doc5-crash-vv.log

doc5-crash-vv.log

hhrutter commented 2 years ago
Fatal: dict=rootDict required entry=Type missing
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.Dict.Entry

We can fix this, but before just so we know how much further this gets us during validation can you do me a favour, turn on parser logging, go install then rerun very verbose validation and post the unparsed string for obj #63

Enable parser logging like so: init.go: Uncomment line238

hochhaus commented 2 years ago

Thanks for the debug steps. Enabling parser logging and rerunning w/ very verbose validation is not a problem. I'm not sure how to find the unparsed object string for obj 63. Any hints?

TRACE: 2021/12/12 16:15:18 FindTableEntry: obj#:63 gen:0 
hhrutter commented 2 years ago

Have you run the command, I think it is hard to miss with all that output

hochhaus commented 2 years ago

Yes, I ran the command and nothing is obvious to me. Am I grepping on the wrong string?

$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep obj#:63
TRACE: 2021/12/12 16:23:41 FindTableEntry: obj#:63 gen:0 
TRACE: 2021/12/12 16:23:41 FindTableEntry: obj#:63 gen:0 
validation error (obj#:63)
hhrutter commented 2 years ago

Grep on: ParseObject: begin, obj#63

hhrutter commented 2 years ago

There must be smth like:

READ: 2021/12/12 23:25:08 dereferenceObject: begin, dereferencing object 63
READ: 2021/12/12 23:25:08 in use object 63
READ: 2021/12/12 23:25:08 dereferenceObject: dereferencing object 63
READ: 2021/12/12 23:25:08 ParseObject: begin, obj#63, offset:8206148
hochhaus commented 2 years ago

I'm a bit confused since the ParseObject: begin line appears to be a READ log (vs a PARSE log) which should be enabled by default (unless I'm misreading the logging code).

Either way, grepping on both obj#63 or ParseObject: begin returns no results.

$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep obj#63 | wc -l
0
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep "ParseObject: begin" | wc -l
0
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep "ParseObject: value"
PARSE: 2021/12/12 16:47:18 ParseObject: value = Name Object
PARSE: 2021/12/12 16:47:18 ParseObject: value = Array
PARSE: 2021/12/12 16:47:18 ParseObject: value = Array
PARSE: 2021/12/12 16:47:18 ParseObject: value = Name Object

(To ensure that my binary had correctly enabled PARSE logging and my grep was working as expected I also grepped on ParseObject: value)

Is it clear to you what I'm doing incorrectly?

hhrutter commented 2 years ago

Yes the grepping subject are READ log lines, but they are followed by the actual parsing so this definitely is helping.

As soon as you have log lines prefixed with PARSE, parse logging is enabled.

Any section around occurrences involving 63 or #63 are interesting. You may wanna do that manually.

I need to see the raw string for obj#63. If we can locate that, maybe you can check if there is personal confidential data before that at all and thus you could share the whole thing.

As a last resort you could use a Hex Editor and search for "63 0 obj"

hochhaus commented 2 years ago

Thanks for the pointers @hhrutter!

Any section around occurrences involving 63 or #63 are interesting.

I reviewed the output in emacs simply searching for "63" and (without really understanding how to read the output) I don't find anything that looks too promising. One possibility is:

READ: 2021/12/12 21:05:55 xref line 2: <<</Type/XRef/Size 67/Root 63 0 R/Info 64 0 R/ID[<557E5E853DDDDCD247CAD1FAABD9C195><557E5E853DDDDCD247CAD1FAABD9C195>]/W[0 3 0]/Filter/FlateDecode/Length 209>>stream>

Some other lines mention "63" as well but they appear to be parsing this object.

Using the hexeditor seems to give better results:

0003:8E90 |          36  33 20 30 20  6F 62 6A 0D  3C 3C 2F 54 |    63 0 obj.<</T
0003:8EA0 | 79 70 65 2F  43 61 74 61  6C 6F 67 2F  50 61 67 65 | ype/Catalog/Page
0003:8EB0 | 73 20 36 32  20 30 20 52  2F 4D 65 74  61 64 61 74 | s 62 0 R/Metadat
0003:8EC0 | 61 20 36 35  20 30 20 52  3E 3E 0D 65  6E 64 6F 62 | a 65 0 R>>.endob
0003:8ED0 | 6A                                                 | j               

However, this is a bit strange since searching the log file for "/Type/Catalog/Pages" returns zero results. Maybe this code path code not log?

hhrutter commented 2 years ago

No, that's ok actually. The catalog is a dictionary and will be pretty logged generally, in this case though we fail during parsing.

According to the hexdump the catalog looks like:

<<
    Type Catalog
    Pages (62 0 R)
    Metadata (65 0 R)
>>

According to the log the xref stream holds 67 objects but only the first 23 get extracted and that's the reason why the object #63 holding the catalog can't be located.

Still don't know why, need to tinker some more.