hochhaus opened 2 years ago
Hi there and thanks for considering pdfcpu!
Unfortunately, none of the three categories of validation errors reported is something I can fix quickly and remotely.
We'll probably need some sort of workaround (we already have a couple of those in place), since you report that most PDF viewers don't seem to have a problem with the files in question. If they render in Acrobat Reader and Mac Preview I'll definitely provide a fix - that's pdfcpu's commitment.
To get working on this I will need to get my hands on the files exhibiting the problem. You can go ahead and send them to my private email, which you'll find on my profile page.
Thanks for investigating @hhrutter!
If they render in Acrobat Reader and Mac Preview I'll definitely provide a fix - that's pdfcpu's commitment.
At the moment I only have access to a linux machine so I only verified they were visible in PDF.js (FF viewer), pdfium (chrome viewer) and evince (gnome viewer). I'll verify Acrobat Reader and Mac Preview shortly and report my findings back.
To get working on this I will need to get my hands on the files exhibiting the problem. You can go ahead and send them to my private email, which you'll find on my profile page.
Unfortunately, the files contain PII that I'm prohibited from disclosing (even to you alone, for debugging). Worse yet, they are user-submitted, so I have very limited information about how they were generated. I can tell the following from the document metadata:
$ pdfinfo doc1.pdf
...
Creator: Adobe Acrobat Pro DC 18.11.20035
...
$ pdfinfo doc3.pdf
...
Producer: ... using ABCpdf
...
$ pdfinfo doc5.pdf
...
Producer: DocHub 1.2.3
...
I know this isn't very helpful in terms of your debugging or creating alternate files to debug from. Can you think of any other way I could get you the required information? Do you know of a tool that would zero out / redact all of the text in a document so that I could submit the files to you?
Adobe Acrobat, possibly, but that won't help here because the tool will write back the xref table and the resulting file will probably validate fine afterwards. (Hint!) I'll delete the files once the patches are ready - that's all I can offer.
Adobe Acrobat, possibly, but that won't help here because the tool will write back the xref table
Yeah, I was hoping for a tool that would perform a lower-level replacement of the text. For example, replacing each character in all of the text strings with a space would keep the rest of the PDF structure in place (no rewrite of the xref table, since all offsets would be unmodified).
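For illustration, that idea can be sketched in a few lines of Go (this is a naive sketch, not part of pdfcpu or any real tool): blank out the bytes inside (...) literal strings while keeping the file length, and therefore all xref offsets, unchanged. It deliberately ignores escape sequences, hex strings <...>, and text inside compressed streams, all of which a real redaction tool would have to handle.

```go
package main

import "fmt"

// redactLiterals returns a copy of buf in which every byte inside a (...)
// PDF literal string is replaced by a space. The output has the same length
// as the input, so all byte offsets in the xref table remain valid.
// Naive sketch: escape sequences, hex strings <...>, and compressed streams
// are ignored here.
func redactLiterals(buf []byte) []byte {
	out := make([]byte, len(buf))
	copy(out, buf)
	depth := 0 // literal strings may contain balanced nested parentheses
	for i, b := range buf {
		switch {
		case b == '(':
			if depth > 0 {
				out[i] = ' ' // nested paren is string content
			}
			depth++
		case b == ')' && depth > 0:
			depth--
			if depth > 0 {
				out[i] = ' '
			}
		case depth > 0:
			out[i] = ' '
		}
	}
	return out
}

func main() {
	in := []byte("<</Title (John Doe) /Count 2>>")
	fmt.Printf("%s\n", redactLiterals(in))
	// prints: <</Title (        ) /Count 2>>
}
```

Since the output is byte-for-byte the same length as the input, the xref table keeps pointing at the right offsets without any rewrite.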
and the resulting file will probably validate fine afterwards. (Hint!)
We could use one of many tools to repair the corrupted files and then pass them to pdfcpu. But doing so means we can't use pdfcpu without first preprocessing all of the PDF data in another tool, and at that point it makes sense to do all of the PDF manipulation in that other tool (which isn't our first preference, since we prefer Go libraries where possible).
Still, we might not have a choice on this point.
I'll delete the file once the patches are ready - that's all I can tell you.
In this case it is education data that I can't disclose due to US law so I really don't have flexibility here.
Let me explore this a bit more and if I can't get you better debug info we can close all three issues since they are not actionable from your side.
I added a useful log message. If you rerun doc5.pdf with the latest commit we may learn more about what's going on here.
Thanks @hhrutter! Please find the updated log attached.
$ ./pdfcpu validate -vv doc5.pdf &> doc5-crash-vv.log
Fatal: dict=rootDict required entry=Type missing
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.Dict.Entry
We can fix this, but first - just so we know how much further this gets us during validation - can you do me a favour: turn on parser logging, go install, then rerun very verbose validation and post the unparsed string for obj #63.
Enable parser logging like so: in init.go, uncomment line 238.
Thanks for the debug steps. Enabling parser logging and rerunning w/ very verbose validation is not a problem. I'm not sure how to find the unparsed object string for obj 63. Any hints?
TRACE: 2021/12/12 16:15:18 FindTableEntry: obj#:63 gen:0
Have you run the command? I think it is hard to miss with all that output.
Yes, I ran the command and nothing is obvious to me. Am I grepping on the wrong string?
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep obj#:63
TRACE: 2021/12/12 16:23:41 FindTableEntry: obj#:63 gen:0
TRACE: 2021/12/12 16:23:41 FindTableEntry: obj#:63 gen:0
validation error (obj#:63)
Grep on: ParseObject: begin, obj#63
There must be something like:
READ: 2021/12/12 23:25:08 dereferenceObject: begin, dereferencing object 63
READ: 2021/12/12 23:25:08 in use object 63
READ: 2021/12/12 23:25:08 dereferenceObject: dereferencing object 63
READ: 2021/12/12 23:25:08 ParseObject: begin, obj#63, offset:8206148
I'm a bit confused, since the "ParseObject: begin" line appears to be a READ log (vs a PARSE log), which should be enabled by default (unless I'm misreading the logging code). Either way, grepping on both "obj#63" and "ParseObject: begin" returns no results.
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep obj#63 | wc -l
0
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep "ParseObject: begin" | wc -l
0
$ ./pdfcpu validate -vv doc5.pdf 2>&1 | grep "ParseObject: value"
PARSE: 2021/12/12 16:47:18 ParseObject: value = Name Object
PARSE: 2021/12/12 16:47:18 ParseObject: value = Array
PARSE: 2021/12/12 16:47:18 ParseObject: value = Array
PARSE: 2021/12/12 16:47:18 ParseObject: value = Name Object
(To ensure that my binary had correctly enabled PARSE logging and my grep was working as expected, I also grepped on "ParseObject: value".)
Is it clear to you what I'm doing incorrectly?
Yes, the lines you are grepping for are READ log lines, but they are followed by the actual parsing, so this is definitely helping.
As soon as you have log lines prefixed with PARSE, parse logging is enabled.
Any section around occurrences involving 63 or #63 is interesting. You may want to do that manually.
I need to see the raw string for obj#63. If we can locate that, maybe you can check beforehand whether it contains any personal confidential data at all, and if not you could share the whole thing.
As a last resort you could use a hex editor and search for "63 0 obj".
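If a hex editor isn't handy, the same search can be sketched in Go (the file name doc5.pdf is an assumption): find the raw bytes of a top-level "63 0 obj" ... "endobj" span. Note this only works when the object is stored uncompressed at the top level; an object living inside an object stream is compressed and won't be found this way.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// rawObject returns the raw bytes of a top-level "N G obj ... endobj" span,
// or nil if the marker is not found. Objects stored inside object streams
// are compressed and will not be found by a plain byte search.
func rawObject(buf []byte, marker string) []byte {
	start := bytes.Index(buf, []byte(marker))
	if start < 0 {
		return nil
	}
	rest := buf[start:]
	end := bytes.Index(rest, []byte("endobj"))
	if end < 0 {
		return rest // truncated object: return what we have
	}
	return rest[:end+len("endobj")]
}

func main() {
	buf, err := os.ReadFile("doc5.pdf") // assumed file name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%q\n", rawObject(buf, "63 0 obj"))
}
```

The %q verb escapes unprintable bytes, so the result is safe to paste into an issue comment.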
Thanks for the pointers @hhrutter!
Any section around occurrences involving 63 or #63 is interesting.
I reviewed the output in emacs, simply searching for "63", and (without really understanding how to read the output) I didn't find anything that looks too promising. One possibility is:
READ: 2021/12/12 21:05:55 xref line 2: <<</Type/XRef/Size 67/Root 63 0 R/Info 64 0 R/ID[<557E5E853DDDDCD247CAD1FAABD9C195><557E5E853DDDDCD247CAD1FAABD9C195>]/W[0 3 0]/Filter/FlateDecode/Length 209>>stream>
Some other lines mention "63" as well but they appear to be parsing this object.
Using the hex editor seems to give better results:
0003:8E90 | 36 33 20 30 20 6F 62 6A 0D 3C 3C 2F 54 | 63 0 obj.<</T
0003:8EA0 | 79 70 65 2F 43 61 74 61 6C 6F 67 2F 50 61 67 65 | ype/Catalog/Page
0003:8EB0 | 73 20 36 32 20 30 20 52 2F 4D 65 74 61 64 61 74 | s 62 0 R/Metadat
0003:8EC0 | 61 20 36 35 20 30 20 52 3E 3E 0D 65 6E 64 6F 62 | a 65 0 R>>.endob
0003:8ED0 | 6A | j
However, this is a bit strange, since searching the log file for "/Type/Catalog/Pages" returns zero results. Maybe this code path does not log?
No, that's OK actually. The catalog is a dictionary and would normally be pretty-printed in the log; in this case, though, we fail during parsing.
According to the hexdump the catalog looks like:
<<
Type Catalog
Pages (62 0 R)
Metadata (65 0 R)
>>
According to the log, the xref stream holds 67 objects, but only the first 23 get extracted, and that's the reason why object #63, which holds the catalog, can't be located.
I still don't know why; I need to tinker some more.
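For context, here is a generic sketch of xref-stream entry decoding (not pdfcpu's actual implementation): the /W array gives the byte widths of the three fields of each entry. With /W[0 3 0], as in the log above, each entry is a single 3-byte value; the type field is absent and defaults to 1 (in-use), so those 3 bytes are the object's byte offset. A /Size of 67 therefore implies 67 * 3 = 201 bytes of decoded stream data, and fewer decoded bytes means lost entries.

```go
package main

import "fmt"

// decodeXRefEntries splits decompressed xref-stream data into
// (type, field2, field3) triples according to the /W byte widths.
// Per the PDF spec, a width of 0 for the first field means the type
// defaults to 1 (in-use); other zero-width fields decode to 0.
func decodeXRefEntries(data []byte, w [3]int) [][3]int64 {
	entryLen := w[0] + w[1] + w[2]
	var entries [][3]int64
	for off := 0; off+entryLen <= len(data); off += entryLen {
		e := [3]int64{1, 0, 0} // type defaults to 1 when its width is 0
		p := off
		for f := 0; f < 3; f++ {
			if w[f] == 0 {
				continue
			}
			var v int64
			for _, b := range data[p : p+w[f]] {
				v = v<<8 | int64(b) // big-endian field value
			}
			e[f] = v
			p += w[f]
		}
		entries = append(entries, e)
	}
	return entries
}

func main() {
	// Two fabricated /W[0 3 0] entries: offsets 0x000010 and 0x002000.
	data := []byte{0x00, 0x00, 0x10, 0x00, 0x20, 0x00}
	for _, e := range decodeXRefEntries(data, [3]int{0, 3, 0}) {
		fmt.Println(e) // [type offset 0]
	}
}
```

If the decoded data runs short, the loop simply stops, which would produce exactly the symptom above: fewer entries than /Size promises.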
Following up on #399, the third class of panics was found in one document.
Attached, please find the output from the following commands:
doc5-crash-v.log, doc5-crash-vv.log
These PDFs display correctly in PDF.js, pdfium and evince.