modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

fields with periods are truncated #324

Open terrafrost opened 8 months ago

terrafrost commented 8 months ago

I have a PDF with just one form field on it, named "xxx.yyy". When I run pdf2json 3.0.5 on the PDF, it reports that the only field is "yyy".

test.pdf demonstrates the problem.

Here's what Adobe Acrobat Pro 2020 shows:

[screenshot of the form field in Adobe Acrobat Pro 2020]

pdftk 2.02 also finds "xxx.yyy" when I run pdftk test.pdf dump_data_fields:

FieldType: Text
FieldName: xxx.yyy
FieldFlags: 0
FieldJustification: Left

Unfortunately, pdftk doesn't return the coordinates whereas pdf2json does.

According to qpdf test.pdf --json, the field's alternativename, fullname and mappingname are all "xxx.yyy", whereas its partialname is "yyy", so maybe that's the issue?

terrafrost commented 8 months ago

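So I used qpdf's QDF mode (qpdf test.pdf --qdf test.qdf) to dig into this further. I guess the issue is that the dots are not stored in the field's own name: each dot-separated part is the /T of a separate object, and the objects are linked through /Parent. Here is the field itself: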

%% Object stream: object 7, index 2; original object ID: 24
<<
  /DA (/Helv 12 Tf 0 g)
  /F 4
  /FT /Tx
  /MK <<
  >>
  /P 21 0 R
  /Parent 17 0 R
  /Rect [
    190.784
    658.903
    340.784
    680.903
  ]
  /Subtype /Widget
  /T (yyy)
  /Type /Annot
>>

So if you look at the /T entry in isolation you get yyy. The xxx comes from the parent object referenced by /Parent 17 0 R:

%% Object stream: object 17, index 0; original object ID: 10
<<
  /Kids [
    7 0 R
  ]
  /T (xxx)
>>

So I guess what pdf2json needs to do is follow the /Parent references from the field upward until it reaches an object with no /Parent, and prepend each ancestor's /T to the field's own /T, with dots separating the parts.
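
To make the idea concrete, here is a rough sketch of that parent walk in TypeScript. It is not pdf2json's actual code or API: the FieldDict shape, the objects map and the fullyQualifiedName function are all made up for illustration, and the map just mirrors the two objects from the QDF dump above. I'm also assuming that an ancestor without a /T contributes nothing to the name.

```typescript
// Hypothetical, simplified model of a PDF field dictionary: only the two
// entries that matter for name resolution are kept.
interface FieldDict {
  T?: string;      // partial field name (/T)
  parent?: number; // object number referenced by /Parent, if any
}

// Walk the /Parent chain from a field up to the root, collecting each /T,
// and join the parts with periods. `seen` guards against malformed files
// whose /Parent references form a cycle.
function fullyQualifiedName(
  objects: Map<number, FieldDict>,
  objNum: number
): string {
  const parts: string[] = [];
  const seen = new Set<number>();
  let current: number | undefined = objNum;
  while (current !== undefined && !seen.has(current)) {
    seen.add(current);
    const dict = objects.get(current);
    if (dict === undefined) break; // dangling reference: stop here
    if (dict.T !== undefined) parts.unshift(dict.T); // ancestors go in front
    current = dict.parent;
  }
  return parts.join(".");
}

// The two objects from the QDF dump above (assumed mapping, for illustration).
const objects = new Map<number, FieldDict>([
  [7, { T: "yyy", parent: 17 }],
  [17, { T: "xxx" }],
]);

console.log(fullyQualifiedName(objects, 7)); // prints "xxx.yyy"
```

A loop is used instead of recursion so that a deeply nested or cyclic /Parent chain can't blow the stack. A real fix would of course have to read /T and /Parent from whatever object model pdf2json uses internally.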