modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

some fields don't have coordinates and idk why #369

Open terrafrost opened 1 month ago

Consider this PDF:

3fields.pdf

It clearly has 3x fields at very specific locations:

Screenshot 2024-10-02 010102

But only the first field actually has x, y coordinates when I run pdf2json -f 3fields.pdf:

{"Transcoder":"pdf2json@3.1.4 [https://github.com/modesty/pdf2json]","Meta":{"PDFFormatVersion":"1.6","IsAcroFormPresent":true,"IsXFAPresent":false,"Creator":"pdftk 1.45 - www.pdftk.com","Producer":"itext-paulo-155 (itextpdf.sf.net-lowagie.com)","CreationDate":"D:20231222121749-06'00'","ModDate":"D:20241002005743-06'00'","Metadata":{"xmp:createdate":"2023-12-22T12:17:49-06:00","xmp:creatortool":"pdftk 1.45 - www.pdftk.com","xmp:modifydate":"2024-10-02T00:57:43-05:00","xmp:metadatadate":"2024-10-01T07:50:35-05:00","pdf:producer":"itext-paulo-155 (itextpdf.sf.net-lowagie.com)","dc:format":"application/pdf","xmpmm:documentid":"uuid:db2d6562-396f-4f07-a769-f3eff65a8942","xmpmm:instanceid":"uuid:41449369-91ef-421e-8229-8388c461a85e","adhocwf:state":"1","adhocwf:version":"1.1"}},"Pages":[{"Width":38.25,"Height":49.5,"HLines":[],"VLines":[],"Fills":[],"Texts":[],"Fields":[{"style":48,"T":{"Name":"alpha","TypeInfo":{}},"id":{"Id":"incomeName1","EN":0},"TI":0,"AM":0,"x":12.834,"y":3.287,"w":20.934,"h":1.363},{"style":48,"T":{"Name":"alpha","TypeInfo":{}},"id":{"Id":"PatientsIncomeSS","EN":0},"TI":1,"AM":0,"x":null,"y":null,"w":null,"h":0.833},{"style":48,"T":{"Name":"alpha","TypeInfo":{}},"id":{"Id":"SSALetterYear","EN":0},"TI":2,"AM":0,"x":null,"y":null,"w":null,"h":0.833}],"Boxsets":[]}]}

Here's the most relevant part, formatted:

      "Fields": [
        {
          "style": 48,
          "T": {
            "Name": "alpha",
            "TypeInfo": {}
          },
          "id": {
            "Id": "incomeName1",
            "EN": 0
          },
          "TI": 0,
          "AM": 0,
          "x": 12.834,
          "y": 3.287,
          "w": 20.934,
          "h": 1.363
        },
        {
          "style": 48,
          "T": {
            "Name": "alpha",
            "TypeInfo": {}
          },
          "id": {
            "Id": "PatientsIncomeSS",
            "EN": 0
          },
          "TI": 1,
          "AM": 0,
          "x": null,
          "y": null,
          "w": null,
          "h": 0.833
        },
        {
          "style": 48,
          "T": {
            "Name": "alpha",
            "TypeInfo": {}
          },
          "id": {
            "Id": "SSALetterYear",
            "EN": 0
          },
          "TI": 2,
          "AM": 0,
          "x": null,
          "y": null,
          "w": null,
          "h": 0.833
        }
      ],

Note how x, y and w are null for PatientsIncomeSS and SSALetterYear.
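
For anyone who wants to check other PDFs for the same symptom, here is a minimal sketch of how the affected fields could be listed automatically. It only walks the already-parsed JSON (the Pages and Fields keys seen in the output above) and does not call any pdf2json API; the function name findFieldsMissingGeometry and the shape of its return value are made up for illustration.

```javascript
// Sketch: list form fields whose geometry is incomplete in pdf2json output.
// `output` is the parsed JSON object (e.g. the result of JSON.parse on the
// text pasted above). The function name and return shape are hypothetical.
function findFieldsMissingGeometry(output) {
  const missing = [];
  const pages = Array.isArray(output.Pages) ? output.Pages : [];

  pages.forEach((page, pageIndex) => {
    const fields = Array.isArray(page.Fields) ? page.Fields : [];
    for (const field of fields) {
      // Number.isFinite does no coercion, so null, undefined and NaN
      // are all reported as missing.
      const nullKeys = ["x", "y", "w", "h"].filter(
        (key) => !Number.isFinite(field[key])
      );
      if (nullKeys.length > 0) {
        missing.push({
          page: pageIndex,
          id: field.id ? field.id.Id : undefined,
          nullKeys,
        });
      }
    }
  });

  return missing;
}

// Trimmed-down copy of the Fields array from the output above.
const sample = {
  Pages: [
    {
      Fields: [
        { id: { Id: "incomeName1", EN: 0 }, x: 12.834, y: 3.287, w: 20.934, h: 1.363 },
        { id: { Id: "PatientsIncomeSS", EN: 0 }, x: null, y: null, w: null, h: 0.833 },
        { id: { Id: "SSALetterYear", EN: 0 }, x: null, y: null, w: null, h: 0.833 },
      ],
    },
  ],
};

console.log(findFieldsMissingGeometry(sample));
```

Run against the sample, this reports PatientsIncomeSS and SSALetterYear with x, y and w missing, and leaves incomeName1 out.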

Any ideas why this is? Is there something I can do differently to make these 2x fields show x, y and w? Or is this a bug in pdf2json for which no workaround exists?

I'm running pdf2json 3.1.4.