modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Removing Spaces For No Reason #361

Open KarlGNassar opened 2 months ago

KarlGNassar commented 2 months ago

Is there a way to disable the removal of whitespace?


When I use pdfjs-dist I get this:

    {
        "str": "Exchange income, net",
        "dir": "ltr",
        "width": 98.45999999999992,
        "height": 9,
        "transform": [9, 0, 0, 9, 85.03940000000011, 400.8149999999999],
        "fontName": "g_d0_f1",
        "hasEOL": false
    },
    {
        "str": " ",
        "dir": "ltr",
        "width": 25.51200000000001,
        "height": 0,
        "transform": [9, 0, 0, 9, 183.4994, 400.8149999999999],
        "fontName": "g_d0_f1",
        "hasEOL": false
    },
    {
        "str": "1,246,450",
        "dir": "ltr",
        "width": 40.56300000000019,
        "height": 9,
        "transform": [9, 0, 0, 9, 413.1074000000001, 400.8149999999999],
        "fontName": "g_d0_f3",
        "hasEOL": false
    },

As you can see, there is a whitespace item between the first and last text items.
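To make the difference concrete, here is a minimal sketch that concatenates the `str` fields of items shaped like the pdfjs-dist output above. `joinItems` is a hypothetical helper name, not part of pdfjs-dist, and the items are trimmed to the two fields the sketch reads.

```javascript
// Hypothetical helper (not a pdfjs-dist API): concatenate text items in order,
// adding a newline wherever an item is flagged as ending a line.
function joinItems(items) {
  return items.map((item) => item.str + (item.hasEOL ? "\n" : "")).join("");
}

// The three items from the output above, reduced to `str` and `hasEOL`.
const items = [
  { str: "Exchange income, net", hasEOL: false },
  { str: " ", hasEOL: false },
  { str: "1,246,450", hasEOL: false },
];

console.log(JSON.stringify(joinItems(items)));
// prints "Exchange income, net 1,246,450"
```

The separate `" "` item is what keeps the label and the number from running together.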


But when I use pdf2json I get this:

    {
        "x": 5.065,
        "y": 23.805,
        "w": 11,
        "oc": "#1f12ff",
        "sw": 0.33853125,
        "A": "left",
        "R": [{ "T": "Exchange income, net", "S": 3, "TS": [0, 12, 0, 0] }]
    },
    {
        "x": 25.569,
        "y": 23.805,
        "w": 4.532,
        "oc": "#1f12ff",
        "sw": 0.3125,
        "A": "left",
        "R": [{ "T": "1,246,450", "S": 3, "TS": [0, 12, 0, 0] }]
    },

Is there any configuration option that I'm missing?

KarlGNassar commented 2 months ago

Columns are so far from each other, and a space in a CSV really matters.
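Until there is an answer on the whitespace question, one possible workaround is to rebuild the columns from the coordinates instead of relying on whitespace items. The sketch below is not a pdf2json feature. `textsToCsvRows` and `csvCell` are hypothetical helpers, and the sketch assumes that items on the same visual row share nearly the same `y`, that each text item is exactly one CSV cell, and that `T` is already plain text as in the pdf2json output above.

```javascript
// Hypothetical helper: quote a CSV cell if it contains a comma, quote, or newline.
function csvCell(value) {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Hypothetical helper: group pdf2json-style text items into rows by `y`,
// order each row by `x`, and emit one CSV cell per item.
// Assumption: `yTolerance` of 0.1 page units is enough to absorb rounding noise.
function textsToCsvRows(texts, yTolerance = 0.1) {
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  return rows.map((row) =>
    row.items
      .sort((a, b) => a.x - b.x)
      .map((item) => csvCell(item.R.map((run) => run.T).join("")))
      .join(",")
  );
}

// The two items from the pdf2json output above, reduced to the fields used here.
const texts = [
  { x: 25.569, y: 23.805, R: [{ T: "1,246,450" }] },
  { x: 5.065, y: 23.805, R: [{ T: "Exchange income, net" }] },
];

const csvRows = textsToCsvRows(texts);
console.log(csvRows[0]);
// prints "Exchange income, net","1,246,450"
```

Treating every item as its own cell sidesteps the missing space for this sample, but it would split a label that the PDF stores as several text items, so a real table would likely need a gap threshold on `x` as well.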

KarlGNassar commented 3 weeks ago

You think this would work on a developer? Really? @ummm288