modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

'w' doesn't make sense #136

Open SPlatten opened 7 years ago

SPlatten commented 7 years ago

Looking at a specific text object, the width makes no sense at all. The width returned for the text:

LINE%20START

is 55.044. Why, and how? The font size is 10pt and the page width is only 37.188, so how is 'w' calculated?

aeyrium commented 6 years ago

I also would like to understand the relationship between x and y position values and the width units for text. This would be particularly useful in determining when to merge text. For example, x + w should give me the x for the adjacent text.

RomainHautefeuille commented 6 years ago

I ended up using text.w/2 instead of just text.w to get text widths that are consistent with the coordinates of other fills and lines. So far it's working.

mattsoftware commented 4 years ago

Has anyone worked this out yet? I too would like the 'width' of the text component in a sane format so I can build a bounding box around the text object.

wvanrensburg-zywave commented 2 years ago

I managed to figure this out using some PDFs I created with characters printed at random positions on the page.

'w' appears to be in points. x, y, page width and page height are all in Page Units, but w, for some reason, is built up in points.

It also occurred to me, more or less by chance, that converting Page Units to points is simply a matter of multiplying Page Units by 16.

I tested this on a few random PDFs, and it turned out to be accurate.
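
As a rough illustration of that observation, here is a minimal TypeScript sketch. The factor of 16 and the claim that w is in points are empirical findings from this thread, not documented behaviour, and the TextItem shape and helper names are hypothetical. With the numbers from the original question, a page width of 37.188 Page Units works out to about 595 points (roughly A4 width), and a w of 55.044 points works out to about 3.44 Page Units.

```typescript
// Assumption from this thread (not from the pdf2json docs):
// 1 Page Unit = 16 points, and `w` is reported in points.
const POINTS_PER_PAGE_UNIT = 16;

// Hypothetical minimal shape: only the fields discussed in this thread.
interface TextItem {
  x: number; // left edge, in Page Units
  y: number; // vertical position, in Page Units
  w: number; // width, in points (per the observation above)
}

// Convert a length in points to Page Units.
function pointsToPageUnits(points: number): number {
  return points / POINTS_PER_PAGE_UNIT;
}

// Convert a length in Page Units to points.
function pageUnitsToPoints(units: number): number {
  return units * POINTS_PER_PAGE_UNIT;
}

// Right edge of a text item, expressed in Page Units so that it can be
// compared directly with x values of other items, fills and lines.
function rightEdge(item: TextItem): number {
  return item.x + pointsToPageUnits(item.w);
}
```

If the assumption holds, rightEdge gives the value to use wherever x + w was expected to work directly.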

austenstrine commented 8 months ago

This appears to be correct. It worked for me too.
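
For the merging use case raised earlier in the thread, a sketch along these lines might work, assuming the factor of 16 holds. The TextRun shape, the isAdjacent helper and the tolerance value are all hypothetical and would need tuning against real documents.

```typescript
// Assumption from this thread: `w` is in points and 1 Page Unit = 16 points.
const PAGE_UNIT_IN_POINTS = 16;

// Hypothetical minimal shape: only the fields discussed in this thread.
interface TextRun {
  x: number; // left edge, in Page Units
  y: number; // vertical position, in Page Units
  w: number; // width, in points
}

// Returns true when `next` starts where `prev` ends on the same line,
// within `tolerance` Page Units. The tolerance is a guess to be tuned.
function isAdjacent(prev: TextRun, next: TextRun, tolerance: number = 0.1): boolean {
  const sameLine = Math.abs(prev.y - next.y) <= tolerance;
  const prevRight = prev.x + prev.w / PAGE_UNIT_IN_POINTS;
  return sameLine && Math.abs(next.x - prevRight) <= tolerance;
}

// Usage: two runs on the same line, the second starting near the first's right edge.
const first: TextRun = { x: 1, y: 5, w: 32 };
const second: TextRun = { x: 3.05, y: 5, w: 16 };
const shouldMerge = isAdjacent(first, second);
```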