modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Page unit conversion to PDF points #35

Open rkpatel33 opened 9 years ago

rkpatel33 commented 9 years ago

@richardlitt and I are also having a little trouble understanding the 'page unit', the coordinate convention, and how that relates to PDF points (8.5" x 11" = 612 x 792 points). Can you provide a little clarification?

SPlatten commented 7 years ago

To translate the page units to any coordinate system do this:

Take the width and height returned in the parsed pdf object, then work out scale factors according to the units of measure you desire, so if the width of your page in mm is 290 and the height is 297 mm, then:

       scale x = 290 / pdf width
       scale y = 297 / pdf height

Then correct x and y for any text object

      x = text x * scale x
      y = text y * scale y
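
A minimal sketch of the scaling described above, in JavaScript. The helper name `convertPoint` and the shape of its arguments are assumptions made for illustration, not part of pdf2json; the page-unit size 37.188 x 52.625 is the A4 example reported later in this thread, and 210 x 297 mm is the nominal A4 size.

```javascript
// Hypothetical helper (not a pdf2json API): map a coordinate from
// pdf2json page units into whatever target unit you choose.
// pageSize   = page width/height as reported by pdf2json (page units)
// targetSize = the same page's width/height in the unit you want
function convertPoint(point, pageSize, targetSize) {
  const scaleX = targetSize.width / pageSize.width;   // target units per page unit
  const scaleY = targetSize.height / pageSize.height;
  return { x: point.x * scaleX, y: point.y * scaleY };
}

// Example: the centre of an A4 page, converted to millimetres.
const pageUnits = { width: 37.188, height: 52.625 }; // values quoted in this thread
const a4Mm = { width: 210, height: 297 };            // nominal A4 in mm
const centreMm = convertPoint({ x: 18.594, y: 26.3125 }, pageUnits, a4Mm);
console.log(centreMm);
```

Because only the ratio between the two page sizes matters, the same helper works for points, inches, or pixels by swapping `targetSize`.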
matthewerwin commented 7 years ago

@rkpatel33, what @SPlatten wrote is the general case. Since you asked how it relates specifically to 72dpi for pdf2json with scaling to 8.5x11:

//scaled to 72 DPI from whatever resolution the parsed version is in
x72dpi = x * (8.5 * 72) / parsedPdf.formImage.Width;
y72dpi = y * (11 * 72) / parsedPdf.formImage.Pages[0].Height; //assuming all pages are same height
RichardLitt commented 7 years ago

This issue can most likely be closed. AFAIK @rkpatel33 is no longer interested in this issue.

matthewerwin commented 7 years ago

@richardlitt agreed. This is more of a help question than an issue but wanted to provide a concrete example for anyone else that trips across this as I did.

jdwilsonjr commented 6 years ago

Hello gentlemen,

I am keenly interested in this issue and hopeful you are still in the neighborhood.

To wit, I am working with the A4 paper size which measures:

pdf2json is telling me these are the dimensions:

What, pray tell, ARE these "units" called in pdf2json ?

Thank you in advance for your time and attention.

matthewerwin commented 6 years ago

@jdwilsonjr the units are relative to the scale of the page. As the page scales, the units should be scaled by the same factor. In simple terms you are just scaling the positional units by the ratio between the source page height/width and the destination page height/width.

In your case: the destination A4 page at 72dpi is 595 x 842 points (595/72 ~= 8.264 inches, 842/72 ~= 11.69 inches), and the source page as reported by pdf2json is 37.188 x 52.625. You don't need to worry about the DPI of pdf2json here because even if it was 300dpi the x/y numbers would just be double what they'd be at 150dpi. Your only concern is scaling them by the same amount you're scaling the page.

x72dpi = pdf2json.x *  595 / 37.188000; // aka (8.27 * 72) / 37.188
y72dpi = pdf2json.y * 842 / 52.625000; // aka (11.69 * 72) / 52.625
jdwilsonjr commented 6 years ago

@matthewerwin thanks for the response.

Hmmm, so are you saying that the pdf2json "paper size" is (relatively) 37.1875 inches by 52.625 inches and that is always used irrespective of the actual document dpi ?

matthewerwin commented 6 years ago

@jdwilsonjr no. I'm saying the units used in pdf2json cancel out. When you divide the x coordinate by the width of the page, the units cancel just like they do when converting English units to metric units in elementary math.

i.e. converting from metric centimeters to English system inches

(50 cm * 1in) / 2.54cm = 19.685in

Does that clarify why the units used by pdf2json don't matter and you only need to be concerned about how that ratio relates to your desired destination units?

jdwilsonjr commented 6 years ago

Ok. Thanks.

szabbal commented 3 years ago

Hello,

I am writing some tests to verify the size of some pdf files (A5, A4, A3 etc). The width and height returned by pdf2json in case of A4 is 37.185 x 52.62. The A4 paper has the width 595px, and 595 / 37.185 = 16.00. As far as I've seen, for all page sizes the scale is ~16.00. Is there a way to obtain that 16.00 scale factor programmatically, so that I can have something like this in my tests:

pdf2jsonWidthA4 * scale = pdfWidthA4
pdf2jsonHeightA3 * scale = pdfHeightA3
etc

which will tell if the pdfs have the desired dimensions.

Thank you!

matthewerwin commented 3 years ago

@szabbal A4 paper doesn't have a fixed pixel width. Pixels are a product of "resolution" -- pixels are dots and are fundamentally fixed by the DPI of the device which displays/captures them (even if that "device" is virtual/memory/disk storage).

If you want to determine if something is A4 paper then it's a question of inches -- not pixels. You need to check aspect ratio to determine it's "A" paper and not "B" or "C" paper -- then use the width & height in inches to determine if it's A4 or A5, or A6, etc.

Might help you to read this to get the relationship of units mentally sorted out:

https://stackoverflow.com/a/46105049/1513347
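
The two-step check described above (aspect ratio first, then physical size) could be sketched roughly as follows. The function name `classifyASeries`, the tolerances, and the choice to compare in millimetres (ISO 216 defines the A series in mm) are all assumptions for illustration; nothing here comes from pdf2json.

```javascript
// Nominal ISO 216 A-series sizes in millimetres (portrait: width, height).
const A_SERIES_MM = {
  A0: [841, 1189], A1: [594, 841], A2: [420, 594], A3: [297, 420],
  A4: [210, 297], A5: [148, 210], A6: [105, 148],
};

// Hypothetical helper: given a page size in inches, return "A0".."A6" or null.
function classifyASeries(widthIn, heightIn, toleranceMm = 2) {
  // Normalise to portrait and convert inches to millimetres.
  const w = Math.min(widthIn, heightIn) * 25.4;
  const h = Math.max(widthIn, heightIn) * 25.4;

  // Step 1: "A" paper has an aspect ratio of 1 : sqrt(2).
  if (Math.abs(h / w - Math.SQRT2) > 0.02) return null;

  // Step 2: match the physical size against the nominal table.
  for (const [name, [mw, mh]] of Object.entries(A_SERIES_MM)) {
    if (Math.abs(w - mw) <= toleranceMm && Math.abs(h - mh) <= toleranceMm) {
      return name;
    }
  }
  return null;
}

console.log(classifyASeries(8.27, 11.69)); // A4-sized input
console.log(classifyASeries(8.5, 11));     // US Letter, not an A size
```

The tolerances are deliberately loose because sizes obtained by converting from rounded points or inches rarely land exactly on the nominal millimetre values.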

szabbal commented 3 years ago

Hi @matthewerwin, thank you for your reply. Indeed measuring paper width and height makes more sense with inches, thank you for clarifying that. However something is still not clear to me (sorry if this was already asked): I am using pdf2json to determine if some pdf files have the size of A0, A1, A2 etc. For all of my test files pdf2json returns a width and height (in page units):

For all of them the width/height ratio is 0.70.

All of the above dimensions are proportional to the width and height of standard A0-6 papers, and in all cases that proportion is 1/16 (with DPI 72). Consider the following example for A4:

width = (scale * pdf2jsonWidth) / DPI
width = (16 * 37.185) / 72
width = 8.26 - more or less the width of A4 which is 8.3
height = (scale * pdf2jsonHeight) / DPI
height = (16 * 52.62) / 72
height = 11.69 - which is again more or less the height of A4 (11.7 inch)

I would like to have the same results but without that magic number 16. I would like to know where that 16 is coming from and whether I can get it programmatically.

Thank you!
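
The thread does not say where the 16 comes from, so the sketch below does not hard-code it. Instead it estimates the points-per-page-unit factor from one page whose size is already known, then reuses that factor for other measurements. The helper name `estimateScale` is hypothetical, and the figures are the A4 values quoted above (37.185 x 52.62 page units, 595 x 842 points at 72 per inch).

```javascript
// Hypothetical helper: derive "points per page unit" from a reference
// page whose real size in PDF points is already known.
function estimateScale(pageUnits, knownPoints) {
  return knownPoints / pageUnits;
}

// Reference: A4 width, 37.185 page units vs. 595 points (values from this thread).
const scale = estimateScale(37.185, 595);

// Reuse the derived factor for the height and convert to inches (72 points per inch).
const heightPoints = 52.62 * scale;
const heightInches = heightPoints / 72;

console.log(scale.toFixed(2));        // close to the 16 observed in this thread
console.log(heightInches.toFixed(2)); // close to the nominal A4 height in inches
```

This only shows that the factor is consistent between width and height for the quoted A4 numbers; whether it is constant across pdf2json versions is not established here.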