smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Coordinates unit of measure and reverse method of getDataTm #548

Open luigibriganti opened 2 years ago

luigibriganti commented 2 years ago

Hi, great work! I have two questions: 1) what is the unit of measure of the coordinates collected by getDataTm? 2) is there a method to find a portion of text given its coordinates?

Thanks!

k00ni commented 2 years ago

@izabala or @Connum might have some insight here.

izabala commented 2 years ago

Hi,

  1. The answer is not a simple one. If you look at page 116 of Document management - Portable document format - Part 1: PDF 1.7, it says:

    The coordinates of text shall be specified in text space. The transformation from text space to user space shall be defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

And on page 249 it says:

    9.4.2 Text-Positioning Operators

    Text space is the coordinate system in which text is shown. It shall be defined by the text matrix, Tm, and the text state parameters Tfs, Th, and Trise, which together shall determine the transformation from text space to user space. Specifically, the origin of the first glyph shown by a text-showing operator shall be placed at the origin of text space. If text space has been translated, scaled, or rotated, then the position, size, or orientation of the glyph in user space shall be correspondingly altered.

So the unit depends on many variables, and the values also depend on where the (0,0) point is. What I usually do is use getDataTm to parse one example of the PDF files that I will use. It returns a list of items, each with 6 numbers and the text. The last 2 numbers are the X,Y position of that text. (The first 4 numbers have to do with the scaling, rotation and skew of the text.) If the page has text that is not on a horizontal line, for example vertical text, you also have to look at the first 4 numbers. With the last 2 numbers you should have an approximation of where the text is.
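As a rough illustration of reading those 6 numbers, here is a minimal sketch. It does not call the library: the sample array is made up and only mimics the shape getDataTm returns (a matrix of 6 values plus the text), and `describeTmItem` and `pointsToMm` are hypothetical helpers, not part of PdfParser. The millimetre conversion is only valid under the assumption that the reported coordinates are in default PDF user space (1 unit = 1/72 inch) with no extra scaling applied by the page, which you would need to check against your own files.

```php
<?php
// Made-up sample data in the shape returned by getDataTm():
// each item is [[a, b, c, d, x, y], text].
// With the library installed, real data would come from something like:
//   $parser = new \Smalot\PdfParser\Parser();
//   $items = $parser->parseFile('sample.pdf')->getPages()[0]->getDataTm();
$items = [
    [['1', '0', '0', '1', '72', '720'], 'Horizontal title'],
    [['0', '1', '-1', '0', '36', '144'], 'Rotated side note'],
];

/**
 * Hypothetical helper (not part of PdfParser): splits one item into
 * position, scale and rotation.
 */
function describeTmItem(array $item): array
{
    [$a, $b, $c, $d, $x, $y] = array_map('floatval', $item[0]);

    return [
        'text' => $item[1],
        'x' => $x,                                // 5th number: X position
        'y' => $y,                                // 6th number: Y position
        'scaleX' => sqrt($a * $a + $b * $b),      // length of the first matrix row
        'scaleY' => sqrt($c * $c + $d * $d),      // length of the second matrix row
        'rotationDeg' => rad2deg(atan2($b, $a)),  // 0 for horizontal text
    ];
}

/**
 * Hypothetical helper. ASSUMPTION: the value is in default user space
 * units of 1/72 inch; this does not hold if the page rescales its content.
 */
function pointsToMm(float $points): float
{
    return $points / 72 * 25.4;
}

$described = array_map('describeTmItem', $items);
foreach ($described as $row) {
    printf(
        "%-20s x=%.2f y=%.2f rotation=%.1f (x is about %.1f mm)\n",
        $row['text'],
        $row['x'],
        $row['y'],
        $row['rotationDeg'],
        pointsToMm($row['x'])
    );
}
```

A non-zero rotation, or a scale far from 1, is the signal that the X,Y pair alone is not enough to locate the text.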

  2. Yes, you can use getTextXY, which gets the text data around the given coordinates (X,Y). Here is the description of the getTextXY method:
    /**
     * Gets text data that are around the given coordinates (X,Y)
     *
     * If the text is near the given coordinates (X,Y) (or the TM info),
     * the text is returned. The extracted data returned by getDataTm can be used to see
     * where the coordinates of a given text are, using the TM info for it.
     *
     * @param float $x      The X value of the coordinate to search for. if null
     *                      just the Y value is considered (same Row)
     * @param float $y      The Y value of the coordinate to search for. if null
     *                      just the X value is considered (same column)
     * @param float $xError The value less or more to consider an X to be "near"
     * @param float $yError The value less or more to consider an Y to be "near"
     *
     * @return array An array of text that are near the given coordinates. If no text
     *               "near" the x,y coordinate, an empty array is returned. If Both, x
     *               and y coordinates are null, null is returned.
     */

    So after you use getDataTm to see approximately where the text should be, you can use getTextXY to extract the text from other files that have the "same" format, using $xError and $yError to define how close to the coordinates a piece of text must be to count as a match. This is important because the text is in boxes, so longer or shorter text changes the coordinates that you get with getDataTm.
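To make the "near enough" idea concrete, here is a small sketch of tolerance matching over data in the getDataTm shape. It is not the library's implementation: `findTextNear` is a hypothetical stand-in written for this example, and the sample items are invented. With the library installed, the equivalent lookup on a real page would be along the lines of `$page->getTextXY(300, 720, 5, 2)`.

```php
<?php
// Made-up sample data in the getDataTm() shape: [[a, b, c, d, x, y], text].
$items = [
    [['1', '0', '0', '1', '72', '720'], 'Invoice number'],
    [['1', '0', '0', '1', '300', '721.5'], 'INV-0042'],
    [['1', '0', '0', '1', '72', '650'], 'Customer'],
];

/**
 * Hypothetical stand-in for the matching idea behind getTextXY(); this is
 * NOT the library's code. A null coordinate means "do not filter on this
 * axis". Returning an empty array when both are null is a choice made for
 * this sketch.
 */
function findTextNear(array $items, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    if (null === $x && null === $y) {
        return [];
    }

    $matches = [];
    foreach ($items as $item) {
        $itemX = (float) $item[0][4];
        $itemY = (float) $item[0][5];

        // An axis passes if it is not filtered, or if it is within tolerance.
        $xOk = null === $x || abs($itemX - $x) <= $xError;
        $yOk = null === $y || abs($itemY - $y) <= $yError;

        if ($xOk && $yOk) {
            $matches[] = $item;
        }
    }

    return $matches;
}

// Everything on the same row as "Invoice number" (y within 2 of 720), any x.
$row = findTextNear($items, null, 720.0, 0.0, 2.0);

// A tight box around one position learned from a sample file.
$cell = findTextNear($items, 300.0, 720.0, 5.0, 2.0);

foreach ($cell as $match) {
    echo $match[1], "\n";
}
```

The tolerances do the real work here: too small and a slightly longer or shorter value in another file falls outside the box, too large and neighbouring fields start to match as well.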

I hope this could help you!