Open luigibriganti opened 2 years ago
@izabala or @Connum might have some insight here.
Hi,
The PDF specification says:
The coordinates of text shall be specified in text space. The transformation from text space to user space shall be defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").
And on page 249 it says:
9.4.2 Text-Positioning Operators Text space is the coordinate system in which text is shown. It shall be defined by the text matrix, Tm, and the text state parameters Tfs, Th, and Trise, which together shall determine the transformation from text space to user space. Specifically, the origin of the first glyph shown by a text-showing operator shall be placed at the origin of text space. If text space has been translated, scaled, or rotated, then the position, size, or orientation of the glyph in user space shall be correspondingly altered.
So the unit depends on many variables, and the values also depend on where the (0,0) point is. What I usually do is run getDataTm on one example of the PDF files I will be working with. It returns a list of items, each with six numbers and the text. The last two numbers are the X,Y position of that text. (The first four numbers describe the scaling, rotation, and skew of the text.) If the page has text that is not on a horizontal line, for example vertical text, you also have to look at the first four numbers. With the last two numbers you should have an approximation of where the text is.
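As a rough illustration of reading those six numbers, here is a minimal sketch. It assumes each item has the shape `[ [a, b, c, d, e, f], text ]`; the sample values are made up, and `describeTmItem` is a hypothetical helper, not part of the library. With a real file the array would come from calling getDataTm on a page instead of being written by hand.

```php
<?php
// Hypothetical sample shaped like the list described above:
// each item is [ [a, b, c, d, e, f], text ]. All values are made up.
$dataTm = [
    [['1', '0', '0', '1', '201.96', '720.68'], 'Document title'],
    [['1', '0', '0', '1', '70.8', '673.64'], 'Invoice number'],
    [['0', '1', '-1', '0', '550.2', '300'], 'Vertical note'],
];

/**
 * Hypothetical helper (not a library function): summarises one item.
 */
function describeTmItem(array $item): array
{
    // The numbers may arrive as strings, so convert them to floats first.
    [$a, $b, $c, $d, $e, $f] = array_map('floatval', $item[0]);

    return [
        'text' => $item[1],
        // The last two numbers are the X,Y position.
        'x' => $e,
        'y' => $f,
        // The text baseline points along (a, b): it runs left to right
        // when b is 0 and a is positive.
        'horizontal' => abs($b) < 1e-6 && $a > 0,
    ];
}

foreach ($dataTm as $item) {
    $info = describeTmItem($item);
    printf(
        "%-16s x=%.2f y=%.2f %s\n",
        $info['text'],
        $info['x'],
        $info['y'],
        $info['horizontal'] ? 'horizontal' : 'rotated or skewed'
    );
}
```

For horizontal text the last two numbers are usually enough; for the third item the first four numbers show that the baseline is rotated, so its position has to be read with that in mind.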
For your second question there is getTextXY. Its doc comment reads:
/**
* Gets text data that are around the given coordinates (X,Y)
*
* If the text is in near the given coordinates (X,Y) (or the TM info),
* the text is returned. The extractedData return by getDataTm, could be use to see
* where is the coordinates of a given text, using the TM info for it.
*
* @param float $x The X value of the coordinate to search for. if null
* just the Y value is considered (same Row)
* @param float $y The Y value of the coordinate to search for
* just the X value is considered (same column)
* @param float $xError The value less or more to consider an X to be "near"
* @param float $yError The value less or more to consider an Y to be "near"
*
* @return array An array of text that are near the given coordinates. If no text
* "near" the x,y coordinate, an empty array is returned. If Both, x
* and y coordinates are null, null is returned.
*/
So after you use getDataTm to see approximately where the text should be, you can use getTextXY to extract text from other files that have the "same" format, using $xError and $yError to define how close to the coordinates a piece of text must be to count as a match. This is important because the text is in boxes, so longer or shorter text changes the coordinates that you get from getDataTm.
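To make the role of the tolerances concrete, here is a small sketch of the "near" test applied to data shaped like the getDataTm output. `findTextNear` is a hypothetical stand-in written for this example, not the library's own implementation, and the coordinates are made up. With the library itself, the equivalent step would presumably be a call to getTextXY with the same four arguments.

```php
<?php
/**
 * Hypothetical stand-in for the "near" test described in the doc comment;
 * it is not the library's implementation.
 *
 * @param array      $dataTm Items shaped like [ [a, b, c, d, e, f], text ]
 * @param float|null $x      Target X, or null to match on Y only (same row)
 * @param float|null $y      Target Y, or null to match on X only (same column)
 * @param float      $xError Allowed distance from $x
 * @param float      $yError Allowed distance from $y
 */
function findTextNear(array $dataTm, ?float $x, ?float $y, float $xError = 0.0, float $yError = 0.0): array
{
    if ($x === null && $y === null) {
        return []; // this sketch returns an empty array when there is nothing to search for
    }

    $found = [];
    foreach ($dataTm as $item) {
        $itemX = (float) $item[0][4];
        $itemY = (float) $item[0][5];
        $xOk = $x === null || abs($itemX - $x) <= $xError;
        $yOk = $y === null || abs($itemY - $y) <= $yError;
        if ($xOk && $yOk) {
            $found[] = $item[1];
        }
    }

    return $found;
}

// Made-up positions of the same field in two files with the "same" layout.
$fileA = [[['1', '0', '0', '1', '201.96', '720.68'], 'Invoice 0042']];
$fileB = [[['1', '0', '0', '1', '198.30', '720.68'], 'Invoice 0107 (copy)']];

// Coordinates read from file A, with a tolerance of 5 units in X and 1 in Y.
print_r(findTextNear($fileA, 201.96, 720.68, 5.0, 1.0));
print_r(findTextNear($fileB, 201.96, 720.68, 5.0, 1.0));

// With zero tolerance the shifted text in file B is missed.
print_r(findTextNear($fileB, 201.96, 720.68));
```

The tolerance values here are arbitrary; in practice they would be chosen by comparing the getDataTm output of a few sample files.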
I hope this helps!
Hi, great work! I have two questions: 1) what is the unit of measure of the coordinates collected by getDataTm? 2) is there a method to find a portion of text given its coordinates?
Thanks!