Closed nschloe closed 5 years ago
You would want to use getTextContent() https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L812 instead -- it is used by TextLayerBuilder. TextItem has transform and width/height https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L567. Does this help?
That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?
That's not a use case that is needed for the PDF Viewer (yet). You have to implement it on your side.
@yurydelendik Thanks, that'll do the trick.
Is there documentation on transform (an array of length 6)? I suppose those are entries in a transformation matrix, but I'm not sure which.
I just found CSS3's transform and assume that the same logic is applied.
http://www.w3schools.com/cssref/css3_pr_transform.asp
https://dev.opera.com/articles/understanding-the-css-transforms-matrix/
Is the unit of the translations px or some PDF-related unit (e.g., in)?
PDF user unit, you have to use PageViewport to map it to the screen presentation, see e.g. https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159
Hm, I can't seem to get it right. The transform property on the individual element reads
element transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034]
where I don't know how to interpret it. I'm certainly not supposed to stretch the element by a factor of almost 10, right? Curiously, the 9.9626 always coincides with the height property. What to make of this?
Sorry, we did not extend the documentation of getTextContent that far (we are hoping to improve the API). Currently we operate under the assumption that users of the advanced features will have some knowledge of computer graphics. Let me help, based on the code at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159, and let's start with a simple example:
var text; // pdfPage.getTextContent(function (t) { text = t; });
var viewport = pdfPage.getViewport(1.0 /* scale */);
// find all text start points
var xs = [], ys = [];
text.items.forEach(function (item) {
  var tx = PDFJS.Util.transform(viewport.transform, item.transform);
  // avoiding tx * [0,0,1] taking x, y directly from the transform
  xs.push(tx[4]); ys.push(tx[5]);
});
var boundsOfStartPoints = [Math.min.apply(null, xs), Math.min.apply(null, ys), Math.max.apply(null, xs), Math.max.apply(null, ys)];
But for your task you need to take into account all four corners of the rectangle, and the tx calculation must be performed a little differently. It's unfortunate width and height are scaled by fontHeight, but you can easily revert that and calculate the positions of [0,0], [width/fontHeight,0], [0,height/fontHeight], [width/fontHeight, height/fontHeight] for the bounds calculation.
It's unfortunate width and height are scaled by fontHeight,
Aha, I didn't know that. How do I retrieve the font height of an object? Does that coincide with content.items[i].height?
From https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164: var fontHeight = Math.sqrt((tx[2] * tx[2]) + (tx[3] * tx[3]));
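Putting the two hints together (all four corners, and width and height divided by the font height), a bounding box over all text items could be sketched roughly as follows. This is a library-free sketch under stated assumptions: multiply and applyToPoint are hypothetical helpers intended to mirror PDFJS.Util.transform and the tx * [x, y, 1] product, the font height is taken from item.transform rather than from the combined tx (an assumption, so that the result also holds for viewport scales other than 1), and the sample viewport matrix assumes an unrotated page 792 units high at scale 1.

```javascript
// Compose two matrices [a, b, c, d, e, f]; the result applies m2 first, then m1.
// Hypothetical stand-in for PDFJS.Util.transform.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point [x, y] through a matrix [a, b, c, d, e, f].
function applyToPoint(p, m) {
  return [p[0] * m[0] + p[1] * m[2] + m[4], p[0] * m[1] + p[1] * m[3] + m[5]];
}

// Returns [minX, minY, maxX, maxY] in viewport coordinates.
function textBounds(items, viewportTransform) {
  const xs = [];
  const ys = [];
  for (const item of items) {
    const t = item.transform;
    // Assumption: font height comes from the item's own transform.
    const fontHeight = Math.sqrt(t[2] * t[2] + t[3] * t[3]);
    const w = item.width / fontHeight;
    const h = item.height / fontHeight;
    const tx = multiply(viewportTransform, t);
    for (const corner of [[0, 0], [w, 0], [0, h], [w, h]]) {
      const [x, y] = applyToPoint(corner, tx);
      xs.push(x);
      ys.push(y);
    }
  }
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Made-up sample data: one item with the transform quoted earlier in the thread.
const sampleViewport = [1, 0, 0, -1, 0, 792];
const sampleItems = [
  { transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034], width: 50, height: 9.9626 },
];
const bounds = textBounds(sampleItems, sampleViewport);
console.log(bounds);
```

With real data, items would be text.items and viewportTransform would be viewport.transform. Items with a zero font height are not handled here.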
When inspecting tx[4] of an element on the PDF, I notice that the actual x-y pixel coordinates of the top-left point do not coincide with tx[4] as computed from
tx = PDFJS.Util.transform(
  viewport.transform,
  content.items[i].transform
);
See:
tx[4] = 76.6...
while the actual translation in x-direction exceeds 100px.
Any idea what might be causing this?
I created a little plunk at http://plnkr.co/edit/m9oPJg80XHeeQ1nquxkz which confirms exactly what you're saying, @yurydelendik. However, in my full application, I still get the wrong offsets. I'm investigating why, and it might be due to the pdf_viewer I'm using. I'd like to create a plunk with it, too, but I can't find pdf_viewer.js served from github.io like http://mozilla.github.io/pdf.js/build/pdf.js or http://mozilla.github.io/pdf.js/build/pdf.worker.js. Any pointer?
@nschloe it seems like you did some work for text extraction. How far did you come?
Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn't make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don't know why you'd ever want to do canvas operations in those units. But that's beside the point.)
The numbers that matter are:
const item = textContent.items[0];
const transform = item.transform;
const x = transform[4];
const y = transform[5];
const width = item.width;
const height = item.height;
In order to do any operations on the canvas using these values, you have to (1) fix the y coordinate from the PDF origin to the canvas origin and (2) scale the whole thing by whatever you've scaled the viewport (i.e. by whatever you passed to getViewport). So I do this:
convertToCanvasCoords([x, y, width, height]) {
  const { scale } = this;
  return [x * scale, this.canvas.height - ((y + height) * scale), width * scale, height * scale];
}
Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:
ctx.strokeRect(...this.convertToCanvasCoords([x, y, width, height]));
Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There's probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what's going on. :) Hope that helps.
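For anyone who wants to try the conversion outside of a class, here is a standalone sketch of the same arithmetic. The scale and canvasHeight parameters replace this.scale and this.canvas.height and are assumptions of this sketch, as are the sample numbers. It also assumes an unrotated page whose origin sits at the bottom-left corner and a canvas whose height equals the page height times the scale.

```javascript
// Convert a PDF-space box [x, y, width, height] (origin bottom-left, y pointing up)
// into [left, top, width, height] in canvas space (origin top-left, y pointing down).
function convertToCanvasCoords([x, y, width, height], scale, canvasHeight) {
  return [
    x * scale,
    // (y + height) is the top edge in PDF space; scale it, then flip the axis.
    canvasHeight - (y + height) * scale,
    width * scale,
    height * scale,
  ];
}

// Made-up sample: a 792-unit-high page rendered at scale 1.5.
const scale = 1.5;
const canvasHeight = 792 * scale;
const rect = convertToCanvasCoords([74.558, 673.034, 50, 9.9626], scale, canvasHeight);
console.log(rect);
```

The resulting array can be spread into ctx.strokeRect(...rect) on a 2D canvas context.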
For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two dimensional coordinate space.
Specifically, the components of a transformation matrix are described on page 142:
- Translations are specified as [1 0 0 1 tx ty], where tx and ty are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.
- Scaling is obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.
- Rotations are produced by [cos θ sin θ −sin θ cos θ 0 0], which has the effect of rotating the coordinate system axes by an angle θ counterclockwise.
- Skew is specified by [1 tan α tan β 1 0 0], which skews the x axis by an angle α and the y axis by an angle β.
(there's an accompanying chart in the reference as well)
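The four cases above translate directly into six-element arrays. The helper names below (translation, scaling, rotation, skew) are made up for this sketch and are not part of pdf.js. Angles are in radians.

```javascript
// Builders for the six-element arrays [a, b, c, d, e, f] listed above.
const translation = (tx, ty) => [1, 0, 0, 1, tx, ty];
const scaling = (sx, sy) => [sx, 0, 0, sy, 0, 0];
const rotation = (theta) => [
  Math.cos(theta), Math.sin(theta),
  -Math.sin(theta), Math.cos(theta),
  0, 0,
];
const skew = (alpha, beta) => [1, Math.tan(alpha), Math.tan(beta), 1, 0, 0];

console.log(translation(74.558, 673.034));
console.log(scaling(9.9626, 9.9626));
console.log(rotation(Math.PI / 2)); // quarter turn counterclockwise
console.log(skew(Math.PI / 4, 0));
```

Read this way, a text item transform such as [9.9626, 0, 0, 9.9626, 74.558, 673.034] has the scaling entries in positions a and d and the translation entries in positions e and f.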
And the vector itself is defined thus:
PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below. The transformation between two coordinate systems is represented by a 3-by-3 transformation matrix written as
a b 0
c d 0
e f 1
Because a transformation matrix has only six elements that can be changed, it is usually specified in PDF as the six-element array [a b c d e f].
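To make the definition concrete, multiplying the row vector [x y 1] by that matrix gives x' = a*x + c*y + e and y' = b*x + d*y + f. A minimal sketch, where transformPoint is a made-up helper name and the sample array is the item transform quoted earlier in this thread:

```javascript
// Row vector [x, y, 1] times the 3-by-3 matrix with rows [a b 0], [c d 0], [e f 1].
function transformPoint([x, y], [a, b, c, d, e, f]) {
  return [a * x + c * y + e, b * x + d * y + f];
}

const itemTransform = [9.9626, 0, 0, 9.9626, 74.558, 673.034];

// The origin of the item's coordinate system lands on the translation part (e, f).
const origin = transformPoint([0, 0], itemTransform);
// One unit along each axis is stretched by the scale entries a and d.
const unitCorner = transformPoint([1, 1], itemTransform);

console.log(origin);
console.log(unitCorner);
```

This is one way to read the "factor of almost 10" asked about earlier: the matrix maps one unit of the text item's own coordinate system to 9.9626 PDF units, it does not ask you to stretch an element that is already measured in PDF units.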
Closing since this is answered now.
I have a use case where I would like to get the bounding box of each word within an item. For example, the str of an item can be "hello world!", but the transformation only gives the coordinates of the entire string. In my use case, I would like to get the coordinates of each of the "hello", "world", and "!". These words are not selected or highlighted.
@haijun-ucsd have you found a way to achieve this?
@BernhardBehrendt You could do something like this:
x1-x0
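A crude way to approximate per-word boxes, offered only as a sketch and not necessarily what was meant above, is to split the item's width in proportion to character count. Everything here is an assumption: approximateWordBoxes is a made-up helper, the text is assumed to be unrotated, and every character is assumed to have the same advance width, which is wrong for proportional fonts. Accurate boxes would need real glyph widths from the font.

```javascript
// Rough approximation: divide an item's width evenly across its characters,
// then report one box per word or punctuation run, in the item's own units.
function approximateWordBoxes(item) {
  const x0 = item.transform[4];
  const y0 = item.transform[5];
  const charWidth = item.str.length > 0 ? item.width / item.str.length : 0;
  // Words are runs of \w characters (ASCII only); punctuation runs are separate tokens.
  const tokenPattern = /\w+|[^\w\s]+/g;
  const boxes = [];
  let match;
  while ((match = tokenPattern.exec(item.str)) !== null) {
    boxes.push({
      word: match[0],
      x: x0 + match.index * charWidth,
      y: y0,
      width: match[0].length * charWidth,
      height: item.height,
    });
  }
  return boxes;
}

// Made-up sample item shaped like an entry of textContent.items.
const sampleItem = {
  str: "hello world!",
  transform: [10, 0, 0, 10, 100, 700],
  width: 60,
  height: 10,
};
const wordBoxes = approximateWordBoxes(sampleItem);
console.log(wordBoxes);
```

The boxes are in the same units as item.width and item.transform, so they still need the viewport mapping discussed earlier before being drawn on a canvas.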
I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of their div children (which represent rows of text). That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?