mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Retrieve bounding box of text on a page #5643

Closed nschloe closed 5 years ago

nschloe commented 9 years ago

I would like to determine the margins of the text in a PDF document. One possibility would be to render the PDF and look at the text layer of each page, specifically the positions of their div children (which represent rows of text). That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

yurydelendik commented 9 years ago

You would want to use getTextContent() https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L812 instead -- it is used by TextLayerBuilder. TextItem has transform and width/height https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L567. Does this help?

That strikes me as a little too cumbersome, though. Is there a way to retrieve the bounding box of all text on a page from the PDFJS object?

That's not a use case that is needed for the PDF Viewer (yet). You have to implement it on your side.

nschloe commented 9 years ago

@yurydelendik Thanks, that'll do the trick. Is there documentation on transform (an array of length 6)? I suppose those are entries in a transformation matrix, but I'm not sure which.

nschloe commented 9 years ago

I just found CSS3's transform and assume that the same logic is applied. http://www.w3schools.com/cssref/css3_pr_transform.asp https://dev.opera.com/articles/understanding-the-css-transforms-matrix/

nschloe commented 9 years ago

Is the unit of the translations px or some PDF-related unit (e.g., in)?

yurydelendik commented 9 years ago

PDF user units. You have to use PageViewport to map them to the screen presentation, see e.g. https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159

nschloe commented 9 years ago

Hm, I can't seem to get it right. The transform property on the individual element reads

element transform: [9.9626, 0, 0, 9.9626, 74.558, 673.034]

where I don't know how to interpret it. I'm certainly not supposed to stretch the element by a factor of almost 10, right? Curiously, the 9.9626 always coincides with the height property. What to make of this?

yurydelendik commented 9 years ago

Sorry, we did not extend the documentation of getTextContent that far (we are hoping to improve the API). Currently we operate under the assumption that users of the advanced features will have some knowledge of computer graphics. Let me help, based on the code at https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L159, and let's start with a simple example:

var text; // pdfPage.getTextContent(function (t) { text = t; });
var viewport = pdfPage.getViewport(1.0 /* scale */);
// find all text start points
var xs = [], ys = [];
text.items.forEach(function (item) {
  var tx = PDFJS.Util.transform(viewport.transform, item.transform);
  // avoiding tx * [0,0,1] taking x, y directly from the transform
  xs.push(tx[4]); ys.push(tx[5]);
});
var boundsOfStartPoints = [Math.min.apply(null, xs), Math.min.apply(null, ys), Math.max.apply(null, xs), Math.max.apply(null, ys)];

But for your task you need to take into account all four corners of the rectangle, and the tx calculation must be performed a little differently. It's unfortunate width and height are scaled by fontHeight, but you can easily revert that and calculate the positions of [0,0], [width/fontHeight,0], [0,height/fontHeight], and [width/fontHeight, height/fontHeight] for the bounds calculation.
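
A minimal sketch of that four-corner calculation, written so that it runs without pdf.js. The helpers multiply, applyToPoint, and itemBounds are illustrative names, not part of the pdf.js API; multiply is meant to mirror what PDFJS.Util.transform does. The viewport matrix and the text item are made-up values for an unrotated page 792 units tall at scale 1, and the font height is taken from item.transform on the assumption that this keeps it in the same units as width and height.

```javascript
// Hypothetical helpers; pdf.js itself is not required to run this sketch.

// Compose two [a, b, c, d, e, f] matrices so that m2 is applied first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point [x, y] through an [a, b, c, d, e, f] matrix.
function applyToPoint(p, m) {
  return [p[0] * m[0] + p[1] * m[2] + m[4], p[0] * m[1] + p[1] * m[3] + m[5]];
}

// Bounding box [minX, minY, maxX, maxY] of one text item in viewport space.
function itemBounds(item, viewportTransform) {
  const tx = multiply(viewportTransform, item.transform);
  // Assumption: the font height comes from item.transform, so that it is in
  // the same (PDF user) units as item.width and item.height.
  const fontHeight = Math.hypot(item.transform[2], item.transform[3]);
  const w = item.width / fontHeight;
  const h = item.height / fontHeight;
  const corners = [[0, 0], [w, 0], [0, h], [w, h]].map((p) => applyToPoint(p, tx));
  const xs = corners.map((c) => c[0]);
  const ys = corners.map((c) => c[1]);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// Made-up item on an unrotated page 792 units tall, viewport scale 1.
const viewportTransform = [1, 0, 0, -1, 0, 792];
const item = { transform: [10, 0, 0, 10, 100, 700], width: 50, height: 10 };
const bounds = itemBounds(item, viewportTransform);
console.log(bounds);
```

For the made-up item this gives [100, 82, 150, 92]. Taking the min/max over these per-item bounds for all items on a page would then give the bounding box of all text on that page.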

nschloe commented 9 years ago

It's unfortunate width and height are scaled by fontHeight,

Aha, I didn't know that. How do I retrieve the font height of an object? Does that coincide with content.items[i].height?

yurydelendik commented 9 years ago

From https://github.com/mozilla/pdf.js/blob/master/web/text_layer_builder.js#L164: var fontHeight = Math.sqrt((tx[2] * tx[2]) + (tx[3] * tx[3]));

nschloe commented 9 years ago

When inspecting tx[4] of an element on the PDF, I notice that the actual x-y-pixel-coordinates of the top-left point do not coincide with tx[4] as computed from

tx = PDFJS.Util.transform(
  viewport.transform,
  content.items[i].transform
);

tx[4] = 76.6...

while the actual translation in x-direction exceeds 100px.

Any idea what might be causing this?

nschloe commented 9 years ago

I created a little plunk at http://plnkr.co/edit/m9oPJg80XHeeQ1nquxkz which confirms exactly what you're saying, @yurydelendik. However, in my full application, I still get the wrong offsets. I'm investigating why, and it might be due to the pdf_viewer I'm using. I'd like to create a plunk with it, too, but I can't find pdf_viewer.js served from github.io like http://mozilla.github.io/pdf.js/build/pdf.js or http://mozilla.github.io/pdf.js/build/pdf.worker.js. Any pointer?

TuningGuide commented 8 years ago

@nschloe it seems like you did some work on text extraction. How far did you get?

jmlsf commented 7 years ago

Here is how I approached this. First, I ignore the scale factor in the textContent transform. (The transformation matrix provided in textContent.items[x].transform doesn't make any sense to me because it sets a scale in both the x and y directions equal to the font height. I don't know why you'd ever want to do canvas operations in those units. But that's beside the point.)

The numbers that matter are:

const item = textContent.items[0];
const transform = item.transform;
const x = transform[4];
const y = transform[5];
const width = item.width;
const height = item.height;

In order to do any operations on the canvas using these values, you have to (1) convert the y coordinate from the PDF origin to the canvas origin and (2) scale the whole thing by whatever you've scaled the viewport (i.e. by whatever you passed to getViewport). So I do this:

convertToCanvasCoords([x, y, width, height]) {
  const { scale } = this;
  return [x * scale, this.canvas.height - ((y + height) * scale), width * scale, height * scale];
}

Where this.scale is the same number I passed to getViewport. Then the following draws an accurate box around the text:

ctx.strokeRect(...this.convertToCanvasCoords([x, y, width, height]));

Note, I had to adjust y again by the height of the box because strokeRect wants the top left corner and even after adjusting for the PDF origin issue, what you end up with is the bottom left corner of the box. So you add the height to get the top, then scale, then fix for origin. There's probably a cleaner way of doing this, but this works, and it has the advantage that I kind of understand what's going on. :) Hope that helps.
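
The same conversion as a standalone function with a worked example, as a sketch. toCanvasRect is an illustrative name; it takes the scale and canvas height as parameters instead of reading them from this. The numbers are made up, and the sketch assumes an unrotated page whose canvas height equals the scaled page height.

```javascript
// Convert [x, y, width, height] from PDF units (origin bottom-left, y up)
// to canvas pixels (origin top-left, y down).
// Assumption: no page rotation, canvasHeight === pageHeight * scale.
function toCanvasRect([x, y, width, height], scale, canvasHeight) {
  return [
    x * scale,
    // (y + height) is the top edge in PDF space; scale it, then flip it.
    canvasHeight - (y + height) * scale,
    width * scale,
    height * scale,
  ];
}

// Made-up values: a 792-unit-tall page rendered at scale 2 (canvas 1584 px tall).
const rect = toCanvasRect([100, 700, 50, 10], 2, 1584);
console.log(rect);
```

Here the top edge sits at 710 in PDF units, which is 1420 px from the bottom at scale 2, so 164 px from the top of the canvas: rect is [200, 164, 100, 20].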

knowtheory commented 5 years ago

For others digging around for what pdf.js is actually doing with transformation vectors, the PDF Reference includes a definition of how transformation vectors are laid out and how they relate to mapping into a two dimensional coordinate space.

Specifically, the components of a transformation matrix are described on page 142:

  • Translations are specified as [ 1 0 0 1 tx ty ], where tx and ty are the distances to translate the origin of the coordinate system in the horizontal and vertical dimensions, respectively.
  • Scaling is obtained by [sx 0 0 sy 0 0]. This scales the coordinates so that 1 unit in the horizontal and vertical dimensions of the new coordinate system is the same size as sx and sy units, respectively, in the previous coordinate system.
  • Rotations are produced by [cos θ sin θ −sin θ cos θ 0 0], which has the effect of rotating the coordinate system axes by an angle θ counterclockwise.
  • Skew is specified by [1 tan α tan β 1 0 0], which skews the x axis by an angle α and the y axis by an angle β.

(there's an accompanying chart in the reference as well)

And the vector itself is defined thus:

PDF represents coordinates in a two-dimensional space. The point (x, y) in such a space can be expressed in vector form as [x y 1]. The constant third element of this vector (1) is needed so that the vector can be used with 3-by-3 matrices in the calculations described below. The transformation between two coordinate systems is represented by a 3-by-3 transformation matrix written as

[ed: pretend this is a matrix]

a b 0
c d 0
e f 1

Because a transformation matrix has only six elements that can be changed, it is usually specified in PDF as the six-element array [a b c d e f].
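
To make the layout concrete: multiplying [x y 1] by the matrix above gives x' = a*x + c*y + e and y' = b*x + d*y + f. A small sketch of that, applied to the transform quoted earlier in this thread (applyMatrix is an illustrative helper, not a pdf.js function):

```javascript
// Apply a PDF-style matrix [a, b, c, d, e, f] to the point [x, y].
function applyMatrix([x, y], [a, b, c, d, e, f]) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// The item transform reported earlier in the thread: scale by 9.9626, then translate.
const m = [9.9626, 0, 0, 9.9626, 74.558, 673.034];

// The origin of text space lands on the translation part (e, f).
const origin = applyMatrix([0, 0], m);
// One text-space unit in x and y spans 9.9626 PDF user units.
const unit = applyMatrix([1, 1], m);
console.log(origin, unit);
```

Read this way, a and d in that example are the size of one text-space unit (the font height), not a stretch to apply to an already rendered element, while e and f locate the start point of the text.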

timvandermeij commented 5 years ago

Closing since this is answered now.

haijun-ucsd commented 2 years ago

I have a use case where I would like to get the bounding box of each word within an item. For example, the str of an item can be "hello world!", but the transformation only gives the coordinates of the entire string. In my use case, I would like to get the coordinates of each of the "hello", "world", and "!". These words are not selected or highlighted.

BernhardBehrendt commented 11 months ago

@haijun-ucsd have you found a way to achieve this?

snowfluke commented 9 months ago

@haijun-ucsd have you found a way to achieve this?

@BernhardBehrendt You could do something like this:

  1. Get the width of the item, for example x1-x0
  2. Divide the width by the total number of characters in that item, including spaces and symbols (this gives you the average width of each character)
  3. Multiply the average width by the index of the character in the item and you will get its x offset from the start of the item
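
The steps above can be sketched as follows. estimateWordBoxes is a hypothetical helper, and the item is made up but shaped like a getTextContent() item (str, width, height, transform). The result is only an approximation: it assumes horizontal, unrotated text and treats every character as equally wide, which only holds for monospaced fonts. The regular expression splits on ASCII word characters, so non-ASCII text would need a different pattern.

```javascript
// Estimate per-word boxes inside one text item by spreading the item's
// width evenly over its characters (spaces and symbols included).
function estimateWordBoxes(item) {
  const x0 = item.transform[4];
  const y0 = item.transform[5];
  const avgCharWidth = item.str.length > 0 ? item.width / item.str.length : 0;
  const boxes = [];
  // Runs of word characters, or runs of punctuation, each count as one token.
  for (const match of item.str.matchAll(/\w+|[^\w\s]+/g)) {
    boxes.push({
      str: match[0],
      x: x0 + match.index * avgCharWidth,
      y: y0,
      width: match[0].length * avgCharWidth,
      height: item.height,
    });
  }
  return boxes;
}

// Made-up item: 12 characters over 120 units, so 10 units per character.
const sample = { str: "hello world!", width: 120, height: 10, transform: [10, 0, 0, 10, 100, 700] };
const boxes = estimateWordBoxes(sample);
console.log(boxes);
```

With these numbers "hello" starts at x = 100, "world" at x = 160, and "!" at x = 210, all in PDF user units. For proportional fonts the estimate drifts, and measuring the substrings with the actual font would likely be more accurate.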