mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Extreme (or maybe normal) charSpacing does not calculate correctly #5972

Closed kalley closed 3 years ago

kalley commented 9 years ago

To get highlighting to work correctly on OCR documents, we size the text to its bounding box and also use charSpacing to ensure the fit. On some documents with extreme negative charSpacing, the evaluator measures the text incorrectly. In our case, the charSpacing for the word in question is set to -7.92475.

Here is the original document: https://drive.google.com/file/d/0B7gKNRUF3HwvTXpYSkx1WldSeFE/view?usp=sharing

Here's a version of the pdf with the OCR'ed text overlaid in the bounding box: https://drive.google.com/file/d/0B7gKNRUF3HwvenRmTnBLanVlckk/view?usp=sharing

And here's a screenshot of the selected text after pdfjs has rendered the document: [image not preserved]

It is really the "Foreword" text that we're looking at, as everything else is at least close enough to not be noticeable. I have verified that the text rendered is matched to the width returned from pdfjs.

kalley commented 9 years ago

So, I think I figured it out, but I'm sure there's a better way to do this (plus, I don't happen to have a pdf with vertical font at the moment).

I changed https://github.com/mozilla/pdf.js/blob/dfecfca266dfb9a2d5e1f1a67b0e75496626aec8/src/core/evaluator.js#L1016

from

tx = (w0 * textState.fontSize + textState.charSpacing) *
    textState.textHScale;

to

tx = (w0 * textState.fontSize + (i + textChunk.str.length ? textState.charSpacing : 0)) *
    textState.textHScale;

Since the word happens to be broken into 3 different pieces, I had to check the textChunk.str.length as well as the i in the loop.
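To make the effect of the change concrete, here is a minimal sketch that measures a run of glyphs both ways. `measureRun`, the `betweenOnly` flag, and the glyph widths are made up for illustration and are not part of pdf.js; only the charSpacing value comes from the report above.

```javascript
// Hypothetical helper, not part of pdf.js: total width of a run of glyphs.
// `widths` holds per-glyph widths (w0) in text-space units.
function measureRun(widths, fontSize, charSpacing, textHScale, betweenOnly) {
  let width = 0;
  for (let i = 0; i < widths.length; i++) {
    // With betweenOnly set, the first glyph skips the spacing, so n glyphs
    // contribute n - 1 spacings (this mirrors the patched expression).
    const spacing = (betweenOnly && i === 0) ? 0 : charSpacing;
    width += (widths[i] * fontSize + spacing) * textHScale;
  }
  return width;
}

const widths = [0.6, 0.5, 0.5, 0.5]; // made-up glyph widths
const everyGlyph = measureRun(widths, 20, -7.92475, 1, false);
const betweenGlyphs = measureRun(widths, 20, -7.92475, 1, true);
console.log(everyGlyph.toFixed(3), betweenGlyphs.toFixed(3));
```

The two results differ by one charSpacing (times textHScale), which is why a large negative value makes the measured width visibly too small when it is applied to every glyph.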

CodingFabian commented 9 years ago

So what you do is basically this: only when i is zero and the chunk is empty do you skip adding the charSpacing? If that's the case, we might formulate this more explicitly.

kalley commented 9 years ago

That is correct. This way the charSpacing is applied only between glyphs and doesn't add to (or in my case, subtract from) the width after the last glyph.

CodingFabian commented 9 years ago

So one could say: "Use charSpacing only if there are chars". Sounds totally reasonable.

kalley commented 9 years ago

yep.

CodingFabian commented 9 years ago

@kalley Can you reproduce this with a scanned PDF that has no copyright-protected content? Maybe re-scan only the foreword title?

CodingFabian commented 9 years ago

Not necessary anymore; @Snuffleupagus contributed a PDF.

kalley commented 9 years ago

thanks guys!

yurydelendik commented 8 years ago

While working on #6588, we found out that the solution suggested here breaks the spec definition: "When the glyph for each character in the string is rendered, Tc shall be added ...". So currently letter alignment is really off due to artificial padding at the start of the text run. See the figure at https://github.com/mozilla/pdf.js/pull/6588#issuecomment-153362265. I'll re-open this issue and see if it can be fixed in #6590 (or after it lands).
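For reference, the horizontal glyph displacement in the PDF specification (PDF 32000-1, text-positioning section) can be written as below, where T_fs is the font size, T_c the character spacing, T_w the word spacing, T_h the horizontal scaling, and T_j a TJ position adjustment. The symbols follow the spec's notation rather than the pdf.js variable names.

```latex
t_x = \left( \left( w_0 - \frac{T_j}{1000} \right) T_{fs} + T_c + T_w \right) T_h
```

With T_j = 0 and T_w = 0 this reduces to (w_0 T_fs + T_c) T_h, the shape of the original evaluator expression. T_c appears in the displacement of every glyph, including the first and the last.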

yurydelendik commented 8 years ago

The padding from the last char must be saved and used by the text layer builder to extend the selection. This can be done e.g. by using letter-spacing (see https://gist.github.com/yurydelendik/aa77f7cab933522c7850).
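A rough sketch of that idea, assuming the evaluator keeps the spec-style advance and reports the last glyph's spacing separately. `measureWithTrailingPadding`, `letterSpacingStyle`, and the `scale` factor (text-space units to CSS pixels) are hypothetical names for illustration. They are not taken from pdf.js or from the gist.

```javascript
// Hypothetical sketch, not the pdf.js implementation.
function measureWithTrailingPadding(widths, fontSize, charSpacing, textHScale) {
  let advance = 0;
  for (const w0 of widths) {
    // Spec-style advance: charSpacing is added for every glyph.
    advance += (w0 * fontSize + charSpacing) * textHScale;
  }
  // Spacing contributed by the last glyph; zero for an empty run.
  const trailingPadding = widths.length > 0 ? charSpacing * textHScale : 0;
  // inkWidth excludes the spacing that follows the last glyph.
  return { advance, trailingPadding, inkWidth: advance - trailingPadding };
}

// Assumed mapping from text-space spacing to a CSS declaration that a text
// layer could apply; `scale` stands in for the viewport scale.
function letterSpacingStyle(charSpacing, textHScale, scale) {
  return `letter-spacing: ${(charSpacing * textHScale * scale).toFixed(3)}px;`;
}

const run = measureWithTrailingPadding([0.5, 0.5], 10, 2, 1);
console.log(run, letterSpacingStyle(2, 1, 1.5));
```

Keeping `advance` untouched preserves glyph positioning as the spec describes it, while `trailingPadding` gives the text layer enough information to size the selection box.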

timvandermeij commented 3 years ago

Closing since the documents are no longer available and this case got improved further in #12896.