Closed kalley closed 3 years ago
So, I think I figured it out, but I'm sure there's a better way to do this (plus, I don't happen to have a pdf with vertical font at the moment).
from
    tx = (w0 * textState.fontSize + textState.charSpacing) *
         textState.textHScale;
to
    tx = (w0 * textState.fontSize + (i + textChunk.str.length ? textState.charSpacing : 0)) *
         textState.textHScale;
Since the word happens to be broken into 3 different pieces, I had to check textChunk.str.length as well as i in the loop.
So what you do is basically skip the charSpacing only when i is zero and the chunk length is zero? If that's the case, we might formulate this more explicitly.
That is correct. This way the charSpacing is applied only between glyphs and doesn't add to (or in my case, subtract from) the last glyph.
So I could say: "Use charSpacing only if there are chars". Sounds totally reasonable.
yep.
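To make the effect of the change above concrete, here is a minimal stand-alone sketch of the two variants. It is not pdf.js code: `glyphAdvance`, `chunkWidth`, the `patched` flag and the `pieces` array are hypothetical names, and `chunkLength` only stands in for `textChunk.str.length`. Glyph widths are assumed to be in text space units already.

```javascript
// Hypothetical model of the advance computation discussed above; not pdf.js code.
function glyphAdvance(w0, i, chunkLength, textState, patched) {
  // Patched variant: skip charSpacing only for the very first glyph of the
  // chunk, i.e. when both the loop index and the accumulated length are zero.
  const spacing =
    patched && i + chunkLength === 0 ? 0 : textState.charSpacing;
  return (w0 * textState.fontSize + spacing) * textState.textHScale;
}

// `pieces` models a word that arrives as several separate glyph runs.
function chunkWidth(pieces, textState, patched) {
  let width = 0;
  let chunkLength = 0; // stands in for textChunk.str.length
  for (const piece of pieces) {
    for (let i = 0; i < piece.length; i++) {
      width += glyphAdvance(piece[i], i, chunkLength, textState, patched);
    }
    chunkLength += piece.length;
  }
  return width;
}
```

Under these assumptions the two variants differ by exactly one charSpacing (scaled by textHScale), no matter how many pieces the word is split into, which is why both i and the chunk length have to be checked.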
@kalley can you reproduce this with a scanned PDF which has no copyright-protected content? Maybe re-scan only the foreword title?
Not necessary anymore; @Snuffleupagus contributed a PDF.
thanks guys!
While working on #6588, we found that the solution suggested here breaks the spec definition: "When the glyph for each character in the string is rendered, Tc shall be added ...". As a result, letter alignment is currently really off due to artificial padding at the start of the text run. See the figure at https://github.com/mozilla/pdf.js/pull/6588#issuecomment-153362265. I'll re-open this issue and see if it can be fixed in #6590 (or after it lands).
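For reference, the rule quoted above corresponds to the horizontal glyph displacement formula in the text-showing section of the PDF specification. Symbols follow the spec: w0 is the glyph width, Tj the adjustment from a TJ array, Tfs the font size, Tc the character spacing, Tw the word spacing and Th the horizontal scaling. The second line is only an illustration under the assumption that Tj = 0 and Tw = 0 for a run of n glyphs; it shows that the spec total contains n times Tc, whereas the earlier workaround effectively used n - 1.

```latex
t_x = \left( \left( w_0 - \frac{T_j}{1000} \right) T_{fs} + T_c + T_w \right) T_h

\sum_{k=1}^{n} t_x^{(k)} = \left( T_{fs} \sum_{k=1}^{n} w_0^{(k)} + n \, T_c \right) T_h
\quad \text{(assuming } T_j = 0,\ T_w = 0 \text{)}
```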
The padding from the last char must be saved and used by the text layer builder to extend the selection. It can be done e.g. by using letter-spacing (see https://gist.github.com/yurydelendik/aa77f7cab933522c7850)
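A rough sketch of what "saving the padding from the last char" could look like on the measuring side. This is an assumption for illustration only, not the contents of the gist and not pdf.js code: `measureRun` and the returned field names are hypothetical. Every glyph gets Tc as the spec requires, and the spacing that follows the last glyph is reported separately so a text layer could decide how to use it.

```javascript
// Hypothetical sketch; names are illustrative and not taken from pdf.js or the gist.
function measureRun(glyphWidths, textState) {
  let advance = 0;
  for (const w0 of glyphWidths) {
    // Spec-style advance: charSpacing (Tc) is added for every glyph.
    advance +=
      (w0 * textState.fontSize + textState.charSpacing) * textState.textHScale;
  }
  // Spacing that follows the last glyph, kept separately instead of dropped.
  const trailingPadding =
    glyphWidths.length > 0 ? textState.charSpacing * textState.textHScale : 0;
  return {
    advance, // where the next run starts, per the spec rule
    trailingPadding, // saved for the text layer builder
    visibleWidth: advance - trailingPadding, // extent without trailing spacing
  };
}
```

With a negative charSpacing, as in the OCR document from this issue, trailingPadding is negative, so visibleWidth comes out larger than advance; keeping both values avoids having to choose one at measuring time.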
Closing since the documents are no longer available and this case got improved further in #12896.
To get highlighting correct on OCR documents, we size the text to its bounding box and also use charSpacing to ensure the fit. On some documents with extreme negative charSpacing, the measurement of the text in the evaluator is incorrect. In our case, the charSpacing for the word in question is set to -7.92475.
Here is the original document: https://drive.google.com/file/d/0B7gKNRUF3HwvTXpYSkx1WldSeFE/view?usp=sharing
Here's a version of the pdf with the OCR'ed text overlaid in the bounding box: https://drive.google.com/file/d/0B7gKNRUF3HwvenRmTnBLanVlckk/view?usp=sharing
And here's a screenshot of the selected text after pdfjs has rendered the document:
It is really the "Foreword" text that we're looking at, as everything else is at least close enough to not be noticeable. I have verified that the rendered text matches the width returned by pdfjs.