mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Space inserted between each letter in textLayer #6705

Closed Woodgnome closed 3 years ago

Woodgnome commented 8 years ago

PDF file: https://drive.google.com/open?id=1ne84gRIMnss30UeXA475A84AY5pmYJyY

The text layer is rendered with extra spaces between each letter, for example:

Correct text: "VELKOMMEN TIL ALMINDINGEN"

Rendered text: "V E L K O M M E N T I L A L M I N D I N G E N"

huyvandoan commented 8 years ago

I have got the same issue. Please try this file: https://drive.google.com/file/d/0BxnImIPU4vSKbDAzekNEMGlhcjQ/view?usp=sharing

PDF.js cannot find the phrase "IF E" when I copy that text from the viewer.

jasonparallel commented 8 years ago

@Woodgnome You might want to retest with the latest version. There seem to be fewer extra spaces in my tests.

Woodgnome commented 8 years ago

I just downloaded the original PDF and tried opening it in the public demo:

http://i.imgur.com/K4onsTC.png

Still broken from what I can see.

timvandermeij commented 8 years ago

I think I understand what's going on here. The PDF contains TJ operators where the glyphs are separated by the number -70. For the title, for example, we have:

(V)-70(E)-70(L)-70(K)-70(O)-70(M)-70(M)-70(E)-70(N)

According to the specification at https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf#page=407&zoom=auto,-246,31, this value is subtracted from the horizontal spacing, so we are actually moving 70 units to the right. PDF.js interprets this as if it needs to insert a space, since it probably exceeds the SPACE_FACTOR threshold. Not sure how to fix this (there are other issues regarding the SPACE_FACTOR heuristic, so it might need to be revisited), but at least this seems like the cause of the issue. This looks like a rather odd thing to do, but Okular seems to handle this well.
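As a rough illustration of the arithmetic described above, the sketch below applies the text-positioning formula from the PDF 1.7 reference, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, to the adjustment number alone. The function name `tjShift` and the font size of 12 are assumptions for illustration; neither is taken from the PDF in question or from PDF.js.

```javascript
// Horizontal shift (in text space units) caused by one number in a TJ array.
// Per the PDF 1.7 reference, the number is expressed in thousandths of a unit
// of text space and is subtracted from the current position, so a negative
// value moves the next glyph to the right (for horizontal writing).
function tjShift(adjustment, fontSize, horizontalScale = 1) {
  return (-adjustment / 1000) * fontSize * horizontalScale;
}

// Hypothetical font size of 12; the thread does not state the real one.
const shift = tjShift(-70, 12);
console.log(shift > 0 ? "moves right" : "moves left", shift.toFixed(2));
```

Whether a shift of this size counts as a space then depends entirely on the threshold it is compared against, which is where the SPACE_FACTOR heuristic comes in.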

Woodgnome commented 8 years ago

Sounds like the PDF is pretty fucked up, but regardless both Acrobat and Chrome PDF viewer seem to handle it as well.

Without knowing anything about SPACE_FACTOR, wouldn't you be able to compare [Left position] + [Width] of character to [Left position] of next character and determine if there should be a space or not depending on that? Or is that what SPACE_FACTOR is used to do already?
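The comparison proposed above can be sketched as follows. This only illustrates the idea and is not how PDF.js implements it; the glyph objects (`char`, `left`, `width`) and the `threshold` parameter are hypothetical.

```javascript
// Join glyphs into a string, inserting a space whenever the gap between the
// right edge of one glyph and the left edge of the next exceeds a threshold.
// The glyph shape { char, left, width } is an assumption for this sketch.
function joinByGap(glyphs, threshold) {
  let text = "";
  let previousRight = null;
  for (const glyph of glyphs) {
    if (previousRight !== null && glyph.left - previousRight > threshold) {
      text += " ";
    }
    text += glyph.char;
    previousRight = glyph.left + glyph.width;
  }
  return text;
}

// Made-up positions: a small gap inside the word, a large gap before "T".
const glyphs = [
  { char: "V", left: 0, width: 10 },
  { char: "E", left: 10.5, width: 10 },
  { char: "T", left: 30, width: 10 },
];
console.log(joinByGap(glyphs, 2));
```

The hard part is choosing `threshold`: any fixed value is a heuristic that will misjudge some documents.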

timvandermeij commented 8 years ago

I think that's what it does, but the problem is that it's a heuristic, so it won't be correct all the time. I wonder how other PDF viewers solve this.

TuningGuide commented 8 years ago

Comparing Left Positions unfortunately does not work. I tried it on: https://github.com/mozilla/pdf.js/issues/7327

timvandermeij commented 8 years ago

We might be able to look at how Poppler does it: https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc. It has some constants and logic to add spaces (https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L818) that might be different from how we do it.

javop commented 7 years ago

Having the same problem with spaces. Is there any solution or quick fix to this issue?

javop commented 7 years ago

Changing SPACE_FACTOR in pdf.worker from 0.3 to 0.5 fixed the problem for me, but I don't know how this change might affect other documents.

Edenharder commented 7 years ago

It seems this problem is not fixed in the pdf.js in Firefox 50.1.0.

dschissler commented 7 years ago

This is affecting a lot of my documents. Is there any way to fix this on the latest 1.7 builds?

Hikariii commented 6 years ago

This still seems to be an issue with 2.0. Any progress on text-layer rendering and the SPACE_FACTOR?

Hikariii commented 6 years ago

Same problem with this pdf: text-spacing-error.pdf

Configuration:

Steps to reproduce the problem:

  1. render the pdf with:
    page.getTextContent({
            normalizeWhitespace: true,
            disableCombineTextItems: true
        })
  2. merge the resulting item strings textContent.items[i].str
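The two steps above might look roughly like this. The helper name `joinTextItems` is hypothetical, and the sketch assumes that the result of getTextContent exposes an `items` array whose entries have a `str` property. The options are the ones quoted above and may not be available in other PDF.js versions.

```javascript
// Step 2: concatenate the strings of all text items without adding separators.
function joinTextItems(textContent) {
  return textContent.items.map((item) => item.str).join("");
}

// Step 1 needs a loaded PDF.js page object, so it is only shown as a comment:
// page
//   .getTextContent({ normalizeWhitespace: true, disableCombineTextItems: true })
//   .then((textContent) => console.log(joinTextItems(textContent)));

// Stand-in data shaped like the assumed result, for trying the helper offline.
const sample = { items: [{ str: "D" }, { str: " " }, { str: "E" }] };
console.log(joinTextItems(sample));
```

Because the helper adds nothing between items, any space in the merged string comes from the extracted items themselves.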

What is the expected behavior? The starting text for the second page must be:

"- 5 1 DE ORGANISATIE EN DE PROBLEEMSTELLING IN HAAR CONTEXT"

What went wrong? The starting text for the second page is:

"- 5 1 D E ORGANISAT IE EN D E P ROBL EEM STEL L ING IN HAAR CONTEXT"

Spaces are added within the words.

The code that adds the space WITHIN the word is: https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L1656-L1658 The given width (advance) is a tiny bit bigger than textContentItem.fakeSpaceMin.

@timvandermeij your statement about the advance being bigger than the spaceWidth * SPACE_FACTOR is correct. Solution here is to set SPACE_FACTOR to 0.4. This renders the words perfectly.
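To make the threshold comparison concrete, here is a minimal sketch of the check as it is described in this thread (the advance compared against spaceWidth * SPACE_FACTOR). The function name `exceedsFakeSpaceMin` and all numbers are made up for illustration; the actual logic lives in src/core/evaluator.js and may differ in detail.

```javascript
// Hypothetical restatement of the check discussed in this thread: a gap is
// treated as a space once the advance exceeds spaceWidth * spaceFactor.
function exceedsFakeSpaceMin(advance, spaceWidth, spaceFactor) {
  const fakeSpaceMin = spaceWidth * spaceFactor;
  return advance > fakeSpaceMin;
}

// Made-up numbers: a space glyph 250 units wide and a letter gap of 80 units.
for (const factor of [0.3, 0.35, 0.4, 0.5]) {
  console.log(factor, exceedsFakeSpaceMin(80, 250, factor));
}
```

With these made-up numbers only the factor 0.3 classifies the gap as a space, which shows how a small change in the factor can flip the result for gaps that sit close to the threshold.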

This is an issue within the default pdf.js viewer.

Hikariii commented 6 years ago

In all the presented cases the issue is with capitalized words. Maybe a solution is to introduce a bigger fakeCapSpaceMin to compare with when the glyph is capitalized and leave the SPACE_FACTOR as is.

Snuffleupagus commented 6 years ago

> Maybe a solution is to introduce a bigger fakeCapSpaceMin to compare with when the glyph is capitalized and leave the SPACE_FACTOR as is.

In general, given how common it is for PDF generators to provide incomplete/inconsistent/incorrect font data, attempting to do any sort of lower/upper-case detection is unfortunately quite likely to cause more issues than it solves.

EDIT: Not to mention that adding yet another heuristic, tuned for a particular set of PDF files, probably won't be a good solution in the general case.

Hikariii commented 6 years ago

Looking at the reference code from @timvandermeij I see a very different approach to calculating the min space: https://github.com/danigm/poppler/blob/0011805e22193b690b99a53dcb9986ce04eb3eb4/poppler/TextOutputDev.cc#L749-L773

Why does pdf.js use font.spaceWidth, and what does font.spaceWidth mean exactly? If this is the width of a space, why is it multiplied by 0.3 to check whether a distance between characters is actually a space? I would assume that if font.spaceWidth is the (approximate) width of a space, the comparison should be with the actual value (or a value close to this width), not with 0.3 times it.

Hikariii commented 6 years ago

One more note on this: The SPACE_FACTOR was changed in this commit https://github.com/mozilla/pdf.js/commit/109d67691c866b2c7001524e49c3e53ff9edd762.

The test PDF that was the origin of this change is unrecoverable. Even the small change of reverting this factor to 0.35 solves my problem.

Is anyone able to deduce for which cases this threshold needs to be smaller?

Hikariii commented 6 years ago

Maybe https://github.com/euske/pdfminer, a Python-based text extraction tool, is a good reference too.

This tool does not insert any space character when extracting text from a Tj instruction: https://github.com/euske/pdfminer/blob/44977b6726640933d86028d16ca06fab5ea26d1a/pdfminer/pdfinterp.py#L753-L766

The render_string code just renders the characters in the Tj sequence: https://github.com/euske/pdfminer/blob/44977b6726640933d86028d16ca06fab5ea26d1a/pdfminer/pdfdevice.py#L89-L102

Edit: This tool does insert spaces. Its code uses the character's width and height to calculate a margin, not a font.spaceWidth: https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L369-L375
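A minimal sketch of that kind of margin check, written in JavaScript for easier comparison with PDF.js. It assumes only what is stated above (the margin is derived from the character's width and height rather than from a font's space width); the `wordMargin` factor, the field names and the sample coordinates are hypothetical and not copied from pdfminer.

```javascript
// Insert a space when the previous character's right edge lies further left
// than the next character's left edge minus a margin, where the margin
// scales with the larger of the character's width and height.
function joinWithWordMargin(chars, wordMargin) {
  let text = "";
  let previousRight = null;
  for (const c of chars) {
    const margin = wordMargin * Math.max(c.width, c.height);
    if (previousRight !== null && previousRight < c.x0 - margin) {
      text += " ";
    }
    text += c.char;
    previousRight = c.x0 + c.width;
  }
  return text;
}

// Made-up layout: tight spacing inside each word, a wide gap between words.
const chars = [
  { char: "D", x0: 0, width: 8, height: 10 },
  { char: "E", x0: 8.5, width: 7, height: 10 },
  { char: "E", x0: 20, width: 7, height: 10 },
  { char: "N", x0: 27.2, width: 8, height: 10 },
];
console.log(joinWithWordMargin(chars, 0.1));
```

Since the margin scales with the glyph itself, this variant does not depend on the font reporting a sensible space width, but it is still a heuristic.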

Still not sure why pdf.js has such a different approach using font.spaceWidth and a seemingly random value.

Snuffleupagus commented 6 years ago

> The SPACE_FACTOR was changed in this commit 109d676. The test pdf that was the origin for this change is unrecoverable.

A reduced test-case should have been included in the original PR, but was added in #5806, and is now part of our test-suite; please see https://github.com/mozilla/pdf.js/blob/master/test/pdfs/issue5734.pdf.

Also, if you want to try and work on improving text-selection, I'd highly recommend careful reading of https://github.com/mozilla/pdf.js/wiki/Contributing and in particular section https://github.com/mozilla/pdf.js/wiki/Contributing#4-run-lint-and-testing. There it's described how to generate reference images and run the tests locally, which is necessary in order to validate your changes when working on code residing in the /src folder.

Woodgnome commented 6 years ago

Original sample document is now available at https://drive.google.com/open?id=1ne84gRIMnss30UeXA475A84AY5pmYJyY (I also updated the link in the original post).

> In all the presented cases the issue is with capitalized words.

Also an issue with the non-capitalized body text in this PDF.

oestape commented 6 years ago

I just created issue #9998, which I now see is probably a duplicate of this one (sorry about that).

In my example the spaces are inserted after all 'i', 'j' and 'l' characters. Copying and pasting in Firefox (which uses pdf.js) adds the extra spaces, but Chrome and Acrobat Reader work fine.

My example has been edited with Inkscape, and all the text is in a single text box.

kirkegaard commented 6 years ago

I'm having this issue as well. How on earth do I override the SPACE_FACTOR variable?

benjaminwood commented 4 years ago

The only way of 'overriding' SPACE_FACTOR is to change the source code. There is no API/configuration option for it.

Assuming you're using the pdfjs-dist npm package, you'd have to vendor node_modules/pdfjs-dist/build/pdf.worker.js and make your change to that file (I don't recommend it).

Also, I thought I had this problem, but it turned out I did not. I'm using https://github.com/mozilla/pdf.js/blob/master/web/pdf_viewer.js. Adding disableCombineTextItems: true in the two places where normalizeWhitespace: true is used fixed my problem with extra/unwanted spaces.

It'd be great if somebody with more context/experience with pdf.js could chime in on this issue. Perhaps disableCombineTextItems could be a setting when initializing pdf_viewer.

I don't mean to hijack this issue. I only comment on my experience because I suspect others here may think they have the SPACE_FACTOR problem when in reality they do not. Just sharing what I wish I had found when researching my (similar) problem. :smile:

tfwright commented 4 years ago

I'm having the opposite issue to the one described here with the PDF below (no white space at all). Are there any plans to expose the SPACE_FACTOR setting so that it can at least be adjusted on a per-PDF basis to produce better results?

Foucault-What-is-enlightenment.pdf