mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Certain PDF displays with Scrambled Text in PDF.js #11131

Closed rpschool88 closed 4 years ago

rpschool88 commented 5 years ago

Attach (recommended) or Link to PDF file here:

IOICityHotel-SCB 4198-Bank statement-July17 6.pdf

Configuration:

Steps to reproduce the problem:

  1. Open attached PDF in PDF.js (drag into empty tab of latest Firefox)
  2. Visually inspect PDF and observe dev console

What is the expected behavior? (add screenshot)

The PDF displays correctly. See the screenshot of the PDF opened in Adobe Acrobat (2019-09-09_1005).

What went wrong? (add screenshot)

PDF.js displays a blank pane. See the screenshot of the embedded viewer (2019-09-09_1007).

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

rpschool88 commented 5 years ago

Original description included screenshots of a different PDF issue. Updated to include screenshots of issue described.

Snuffleupagus commented 5 years ago

> PDF.js version: 1.7.225

That version is years out of date and thus no longer supported. Please find the latest releases at https://github.com/mozilla/pdf.js/releases

rpschool88 commented 5 years ago

Just tried dragging the attached PDF into a blank tab of the latest version of Firefox and reproduced the issue. I'm assuming that means it's present in the latest or relatively recent version of PDF.js?

janpe2 commented 5 years ago

The embedded TrueType font is broken. The 'loca' table violates the OpenType Specification, which says: "The offsets must be in ascending order with loca[n] <= loca[n+1]."

'loca' Table - Index To Location Table
--------------------------------------
Size = 288 bytes, 72 entries
    Idx   0 -> glyfOff 0x00004188* No contours *
    Idx   1 -> glyfOff 0x00004188* No contours *
    Idx   2 -> glyfOff 0x00004188* No contours *
    Idx   3 -> glyfOff 0x00004188
    Idx   4 -> glyfOff 0x000029B2
    Idx   5 -> glyfOff 0x0000088C
    Idx   6 -> glyfOff 0x00000A1E
    Idx   7 -> glyfOff 0x0000075E
    Idx   8 -> glyfOff 0x00001438
    Idx   9 -> glyfOff 0x00000952
    ...
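
The ordering rule quoted above can be checked mechanically. The following is a minimal sketch, not PDF.js code: `findLocaViolations` is a hypothetical helper name, and the array holds only the ten offsets visible in the dump (the remaining entries are elided there and are not guessed here).

```javascript
// Offsets for glyph indices 0..9, copied from the dump above.
const locaOffsets = [
  0x4188, 0x4188, 0x4188, 0x4188, 0x29b2,
  0x088c, 0x0a1e, 0x075e, 0x1438, 0x0952,
];

// Hypothetical helper: returns every index n where loca[n] > loca[n + 1],
// i.e. where the ascending-order rule from the specification is violated.
function findLocaViolations(offsets) {
  const violations = [];
  for (let n = 0; n + 1 < offsets.length; n++) {
    if (offsets[n] > offsets[n + 1]) {
      violations.push(n);
    }
  }
  return violations;
}

const violations = findLocaViolations(locaOffsets);
console.log(violations); // [ 3, 4, 6, 8 ]
```

Four of the nine adjacent pairs shown are already out of order, so this looks like more than a single stray entry.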

PDF.js can fix a few errors in 'loca': https://github.com/mozilla/pdf.js/blob/4fa60f006bdc692da87173b205039e05557456ed/src/core/fonts.js#L1792-L1794

In this font the unordered offsets get fixed but now most glyphs seem to have a length of zero ("No contours"), so the glyphs are rendered invisible.

'loca' Table - Index To Location Table
--------------------------------------
Size = 292 bytes, 73 entries
    Idx   0 -> glyfOff 0x00000000* No contours *
    Idx   1 -> glyfOff 0x00000000* No contours *
    Idx   2 -> glyfOff 0x00000000* No contours *
    Idx   3 -> glyfOff 0x00000000* No contours *
    Idx   4 -> glyfOff 0x00000000* No contours *
    Idx   5 -> glyfOff 0x00000000
    Idx   6 -> glyfOff 0x000000C8* No contours *
    Idx   7 -> glyfOff 0x000000C8
    Idx   8 -> glyfOff 0x000001F8* No contours *
    Idx   9 -> glyfOff 0x000001F8* No contours *
    Idx  10 -> glyfOff 0x000001F8
    Idx  11 -> glyfOff 0x000002C4* No contours *
    ...
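
To make the "length of zero" observation concrete: the byte length of glyph n is `loca[n + 1] - loca[n]`, so equal neighbouring offsets mean an empty glyph. A small sketch under that definition, using only the twelve offsets visible in the repaired dump (`glyphLengths` is an illustrative name, not a PDF.js function):

```javascript
// Offsets for glyph indices 0..11, copied from the repaired dump above.
const repairedOffsets = [
  0x000, 0x000, 0x000, 0x000, 0x000, 0x000,
  0x0c8, 0x0c8, 0x1f8, 0x1f8, 0x1f8, 0x2c4,
];

// Length of glyph n is the distance to the next offset. The last listed
// entry has no visible successor, so only indices 0..10 get a length.
function glyphLengths(offsets) {
  const lengths = [];
  for (let n = 0; n + 1 < offsets.length; n++) {
    lengths.push(offsets[n + 1] - offsets[n]);
  }
  return lengths;
}

const lengths = glyphLengths(repairedOffsets);
// Collect the indices of all zero-length glyphs.
const emptyGlyphs = lengths
  .map((len, n) => (len === 0 ? n : -1))
  .filter((n) => n >= 0);

console.log(lengths);     // [ 0, 0, 0, 0, 0, 200, 0, 304, 0, 0, 204 ]
console.log(emptyGlyphs); // [ 0, 1, 2, 3, 4, 6, 8, 9 ]
```

The empty indices match the entries flagged "No contours" in the dump for indices 0 to 10.
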

Snuffleupagus commented 5 years ago

@janpe2 As always, excellent analysis :-)

> In this font the unordered offsets get fixed but now most glyphs seem to have a length of zero ("No contours"), so the glyphs are rendered invisible.

That should be explained by this particular code: https://github.com/mozilla/pdf.js/blob/b86bdefcd91f1fb5026a8429e75a512f8d499209/src/core/fonts.js#L1837-L1841, which simply replaces out-of-order glyphs with empty ones to prevent the sanitizer from rejecting the entire font. To fix this, it'd probably be necessary to implement a more "complete" repair of incorrectly ordered 'loca' tables (how difficult that would be, I have no idea).
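
For readers unfamiliar with the approach, here is a deliberately simplified illustration of "replace out-of-order glyphs with empty ones". It is not the code linked above: `rebuildWithEmptyGlyphs`, its validity rule, and the toy data are all assumptions made for the example.

```javascript
// Simplified illustration only; NOT the PDF.js implementation.
// `offsets` has numGlyphs + 1 entries; `glyf` is the raw glyph data.
function rebuildWithEmptyGlyphs(offsets, glyf) {
  const newOffsets = [0];
  const chunks = [];
  let writeOffset = 0;
  for (let n = 0; n + 1 < offsets.length; n++) {
    const start = offsets[n];
    const end = offsets[n + 1];
    // Assumed rule: keep a glyph only if its byte range is ordered
    // and lies inside the glyph data.
    if (start < end && end <= glyf.length) {
      chunks.push(glyf.subarray(start, end));
      writeOffset += end - start;
    }
    // Dropped glyphs contribute no bytes, so they end up with length zero
    // and the new offsets are ascending by construction.
    newOffsets.push(writeOffset);
  }
  // Concatenate the kept glyphs into a fresh glyph table.
  const newGlyf = new Uint8Array(writeOffset);
  let pos = 0;
  for (const chunk of chunks) {
    newGlyf.set(chunk, pos);
    pos += chunk.length;
  }
  return { offsets: newOffsets, glyf: newGlyf };
}

// Toy data: glyph 1 has start (6) > end (2), so it is emitted as empty.
const toyGlyf = Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8]);
const result = rebuildWithEmptyGlyphs([0, 6, 2, 8], toyGlyf);
console.log(result.offsets); // [ 0, 6, 6, 12 ]
```

The resulting table satisfies the ordering rule, but the outline of every dropped glyph is gone, which is consistent with the invisible text described in this issue.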

FleppS commented 2 years ago

Hi, I have the same problem with certain files, but only if I display them in a canvas inside an iframe. If I display the PDF in a canvas without the surrounding iframe, the text is not scrambled. Is there a workaround?

I use the

Thank you,
Philippe