mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

wrong character rendered #11451

Closed hongnk closed 4 years ago

hongnk commented 4 years ago

This file contains a character (symbol) that pdf.js renders incorrectly, while most PDF readers render it correctly: error2.pdf

Configuration:

Web browser and its version: Chrome, latest version
Operating system and its version: Windows 10, latest version
PDF.js version: online
Is a browser extension: no

Steps to reproduce the problem:

  1. Open the above file at https://mozilla.github.io/pdf.js/web/viewer.html

Correct character: ò. pdf.js renders it as: Ú. (Both appear as Wingdings symbols in the screenshots below; the characters given here are what you get when copying the symbol and pasting it into a text editor.)

What is the expected behavior? (add screenshot) image

What went wrong? (add screenshot) image

Snuffleupagus commented 4 years ago

Wingdings is a non-standard font, and in order for such fonts to render (and copy) correctly they need to be embedded in the PDF file.


For reference: When opening an issue, please make sure that you provide all of the information requested in https://github.com/mozilla/pdf.js/blob/master/.github/ISSUE_TEMPLATE.md

hongnk commented 4 years ago

@Snuffleupagus Thanks, I'm aware of the Wingdings font issue. But what I found here is that the character itself is wrong; it is not a font display problem.

In another test, I opened the file in other applications where the Wingdings font is not supported: the character is shown as ò [correct character] in Drawboard PDF, but as Ú [incorrect] in Firefox (which is based on pdf.js).

[Updated with more screenshots] Drawboard PDF/Chrome browser/Adobe Reader: rendered as ò image

Firefox/pdf.js: rendered as Ú image

THausherr commented 4 years ago

How about first embedding the font, and then looking again whether text extraction works?

hongnk commented 4 years ago

@THausherr Unfortunately I am unable to get the source file to try that (this file is the original with all other content deleted, leaving only the symbol). I tried creating a new file with the same Wingdings symbol in MS Word and exporting it to PDF, but pdf.js displays that symbol correctly even without font embedding. So I still wonder why other PDF viewers can read the symbol correctly in this particular file, but pdf.js cannot.

hongnk commented 4 years ago

I found the cause in file pdf.js-dist/lib/core/evaluator.js line 1577:

The font 'Wingdings-Regular' is a symbolic font, and it is assigned encoding.MacRomanEncoding by default, while the correct encoding should be WinAnsiEncoding.

So character code 242, which is ò in WinAnsiEncoding, is read as Ú (U with acute accent, Unicode code point 218) under MacRomanEncoding.
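The mismatch described here can be reproduced outside pdf.js. Below is a minimal sketch in Python, assuming that the built-in codecs `cp1252` and `mac_roman` are close enough stand-ins for the PDF WinAnsiEncoding and MacRomanEncoding tables at this byte. It is not pdf.js code, and `decode_char_code` is a hypothetical helper used only for illustration.

```python
# Illustration only: Python's built-in codecs stand in for the PDF encodings.
# "cp1252" approximates WinAnsiEncoding; "mac_roman" approximates MacRomanEncoding.
CHAR_CODE = 242  # the single-byte character code reported in this issue


def decode_char_code(char_code: int, codec: str) -> str:
    """Interpret one single-byte character code under the given codec."""
    return bytes([char_code]).decode(codec)


as_win_ansi = decode_char_code(CHAR_CODE, "cp1252")
as_mac_roman = decode_char_code(CHAR_CODE, "mac_roman")

# Print each interpretation with its Unicode code point in decimal.
print("WinAnsi-like: ", as_win_ansi, ord(as_win_ansi))
print("MacRoman-like:", as_mac_roman, ord(as_mac_roman))
```

The same byte yields ò (code point 242) under the WinAnsi-like codec and Ú (code point 218) under the MacRoman-like codec, which matches the two characters seen when copying text from the different viewers.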

Any chance this will get fixed?

hongnk commented 4 years ago

Thanks for your update. I tested it and noticed that the character has changed, as it now uses the ZapfDingbats font, although it still shows an incorrect symbol.

I'm not sure why the decision was made to map Wingdings to ZapfDingbats. Why not just leave it with its native encoding, so that it either appears as garbage (raw ASCII codes) or displays the correct symbol if the font exists (as on Windows)? I believe that is how native PDF viewers (such as Acrobat Reader) behave, and it is satisfactory for users.