mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

PDF font is rendered incorrectly #17910

Closed svd-sea closed 6 months ago

svd-sea commented 6 months ago

Link to PDF file: font test.pdf

Link to used font: https://github.com/satbyy/go-noto-universal

Configuration:

  • Web browser: Firefox 124.0.2 (64-bit)
  • Operating system: Windows 11 (10.0.22631)
  • PDF.js version: 4.1.249

Steps to reproduce the problem:

  1. Open the PDF document with Firefox.

What is the expected behavior?

The file looks like this in Adobe Acrobat Pro and other viewers:

[screenshot]

What went wrong?

The file looks like this with pdf.js:

[screenshot]

Additional info:

The problem is probably related to the font used. The file uses the font GoNotoCurrent as a TrueType (CID) font.

[screenshot]

Snuffleupagus commented 6 months ago

This is unfortunately a bug in the PDF document itself, since it uses a non-standard font without embedding it.[1]

Please note that you must embed all non-standard fonts in order for any PDF document to be valid, hence the bug is actually in your PDF document and not in the PDF.js library.


[1] If you don't have the mentioned font installed locally, not even Adobe Reader is able to render that document successfully.

svd-sea commented 6 months ago

You are correct that if the font is not installed locally, the document will either not be displayed correctly or will be displayed with a fallback font. However, since the font is available in the viewer's environment, a conforming reader should be able to display it correctly.

Embedding a font in a PDF document is an optional feature. Non-standard fonts do not need to be embedded in a PDF document to be used. As the PDF specification states: "A font shall be represented in PDF as a dictionary specifying the type of font, its PostScript name, its encoding, and information that can be used to provide a substitute when the font program is not available. Optionally, the font program may be embedded as a stream object in the PDF file." https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf (page 253)

I also tested the document with Acrobat's Preflight, and no errors were found: [screenshot]

As every other viewer I tested can display the PDF document without problems, this is likely not a problem or bug in the document.

Example with Chrome's built-in PDF viewer (same system, with the font installed): [screenshot]

calixteman commented 6 months ago

Sorry, but I installed GoNotoCurrent-Regular.ttf on Windows 11, and neither Chrome nor Edge is able to render the PDF correctly, whereas Acrobat is. The font family is Go Noto Current-Regular, while the PDF refers to it as GoNotoCurrent-Regular (no spaces), and usually Regular isn't part of the family name but of a sub-family. It's also worth noting that the encoding is Identity, which doesn't help to find a substitute when the font isn't there.

christianAppl commented 6 months ago

I will ignore the Go Noto font entirely in the following, as the specific font program (and what exactly it is) is not really of interest.

The demonstrated scenario: A CID Type2 font with encoding "Identity-H" is contained in a Type 0 composite font. It is not embedded. This is conforming to the PDF specification, no matter if it is convenient or even a good idea or not. (By the way: Probably it is not a good idea - which is my personal opinion and does not contribute to resolving the question.)
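The scenario described above can be expressed as a small check. This is only a sketch: the dictionaries are modeled as plain JavaScript objects, and the helper names `hasEmbeddedFontProgram` and `describeCompositeFont` are made up for illustration and are not part of pdf.js. It relies on the fact that an embedded font program appears in the font descriptor under `FontFile`, `FontFile2`, or `FontFile3`.

```javascript
// Keys under which a FontDescriptor may carry an embedded font program.
const EMBEDDED_FONT_KEYS = ["FontFile", "FontFile2", "FontFile3"];

// Hypothetical helper: true if the descriptor references an embedded font program.
function hasEmbeddedFontProgram(fontDescriptor) {
  return EMBEDDED_FONT_KEYS.some((key) => key in fontDescriptor);
}

// Hypothetical helper: summarize a Type 0 font and its single descendant CIDFont.
function describeCompositeFont(type0Font) {
  const [descendant] = type0Font.DescendantFonts;
  return {
    baseFont: type0Font.BaseFont,
    encoding: type0Font.Encoding,
    descendantSubtype: descendant.Subtype,
    embedded: hasEmbeddedFontProgram(descendant.FontDescriptor),
  };
}

// Plain-object stand-in for the scenario: Type 0 font, Identity-H,
// CIDFontType2 descendant, no font program in the descriptor.
const exampleFont = {
  Type: "Font",
  Subtype: "Type0",
  BaseFont: "SomeFontName",
  Encoding: "Identity-H",
  DescendantFonts: [
    {
      Type: "Font",
      Subtype: "CIDFontType2",
      BaseFont: "SomeFontName",
      FontDescriptor: { Type: "FontDescriptor", FontName: "SomeFontName", Flags: 4 },
    },
  ],
};

const exampleInfo = describeCompositeFont(exampleFont);
console.log(exampleInfo);
```

For this input the `embedded` field is `false`, which is exactly the case under discussion: the reader has a font name and metrics, but no glyph data.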

The actual questions:

The central question is:

If that is not the case:

Buggy document:

This is unfortunately a bug in the PDF document itself, since it uses a non-standard font without embedding it.[1]

Please note that you must embed all non-standard fonts in order for any PDF document to be valid, hence the bug is actually in your PDF document and not in the PDF.js library.

The specification and Adobe preflight disagree with this evaluation. The document is conforming to the PDF specification. The document is not "bugged". Possibly pdf.js does not support a feature used by this document, but the given answer is misleading and unfounded.

I have two additional issues with the given answer:

  1. What is your definition of a "standard font"? The 14 standard PDF fonts?

    • Which fonts are okay to not be embedded according to pdf.js?
    • Which fonts are fine to be handled only by reference according to pdf.js?
    • Which fonts must necessarily be embedded according to pdf.js?
  2. I would expect that a conforming reader can locate a font in the local environment, provided such fonts exist and are referenced correctly. I would be amazed if you told me otherwise, as referencing fonts instead of embedding them is not exactly an "exotic" feature for PDF documents. Maybe pdf.js does not support this specific font, font type, or scenario for some reason, but a complete inability to handle referenced fonts would be rather disappointing.

Should composite fonts be the issue here, or should the font not be supported for another reason, then a clarification on the limitations of pdf.js and a clarification of the given answer would be most welcome.

The actual scenario: Does pdf.js support a font program that is referenced like this:

7 0 obj
<<
/Type /Font
/BaseFont /SomeFontName
/Subtype /Type0
/Encoding /Identity-H
/DescendantFonts [8 0 R]
>>
endobj
8 0 obj
<<
/Type /Font
/Subtype /CIDFontType2
/BaseFont /SomeFontName
/CIDSystemInfo 9 0 R
/FontDescriptor 10 0 R
...
>>
endobj
10 0 obj
<<
/Type /FontDescriptor
/FontName /SomeFontName
/Flags 4
/FontWeight 400.0
/ItalicAngle 0.0
/FontBBox [-1567.0 -995.0 7175.0 1645.0]
/Ascent 1069.0
/Descent -293.0
/CapHeight 714.0
/XHeight 536.0
/StemV 1136.46
>>
endobj

calixteman commented 6 months ago

First of all, the pdf.js environment isn't the OS itself but the web browser where it's running. From the specs:

"If a PDF file refers to font programs that are not embedded, the results depend on the availability of fonts in the conforming reader’s environment."

As far as I know, nothing in the spec says that a conforming reader has access to the system fonts.

So even if the PDF is syntactically correct, there is a chance that it won't be rendered correctly.

That said, when the font is not embedded we try to guess what a good substitute could be. But we don't have access to the font file itself, which means that even if we know the name of the font we can't render text with a "bad" encoding: we need a way to map glyph IDs to Unicode, because the canvas API only allows us to draw Unicode strings. We also try to guess the font family name. Here the base font name is GoNotoCurrent-Regular, which must be turned into Go Noto Current-Regular Regular or GoNotoCurrent-Regular-Regular to be loadable. For now there is no code for such a case (I mean adding some spaces and/or appending Regular), but it's something we can add (https://github.com/mozilla/pdf.js/blob/master/src/core/font_substitutions.js).
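The name derivation mentioned above (adding spaces and/or appending Regular) could look roughly like the following. This is a hypothetical sketch, not code from font_substitutions.js; the function name `candidateFontNames` and the choice to split on lowercase-to-uppercase boundaries are assumptions made for illustration.

```javascript
// Hypothetical helper, not part of pdf.js: derive names a substitution
// lookup could try for a base font name taken from the PDF.
function candidateFontNames(baseFontName) {
  // Insert a space wherever a lowercase letter is followed by an uppercase one,
  // e.g. "GoNotoCurrent" -> "Go Noto Current". Hyphens are left untouched.
  const spaced = baseFontName.replace(/([a-z])([A-Z])/g, "$1 $2");
  return [
    baseFontName, // the name exactly as written in the PDF
    `${spaced} Regular`, // spaces added, sub-family appended
    `${baseFontName}-Regular`, // sub-family appended with a hyphen
  ];
}

console.log(candidateFontNames("GoNotoCurrent-Regular"));
```

For GoNotoCurrent-Regular this yields the two derived forms named in the comment above, in addition to the original name. Whether any of them can actually be loaded still depends on what the browser exposes.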

svd-sea commented 6 months ago

Thank you for the clarification. So, as I understand it, the main problem is that, due to the browser's sandbox, we can't access font files directly and reference the font by the name given in the PDF document. Is it possible to provide the font to pdf.js directly via a parameter?

I don't think adding specific handling for the Go Noto font is necessary, as this would only help for this one font. (Besides, a surrogate matching the font's Unicode range probably could not be provided anyway.)

calixteman commented 6 months ago

The best thing is to embed the font. You aren't obliged to embed the whole font file: depending on the PDF generator you use, it should be possible to generate a font subset containing only the glyphs that are used. For example, if you print an HTML page in Firefox and save it as PDF, the font is a subset: plop.html.pdf. The PDF here is only 5 KB, whereas the GoNotoCurrent font is around 14 MB.

If you really don't want to embed the font, that's fine, but there are two conditions:

christianAppl commented 6 months ago

Hello and thank you for your answers.

We will apply several of your suggestions. Yes, we are able to embed font subsets easily, and obviously that would be the best solution and would cause the fewest issues for readers. However, whether fonts are embedded or not is not our choice:

  1. A user may not want to embed specific fonts, or may not want to embed fonts at all (for any good or bad reason she likes).
  2. A preexisting PDF that was not created by us may not contain embedded fonts.

Hence the requirement to handle fonts by reference.

However we can and will do more to assist pdf.js in handling this (for the PDFs, that we create):

Concerning "handling fonts existing in the environment": You are right that you are not required (and could not be expected) to handle fonts from the client's OS. I understand that the environment is limited to what the browser gives you access to. Thank you for clarifying those limitations.

As an aside: Should Adobe Reader neither find a font on the local OS nor find a Unicode mapping (as is easily reproducible with the given scenario), it will display an error and will not even attempt to find a surrogate, since there is obviously no way whatsoever to guess GIDs and Unicode values for some CID without a mapping. (Probably pdf.js should also display an error/warning when running out of options? You actually have no chance to succeed in the given scenario.)

Concerning Go Noto: Rather FYI and as a total aside, as Go Noto really is not central to any of this. The dictionary and font name are correct as given. Multiple variants of that font exist, and you picked one that was not used during the creation of the PDF. (I had a look at the font tables and theoretically all is fine as is. However, all that does not matter much: even if that is the case, it does not help pdf.js whatsoever as long as it has no access to the actual font files.) As long as pdf.js is not provided with the means to map CID->Unicode, it will necessarily fail. So that is what I will provide whenever possible.
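To illustrate why the CID->Unicode mapping is decisive, here is a minimal sketch. With Identity-H, every two bytes of a shown string form one big-endian character code that is used directly as the CID. The mapping table below is invented for the example (in a real file it could come from a ToUnicode CMap), and the function names are hypothetical, not pdf.js APIs.

```javascript
// Split a string shown with an Identity-H font into CIDs:
// each pair of bytes is one big-endian 16-bit code, used as the CID unchanged.
function decodeIdentityH(bytes) {
  if (bytes.length % 2 !== 0) {
    throw new RangeError("Identity-H expects two bytes per character code");
  }
  const cids = [];
  for (let i = 0; i < bytes.length; i += 2) {
    cids.push((bytes[i] << 8) | bytes[i + 1]);
  }
  return cids;
}

// Turn CIDs into text using a CID -> Unicode code point table.
// Without an entry there is nothing to guess from, so fail loudly.
function cidsToText(cids, toUnicode) {
  return cids
    .map((cid) => {
      const codePoint = toUnicode.get(cid);
      if (codePoint === undefined) {
        throw new Error(`No Unicode mapping for CID ${cid}`);
      }
      return String.fromCodePoint(codePoint);
    })
    .join("");
}

// Invented mapping data, for illustration only.
const sampleToUnicode = new Map([
  [0x0024, 0x41], // CID 36 -> "A"
  [0x0025, 0x42], // CID 37 -> "B"
]);

const sampleCids = decodeIdentityH(Uint8Array.of(0x00, 0x24, 0x00, 0x25));
const sampleText = cidsToText(sampleCids, sampleToUnicode);
console.log(sampleCids, sampleText);
```

With the table present, the two codes decode to readable text; remove an entry and the conversion throws, which mirrors the situation of a reader that has neither the font file nor a mapping.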

Most importantly: Thank you for your help.