mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Japanese text in vertical writing mode rendered with glyphs for horizontal writing #12350

Open mandel59 opened 4 years ago

mandel59 commented 4 years ago

Attach (recommended) or Link to PDF file here:

https://www.mofa.go.jp/mofaj/gaiko/treaty/pdfs/treaty159_4a.pdf (via https://www.mofa.go.jp/mofaj/gaiko/treaty/treaty159_4.html)

Configuration:

Steps to reproduce the problem: open the PDF and jump to page 2.

What is the expected behavior? (add screenshot)

image

What went wrong? (add screenshot)

image

Glyphs , and are in horizontal form. (cf. Requirements for Japanese Text Layout, 3.1.1 Differences in Vertical and Horizontal Composition in Use of Punctuation Marks)

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): n/a

THausherr commented 4 years ago

It's also in page 1. The font is named "Ryumin-regular-Identity-H" and the encoding is "Identity-H". I suspect that Adobe itself notices that this is vertical writing and then displays that glyph differently.

1 g
/GS2 gs
0 841 m
0 841 l
f
q
  1 i
  0 841 595 -841 re
  0 841 m
  W
  n
  0.06 840.96 594.96 -840.96 re
  W
  n
  BT
    /G1 1 Tf
    20 0 0 20 287.64 681.67 Tm
    0 0 0 1 k
    0 Tc
    0 Tw
    (\003\261) Tj
    0 -1 TD
    (\003\240) Tj
    T*
    (\003\314) Tj
    T*
    (\036\323) Tj      % this is 0x1ED3 = 7891d
    T*
    (\015\\) Tj
    T*
    (\010Q) Tj
    T*
    (\003t) Tj
    T*
    (\006\024) Tj
    T*
    (\003b) Tj
    T*
    (\003\224) Tj
    T*
    (\011\332) Tj
    T*
    (\016\377) Tj
  ET
Q
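
The two-byte codes in the Tj strings above can be checked with a short script. This is a minimal sketch, not code from pdf.js or PDFBox: the function names decode_literal_string and cids_identity_h are made up for illustration, and it assumes the string bodies are plain ASCII with no line-continuation escapes. Under Identity-H, each pair of bytes is read as a big-endian number that is used directly as the CID.

```python
import re
from typing import List

# Single-character escapes defined for PDF literal strings.
_SIMPLE_ESCAPES = {
    "n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
    "(": 0x28, ")": 0x29, "\\": 0x5C,
}


def decode_literal_string(body: str) -> bytes:
    """Decode the text between the parentheses of a PDF literal string.

    Hypothetical helper: handles octal escapes (1 to 3 digits) and the
    single-character escapes; line-continuation escapes are not handled.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))  # assumes ASCII input
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            out.append(int(octal.group(), 8) & 0xFF)
            i += octal.end()
        else:
            # Unknown escapes: the backslash is dropped, the character kept.
            out.append(_SIMPLE_ESCAPES.get(body[i], ord(body[i])))
            i += 1
    return bytes(out)


def cids_identity_h(raw: bytes) -> List[int]:
    """Split raw string bytes into 2-byte big-endian codes (code == CID)."""
    if len(raw) % 2:
        raise ValueError("Identity-H strings must have an even byte length")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# The string flagged in the stream above:
print(cids_identity_h(decode_literal_string(r"\036\323")))  # [7891]
```

Strings that mix octal escapes with literal characters, such as (\010Q) or (\015\\), decode the same way: the literal character supplies the low byte of the code.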

mandel59 commented 4 years ago

Japanese fonts are often CID-keyed and based on Adobe-Japan1. CID+7891 in Adobe-Japan1 is the vertical form of the prolonged sound mark.

https://raw.githubusercontent.com/adobe-type-tools/Adobe-Japan1/master/Adobe-Japan1-7.pdf image

mandel59 commented 4 years ago

Here is the font information of the document.

Type: Type 1 (CID)
Encoding: Identity-H
Actual Font: KozMinPr6N-Regular
Actual Font Type: Type 1 (CID)

image

mandel59 commented 4 years ago

I mean, Adobe Acrobat Reader should not need to care whether the text is in vertical writing mode or not. A Japanese document often mixes horizontal and vertical parts (tate-chu-yoko, for example), so a Japanese font carries both forms of these glyphs. Ryumin (a well-known Japanese serif font) simply falls back to KozMinPr6N, and the glyph is selected by CID.

THausherr commented 4 years ago

Thank you both. I found that font (A-OTF-RyuminPro-Regular.otf), but PDF.js still has the problem, even after restarting Firefox. (However, PDFBox now renders correctly, which supports your argument.)

brendandahl commented 4 years ago

Looks like we should set writing-mode: vertical-rl; on the canvas when the font file isn't in the PDF and we're drawing vertical text. This seems to work correctly in Firefox, but in Chrome all glyphs are rotated.

Firefox:

image

Chrome:

image

yuis-ice commented 2 years ago

I have a similar problem. I'm reading a research paper that is mostly laid out horizontally but occasionally has vertical text next to a graph. When I search in, say, Foxit, the vertical text is recognized, but the pdf.js viewer only finds matches in horizontal text. For instance, a search that finds, say, 60 matches in Foxit finds 55 in pdf.js.

Put simply, this is a serious problem because I cannot afford to miss a match when I search. Is this the same issue as the OP's? If it isn't, let me know and I'll open a new issue. Thanks.

aehlke commented 1 year ago

@yuis-ice #13080 shows this might have been closed as a dupe before your question, so you should file a new ticket.