internetarchive / bookreader

The Internet Archive BookReader
https://openlibrary.org/dev/docs/bookreader
GNU Affero General Public License v3.0

Text layer rendering problem in Hanifi Rohingya text #1276

Open bgo-eiu opened 1 year ago

bgo-eiu commented 1 year ago

See: https://archive.org/details/20231022_20231022_2050/page/n3/mode/2up

The text in the right-hand columns mostly does not render, though it does come through here and there. This could be related to the characters used: it is Rohingya in the Hanifi script, a right-to-left writing system. However, the characters do appear in places, and this text may be included as images; I am unsure.

cdrini commented 1 year ago

Hi @bgo-eiu, thank you for the report!

One of our engineers notes:

This kind of problem is often caused by the PDF using an uncommon font that isn’t present on the system where we do the conversion. It can be avoided by embedding the font in the PDF when it’s built.

It looks like you are the uploader of this document. Are you by any chance able to embed the font in the PDF and re-upload it?

hbromley commented 1 year ago

Hi, @bgo-eiu, I'm the engineer who was quoted above. Here's a font analysis of the PDF:

$ pdffonts E-N-G-L-I-S-H-T-O-R-O-H-I-N-G-Y-A-D-I-C-T-I-O-N-A-R-Y-N-compressed.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
AYUGFB+Calibri                       TrueType          WinAnsi          yes yes no   13075  0
Rohingya_{n}_muzhari                 TrueType          WinAnsi          no  no  no   13079  0
ABGHHM+Calibri,Bold                  TrueType          WinAnsi          yes yes no   13082  0
AAAAAJ+Arial,Bold                    CID TrueType      Identity-H       yes yes yes  13086  0
Arial,Bold                           TrueType          WinAnsi          no  no  no   13094  0
AAAAAJ+Calibri,Italic                TrueType          WinAnsi          yes yes no   13097  0
AURWPJ+Arial                         CID TrueType      Identity-H       yes yes yes  12557  0
AFMHMR+Traditional Arabic            CID TrueType      Identity-H       yes yes yes      2  0
AAAAAJ+Traditional Arabic            TrueType          WinAnsi          yes yes no       9  0
AFMHMR+Traditional Arabic            CID TrueType      Identity-H       yes yes yes     12  0
AAAAAJ+Amiri                         CID TrueType      Identity-H       yes yes yes     17  0
AAAAAJ+Amiri                         TrueType          WinAnsi          yes yes no      23  0
A-Rohingya_{n}_muzhari               TrueType          WinAnsi          no  no  no   12551  0
Arial                                TrueType          WinAnsi          no  no  no   12554  0
AHTAUU+Calibri                       CID TrueType      Identity-H       yes yes yes  12833  0
Arial                                TrueType          WinAnsi          no  no  no    8106  0
AURWPJ+Arial                         CID TrueType      Identity-H       yes yes yes  12968  0

Note the rows with "no" in the "emb" column: those fonts are not embedded in the PDF. The same rows also have "no" in the "uni" column, meaning the PDF has no mapping from those fonts' characters to Unicode. Such a mapping would let us render the characters even without the non-embedded font installed locally.
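For anyone who wants to run the same check on their own PDF before uploading, here is a minimal sketch in Python that scans saved `pdffonts` output for fonts that are neither embedded nor mapped to Unicode. The function name `find_unembedded_unmapped` and the `SAMPLE` text (a few rows copied from the listing above) are illustrative only. The sketch assumes the column layout shown above, with the font name separated from the type column by at least two spaces, and it is not part of the conversion pipeline.

```python
import re

# A few rows copied from the pdffonts listing above (illustrative sample).
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
AYUGFB+Calibri                       TrueType          WinAnsi          yes yes no   13075  0
Rohingya_{n}_muzhari                 TrueType          WinAnsi          no  no  no   13079  0
AAAAAJ+Arial,Bold                    CID TrueType      Identity-H       yes yes yes  13086  0
AFMHMR+Traditional Arabic            CID TrueType      Identity-H       yes yes yes      2  0
Arial                                TrueType          WinAnsi          no  no  no   12554  0
Arial                                TrueType          WinAnsi          no  no  no    8106  0
"""


def find_unembedded_unmapped(pdffonts_output: str) -> list[str]:
    """Return font names whose 'emb' and 'uni' columns are both 'no'."""
    flagged: list[str] = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # A data row ends with: emb sub uni object-number generation.
        if len(fields) < 7:
            continue
        emb, uni = fields[-5], fields[-3]
        # Skip the header and separator rows, which have no yes/no flags.
        if emb not in ("yes", "no") or uni not in ("yes", "no"):
            continue
        if emb == "no" and uni == "no":
            # Assumption: the name column is padded with 2+ spaces, so the
            # first chunk is the font name (names may contain single spaces).
            name = re.split(r"\s{2,}", line.strip())[0]
            if name not in flagged:
                flagged.append(name)
    return flagged


print(find_unembedded_unmapped(SAMPLE))
```

Any font this reports is a candidate for embedding when the PDF is rebuilt. Very long font names that fill the whole name column would need sturdier parsing than this sketch does.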