kreier / timeline

An overview of the human history as a graph in a pdf file.
MIT License

Support for Khmer contains errors #35

Closed kreier closed 1 month ago

kreier commented 3 months ago

In many places I observed a dotted circle inside Khmer words, and it does not look like a Khmer character. Investigating further, it looks like the rendering of Khmer with the NotoSans font does not match the correct writing. To start the investigation I looked at the summary list in the bottom left corner. It should read:

54មនុស្ស 12ចៅក្រម 19ហោរា 53ស្តេច 82រយៈពេល 37ព្រឹត្តិការណ៍ 18វត្ថុឬវត្ថុ 80សមាជិកនៃគ្រួសាររបស់ Terah

But instead this is rendered: image

kreier commented 3 months ago

Imported the text above into Word and rendered it with Noto Sans Khmer. It shows the rendering errors. image

kreier commented 3 months ago

The problem was recorded on Stackoverflow: https://stackoverflow.com/questions/76634531/unicode-characters-not-display-correctly-in-the-converted-pdf-file-using-xhtml2p

kreier commented 3 months ago

The problem was added to the official Google Groups Support forum: https://groups.google.com/g/reportlab-users/c/WHuatWlUUpE

kreier commented 3 months ago

The documentation states: "No special handling at all is needed to work with Asian TrueType fonts." https://docs.reportlab.com/reportlab/userguide/ch3_fonts/

I think this is true for the CJK glyphs in Chinese, Japanese and Korean and simpler diacritics in combined glyphs like in Vietnamese. But the combination of some other Asian language glyphs as needed in Lao, Khmer and Thai seems not to work correctly yet.

kreier commented 3 months ago

I got a nice response from ChatGPT that goes into some detail. The provided example, three Unicode characters combining a base consonant, a dependent vowel and a diacritic sign, does work in reportlab. Here is the answer:

In the Khmer script, combined characters are created using a base consonant followed by one or more dependent vowel signs, subscript consonants, and other diacritical marks. This combination process is similar to other scripts that use complex text rendering. When represented in Unicode and encoded in UTF-8, these combined characters follow a specific order to ensure proper display.

Here's how it works:

  1. Base Consonant: Every syllable starts with a base consonant character. Khmer consonants range from U+1780 to U+17A2 in Unicode.
  2. Dependent Vowels: Dependent vowel signs are attached to the base consonant. These vowels range from U+17B6 to U+17C5. They may appear before, above, below, or after the consonant, and their position is managed by the rendering engine based on their Unicode value and the base consonant they follow.
  3. Subscript Consonants: Subscript forms of consonants, used to write consonant clusters, are encoded using a subscript modifier (U+17D2). The subscript consonant itself is in the range U+1780 to U+17A2, same as the regular consonants.
  4. Diacritical Marks: Various diacritics, such as U+17C6 (KHMER SIGN NIKAHIT) and others from U+17C9 to U+17D3, are used to modify the sounds or meanings of the base consonants and vowels.
  5. Ordering: In Unicode, the base consonant comes first, followed by the subscript consonants (if any, marked by the subscript modifier), then any dependent vowels, and finally any additional diacritics.
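The categories above can be checked for any string with a few lines of Python. This is only a minimal sketch: `khmer_category` and `describe` are hypothetical helper names, and the ranges are my reading of the Unicode Khmer block (consonants U+1780 to U+17A2, dependent vowels U+17B6 to U+17C5, COENG at U+17D2, other signs up to U+17D3), not something taken from reportlab.

```python
import unicodedata

COENG = 0x17D2  # KHMER SIGN COENG, marks the following consonant as subscript


def khmer_category(ch: str) -> str:
    """Return a rough category for one character (hypothetical helper)."""
    cp = ord(ch)
    if 0x1780 <= cp <= 0x17A2:
        return "consonant"
    if cp == COENG:  # checked before the general sign range, which contains it
        return "coeng"
    if 0x17B6 <= cp <= 0x17C5:
        return "dependent vowel"
    if 0x17C6 <= cp <= 0x17D3:
        return "sign"
    return "other"


def describe(text: str) -> list[tuple[str, str, str]]:
    """List (code point, Unicode name, category) for every character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), khmer_category(ch))
        for ch in text
    ]


# The syllable from the example: KA + vowel sign AA + NIKAHIT
for row in describe("\u1780\u17B6\u17C6"):
    print(*row, sep="  ")
```

The same call with "\u179F\u17D2\u179F" shows a consonant, a coeng and a second consonant, which is the combination that fails in the rendered PDF.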

Example of Unicode Sequence

Let's consider the Khmer syllable "កាំ" (kâm):

In Unicode, the sequence would be:

U+1780 (ក) 
U+17B6 (ា) 
U+17C6 (ំ)

When encoded in UTF-8, these characters would be represented as follows:

    • U+1780 (ក): E1 9E 80
    • U+17B6 (ា): E1 9E B6
    • U+17C6 (ំ): E1 9F 86

So, the full UTF-8 sequence for the syllable "កាំ" would be:

E1 9E 80 E1 9E B6 E1 9F 86
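The byte sequence can be reproduced with Python's built-in `str.encode`. A small sketch, where `utf8_hex` is just a hypothetical helper name and the string is written with escapes so the code points stay visible:

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()


syllable = "\u1780\u17B6\u17C6"  # KA + vowel sign AA + NIKAHIT

# One line per code point, then the whole syllable
for ch in syllable:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
print(utf8_hex(syllable))
```

Every code point in the Khmer block needs three bytes in UTF-8, so a syllable of n code points always occupies 3n bytes.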

Rendering

Rendering engines (like those in web browsers or text editors) take this sequence and properly combine and position the characters based on the rules of the Khmer script. The process involves:

In summary, combined characters in the Khmer script are rendered in Unicode using UTF-8 by encoding the base consonant followed by any necessary subscript consonants, dependent vowels, and diacritics in a specific order, which is then interpreted by the rendering engine to display the proper combined form.

kreier commented 3 months ago

However, the Khmer syllable "ស្ស" (ssa) is not rendered correctly. Here is the explanation of what it consists of and how it should be rendered:

The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a subscript consonant. Here’s a detailed breakdown of the Unicode sequence:

Unicode Sequence

  1. Base Consonant:

    • U+179F (ស)
  2. Subscript Consonant:

    • U+17D2 (KHMER SIGN COENG)
    • U+179F (subscript form of SA)

Full Unicode Sequence

Putting these together, the full Unicode sequence for "ស្ស" is:

U+179F (ស) 
U+17D2 (្) 
U+179F (្ស)

UTF-8 Encoding

To represent this sequence in UTF-8, each Unicode code point is converted to its corresponding UTF-8 byte sequence:

    • U+179F: E1 9E 9F
    • U+17D2: E1 9F 92
    • U+179F: E1 9E 9F

Full UTF-8 Sequence

Combining these, the UTF-8 encoding for the sequence "ស្ស" is:

E1 9E 9F E1 9F 92 E1 9E 9F

Rendering Process

In summary, the Unicode sequence for "ស្ស" involves a base consonant followed by a subscript sign and another consonant, encoded and rendered according to the rules of the Khmer script. The UTF-8 encoding ensures each character is correctly represented in byte form, which the rendering engine interprets to display the correct combined character.

kreier commented 3 months ago

There are reports that the problem might be related to only a subset of the font being embedded in the pdf. Here is an observation from 2021:

https://groups.google.com/g/reportlab-users/c/mxVz1vxeZCk

Update 30.05.2024: No, it is not related to embedding a subset. That is standard practice (and in a way sometimes necessary, since a single font in a pdf seems to have only up to 256 characters?) and other examples below show that it works just fine.

kreier commented 3 months ago

My example ស្ស interestingly consists of the same consonant twice, both with the Unicode code point U+179F, but because the indicator U+17D2 (KHMER SIGN COENG) sits between them the second one has to be rendered differently, as a subscript. The current reportlab version does not do that.
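To see which consonants in a string are meant to be drawn as subscripts, it is enough to look for a consonant directly after U+17D2. A minimal sketch with a hypothetical helper `subscript_consonants`; it only inspects the code points and does no rendering at all:

```python
COENG = "\u17D2"  # KHMER SIGN COENG


def subscript_consonants(text: str) -> list[tuple[int, str]]:
    """Return (index, consonant) for every consonant that follows a COENG."""
    found = []
    for i in range(1, len(text)):
        # Khmer consonants occupy U+1780 to U+17A2
        if text[i - 1] == COENG and "\u1780" <= text[i] <= "\u17A2":
            found.append((i, text[i]))
    return found


ssa = "\u179F\u17D2\u179F"              # SA + COENG + SA
king = "\u179F\u17D2\u178F\u17C1\u1785"  # SA + COENG + TA + vowel sign E + CA

print(subscript_consonants(ssa))   # the second SA, at index 2
print(subscript_consonants(king))  # the TA, at index 2
```

A renderer that draws one glyph per code point has no way to use this information, which matches the dotted circle and the full-size second consonant in the generated pdf.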

I tried a different Python package to create a PDF file, PyMuPDF, but in order to render new pages with non-Latin fonts it uses the package fonttools, and I got the same result. There is actually an open issue from 2021 regarding a similar problem with the character ឃើ : https://github.com/fonttools/fonttools/issues/2387 and it actually started in 2020 with Google fonts (all are still open):

It might actually be that this whole problem is related to an old implementation of HarfBuzz for OpenType fonts. My TrueType fonts might be an older subset of these.

kreier commented 3 months ago

The current repository for reportlab (4.2.1) can be found on their website as a Mercurial repository: https://hg.reportlab.com/hg-public/reportlab . It is mirrored to GitHub at https://github.com/MrBitBucket/reportlab-mirror

The part that is responsible to render the TrueType fonts (I think) is https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py - last edited 4 months ago by robin. The strings we want to use and embed are 16-bit Unicode characters mentioned in the introduction.

Our example character "ស្ស" will be embedded into a generated pdf file. Both for Word and reportlab this results in a 10 kByte file, but the object streams inside the pdf are different, and the reportlab rendering is not correct. Libreoffice creates yet another pdf file with other streams containing the subset of the font file, but it manages to correctly embed this glyph in only a 5 kByte pdf. The character streams are the main source for the size difference. Yet, how to get reportlab to correctly render these characters - I have no clue yet.

kreier commented 3 months ago

Another option would be iText - in the iText Core version 8 community edition. But I would have to move from Python to Java or .NET (C#). The Community edition should be open source and it supports Khmer since version 7.0.4 in 2017.

kreier commented 3 months ago

Hi Khaled,

Thanks for looking into this. I know too little about Unicode code points, glyph indices and font subsetting. With this timeline project I try to learn on the fly. I noticed some glyphs in Khmer and Sinhala were rendered differently in the generated pdf than in the browser, editor or even Word. To investigate further I created two test scripts, one with reportlab and one with pymupdf & fonttools:

Both create the same flawed glyph combination (instead of ស្តេច ហោរា and සමුළුව):

image in Khmer and Sinhala

In an earlier attempt I tried to set the option and create a subset with the pymupdf and fonttools version. I got some error messages ("not supported") when trying to set layoutFeatures or text. Probably a syntax error on my side with this library. And while khmer_unicode_range = range(0x1780, 0x1800) and the subsetter option are in the program, the created pdf lists the font as Embedded, not Embedded Subset.

In the earlier mentioned example I did not use the subset generation in the fonttools example and got a larger file (125 kByte) compared to the reportlab version (22 kByte). Acrobat Reader states that the reportlab version contains an Embedded Subset of NotoSans Khmer and Sinhala, while the larger fonttools version only states that these two fonts are embedded, not a Subset.

To me that's an indication that the problem is more than a flawed subset generation, since the fonttools version has the complete font embedded, but the rendering is still flawed in the same way. So there are no missing glyphs in the embedded font. When I highlight the rendered text and copy/paste it into an editor/web browser/Word I get the correct content. So I think the Unicode code points are unchanged, even though the glyph indices are not correct, right? I think this is part of the philosophy to have them separate in TrueType? Again, I know close to nothing about it. Maybe you can help me with this one.

And thanks for having a look at the "proof of concept" Arabic version of my project. The utf-8 strings are not yet passed through the arabic_reshaper and bidi.algorithm.get_display packages (just something I found on stackoverflow). It's just replacing the english string and sending it to the renderer of reportlab. Surprisingly when highlighting some parts of the text it has some RTL behaviour as on websites or programs that are in RTL. So including reshaper and other converters will be one of the further steps to actually have an Arabic version. And definitely a native speaker to check the translation, Azure Translator and Google made enough mistakes in the languages I do speak a little or have friends knowing them.

Again, thanks for taking time to look at this - and maybe you can help me with the current Khmer and Sinhala rendering problems.

Matthias

On Thu, 23 May 2024 at 13:00, خالد حسني (Khaled Hosny) <@.***> wrote:

The HarfBuzz and FontTools issues are related to subsetting using Unicode code points as input. This kind of subsetting is typically to make fully functional fonts with smaller character set (e.g. used for web fonts to serve smaller files that cover only the page content). PDF subsetting is a lot simpler since for PDF only glyphs used are needed and subsetting uses glyph indices as input not Unicode code points.

The problem seems to be that the tool/library used to generate the PDFs does not do proper text layout. The Arabic PDF https://timeline24.github.io/timeline_ar.pdf linked from the README is completely unreadable: the text is set left-to-right (it should be right-to-left) and letters that should join/change shape are not joined. There are even characters missing from the font that are rendered as empty boxes (which suggests no font fallback is performed, which is another totally different issue).

kreier commented 3 months ago

This is the rendered output of both python programs above (not shown in an email reply, not even when edited):

image in Khmer and Sinhala

Correct is:

ស្តេច ហោរា and සමුළුව

kreier commented 3 months ago

The 13 lines of example code to test Khmer and Sinhala are:

# example rendering in some languages
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
matrix = [["Khmer", "King Prophet", "ស្តេច ហោរា"],
          ["Sinhala", "Conference", "සමුළුව"]]
my_canvas = canvas.Canvas("example_reportlab.pdf")
for i in range(len(matrix)):
    pdfmetrics.registerFont(TTFont(matrix[i][0], '../../fonts/Noto' + matrix[i][0] + '.ttf'))
    my_canvas.setFont(matrix[i][0], 32)
    my_canvas.drawString(72, 749-90*i, f"Language {matrix[i][0]}:")
    my_canvas.drawString(72, 713-90*i, f"Word '{matrix[i][1]}' - {matrix[i][2]}") 
my_canvas.save()

image in Khmer and Sinhala

I'll try iText to see if it is an option (with Java or C#). Out of the box these Unicode characters are not rendered, but let's see.

kreier commented 3 months ago

iText renders the same result. In the iText Demo Lab you can create your own pdf document with an embedded Java code editor. I entered the following:

import com.itextpdf.kernel.pdf.*;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import java.io.*;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.io.font.PdfEncodings;

public class HelloWorld {
  public static final String DEST = "/myfiles/example_iText.pdf";
  public static final String FONT_KHMER = "/uploads/NotoKhmer.ttf";
  public static final String FONT_SINHALA = "/uploads/NotoSinhala.ttf";
  public static final String KHMER = "ស្តេច ហោរា";
  public static final String SINHALA = "සමුළුව";

  public static void main(String args[]) throws IOException {
    PdfDocument pdf = new PdfDocument(new PdfWriter(DEST));
    Document document = new Document(pdf);
    document.setFontSize(30).add(new Paragraph("Language Khmer:"));
    PdfFont fontKhmer = PdfFontFactory.createFont(FONT_KHMER, PdfEncodings.IDENTITY_H);
    document.add(new Paragraph().setFont(fontKhmer).setFontSize(30).add("Word 'King Prophet'  ").add(KHMER));

    document.setFontSize(30).add(new Paragraph("\nLanguage Sinhala"));
    PdfFont fontSinhala = PdfFontFactory.createFont(FONT_SINHALA, PdfEncodings.IDENTITY_H);
    document.add(new Paragraph().setFont(fontSinhala).setFontSize(30).add("Word 'Conference'  ").add(SINHALA));
    document.close();
  }
}

The render problems are the same as mentioned above. https://github.com/kreier/timeline/issues/35#issuecomment-2128706454

image

kreier commented 3 months ago

And we're getting closer to an answer: Many scripts and glyphs are supported in the core of iText (including Russian, Armenian, Greek, Chinese, Japanese, Korean) but my two problem languages Khmer and Sinhala require the module pdfCalligraph. It actually covers 14 scripts for more than 51 languages.

kreier commented 3 months ago

Back to reportlab: An older conversation on the reportlab Google group from 2015 talks about the composite glyph positioning with responses from Glenn Lindermann, Robin Becker and Andy Robinson. And some Unicode and ttf history of reportlab.

kreier commented 3 months ago

Using a shaping engine like harfbuzz does the job. I got it working in fpdf2 after pip install uharfbuzz:

# example rendering Khmer
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
pdf.cell(text="King        - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.ln()
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King        - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.output("example_fpdf.pdf")

Result:

image

Let's see if I can import this into reportlab.

timeline24 commented 3 months ago

And with the embedded subset of the font in Type TrueType (CID) and encoding: Identity-H it has only 7 kByte size, half the size of the Word and Google Docs solution.

kreier commented 3 months ago

It looks like fpdf2 had a similar problem in 2022. This issue https://github.com/py-pdf/fpdf2/issues/365 mentions Khmer (https://github.com/py-pdf/fpdf2/issues/700) among Arabic, Hindi and other languages. With the switch to the fontTools library and harfbuzz in pull request 477 https://github.com/py-pdf/fpdf2/pull/477 it seems many other issues were resolved. gmischler describes the changed approach in issue 418 https://github.com/py-pdf/fpdf2/issues/418. By 2023 the implementation appears to be stable. @andy-robinson @replabrobin Is it possible to do something similar in reportlab?

In a forum post from 2015 https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ Andy Robinson mentions: "We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen." In the same post he describes his work on Japanese in 2002-2003 (that's why my CJK versions have no problem) and that around 2009 an Arabic-speaking employee worked on the project. I could not find a specific reference in the source code, but a working solution on Stack Overflow includes https://pypi.org/project/arabic-reshaper/ and bidi.algorithm.

The function to create the embedded subset of the TTF font is part of the https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py file. Is this the place where the ligature substitutions needed for Khmer, Sinhala and many other languages should be integrated?

kreier commented 3 months ago

Here are a few more details of the integration of harfbuzz with uharfbuzz from a proof-of-concept to finalization in early 2023: https://github.com/py-pdf/fpdf2/discussions/696

There are also some testfiles linked for Thai. Might be worth checking out, since the Thai script was developed from the Khmer one a few centuries ago.

kreier commented 3 months ago

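Back to basics. Let's take the simple ឆ្នាំ which translates to 'years'. It consists of 5 codepoints:

    • U+1786 Khmer Letter Cha
    • U+17D2 Khmer Sign Coeng
    • U+1793 Khmer Letter No
    • U+17B6 Khmer Vowel Sign Aa
    • U+17C6 Khmer Sign Nikahit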

When copying and pasting the rendered glyphs from the pdf we get ឆ្នា􀀖 ំ as the result. Codepoints.net finds 1431 codepoints in here. With the help of a little python program:

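# the string as pasted from the pdf, written with escapes so the extra codepoints stay visible
text = "\u1786\u17D2\u1793\u17B6\U00100016 \u17C6"
for char in text:
    print(f"Character '{char}' has codepoint {ord(char):X}")

We get 7 codepoints. It seems this is the sequence the shaping engine produced:

The first 4 codepoints are unchanged, but then '100016' and '20' are inserted.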

replabrobin commented 2 months ago

Hi Matthias, I'm having difficulty emailing directly. It seems you post in a Google 'reportlab-users' group. Our official mail list is not run by me, but has the address https://two.pairlist.net/pipermail/reportlab-users/. I imagine you would like us to support proper harfbuzz shaping etc etc.

I would like to integrate uharfbuzz into the reportlab paragraph code, but there are a number of issues which I don't yet have solutions for.

I have no experience of the khmer codes, but when I tried your example above I didn't get the same outcome. After shaping I get only three outputs, so the code below produces

uni178617B6 gid248=0@923,0+923 uni17D21793 gid209=0@0,-26+0 uni17C6 gid137=0@0,-29+0

#!/bin/env python
import uharfbuzz as hb

if False:
    import sys
    fontfile = sys.argv[1]
    text = sys.argv[2]
else:
    fontfile = '/home/robin/devel/reportlab/REPOS/reportlab/tmp/NotoSansKhmer/NotoSansKhmer-Regular.ttf'
    #1786 Khmer Letter Cha
    #17D2 Khmer Sign Coeng
    #1793 Khmer Letter No
    #17B6 Khmer Vowel Sign Aa
    #17C6 Khmer Sign Nikahit
    text = '\u1786\u17D2\u1793\u17B6\u17C6'

blob = hb.Blob.from_file_path(fontfile)
face = hb.Face(blob)
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str(text)
buf.guess_segment_properties()

features = {"kern": True, "liga": True}
hb.shape(font, buf, features)

infos = buf.glyph_infos
positions = buf.glyph_positions

for info, pos in zip(infos, positions):
    gid = info.codepoint
    glyph_name = font.glyph_to_string(gid)
    cluster = info.cluster
    x_advance = pos.x_advance
    x_offset = pos.x_offset
    y_offset = pos.y_offset
    print(f"{glyph_name} gid{gid}={cluster}@{x_advance},{y_offset}+{x_advance}")

kreier commented 2 months ago

Hi Robin @replabrobin,

Thanks for answering here. Yes, it would be great if harfbuzz could be integrated into reportlab!! I tried to sign up for the email list but got no response. And I posted some questions in the Google group, but this group probably needs some cleanup. Anyway, back to the question of the shaping engine.

I think the Khmer glyph "ឆ្នាំ" is a good example (it means years). In Unicode it is represented with 5 codepoints, '\u1786\u17D2\u1793\u17B6\u17C6'. Without shaping, the 5 codepoints are mapped to 5 separate font glyphs, each with its individual width, and this does not give the correct final glyph. I combined the result you got from uharfbuzz with NotoSansKhmer (I got the same results) with https://fontdrop.info/. Now there are only three glyphs, plus some additional information about how to shift them within the combined glyph:

image

Above are the 3 glyphs uni178617B6, uni17D21793 and uni17C6. Only the last one corresponds to a single Unicode codepoint; the others exist only inside the font as glyphs. Since the individual glyphs have to be positioned correctly, it's not possible to just pass the string of updated glyphs to be included in the pdf; each glyph has to be put in the correct position by the python script that writes the glyphs into the pdf.

I guess currently there is already some part of glyph positioning integrated in reportlab, now it needs to additionally process the location output from harfbuzz for the glyph position, not just the information included in the font for each glyph.

I'm sure this will be a considerable effort to integrate - I've seen a little of the work done at fpdf2 in the last 2 years - but maybe I can at least help a little with beta-testing. Just recently a small bug was fixed https://github.com/py-pdf/fpdf2/issues/1187

kreier commented 2 months ago

In the post to fpdf2 mentioned above gmischler explains the steps fpdf2 takes to integrate a Unicode string into the correct sequence of glyphs. He wrote:

  1. fpdf2 accepts a sequence of characters, and passes it to pyharfbuzz.
  2. pyharfbuzz converts the python string to a C structure and passes it to harfbuzz.
  3. harfbuzz consults the font file, combines character sequences into glyph clusters, and adds the width information given in the font file to each cluster.
  4. pyharfbuzz converts the result back into python data
  5. fpdf2 uses the returned width information for line wrapping, and adds the resulting line data into the PDF stream.
  6. A PDF viewer reads that stream, and needs to figure out where to place the glyphs on the page.

I think it should be a similar sequence for reportlab.
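Step 5 can be illustrated without harfbuzz itself, using only the advance widths that the shaper returns. This is a rough sketch and not how fpdf2 or reportlab actually do it: `width_pt` and `wrap_chunks` are hypothetical names, the 1000 units per em and the gap of 260 font units are assumptions, and real Khmer line breaking is more involved because Khmer is normally written without spaces between words.

```python
def width_pt(advances, font_size, units_per_em=1000):
    """Width in points of a run of shaped glyphs, given advances in font units."""
    return sum(advances) * font_size / units_per_em


def wrap_chunks(chunks, max_width_pt, font_size, gap_advance=0, units_per_em=1000):
    """Greedy line wrapping over (label, advances) chunks.

    Each chunk is kept whole; gap_advance is an assumed spacing between
    chunks on the same line, in font units.
    """
    gap = width_pt([gap_advance], font_size, units_per_em)
    lines, current, current_width = [], [], 0.0
    for label, advances in chunks:
        w = width_pt(advances, font_size, units_per_em)
        needed = w if not current else current_width + gap + w
        if current and needed > max_width_pt:
            lines.append(current)          # line is full, start a new one
            current, current_width = [label], w
        else:
            current.append(label)
            current_width = needed
    if current:
        lines.append(current)
    return lines


# Three copies of the shaped syllable from above: advances 923, 0 and 0
chunks = [("A", [923, 0, 0]), ("B", [923, 0, 0]), ("C", [923, 0, 0])]
print(width_pt([923, 0, 0], 32))
print(wrap_chunks(chunks, max_width_pt=70, font_size=32, gap_advance=260))
```

The point of the sketch is that the width of a cluster comes from the shaped advances (923 + 0 + 0), not from adding up the widths of the five individual code points.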

And I found the value of advance width of mark attached glyphs. The first returned glyph from harfbuzz uni178617B6 actually has a width of 923, as indicated with the response "uni178617B6 gid248=0@923,0+923". It can be seen with https://www.glyphrstudio.com/app/

image

As requested by your code print(f"{glyph_name} gid{gid}={cluster}@{x_advance},{y_offset}+{x_advance}") the value of x_advance for each glyph is returned (zero for the next two, uni17D21793 and uni17C6).

image

The {y_offset} values indicate that their location should be slightly adjusted in the final glyph. Not sure how this caused a problem in the fpdf2 string_width calculation, since the advanceWidth values are 923, 0 and 0 and it looks like 923 is the correct value.

kreier commented 2 months ago

I changed the last line of your code to print the offset for x and y as returned by harfbuzz

print(f"{glyph_name} \t gid{gid}={cluster} \t advanceWidth: {x_advance} \t offset x:{x_offset} y:{y_offset}")

The output is

uni178617B6      gid248=0        advanceWidth: 923       offset x:0 y:0
uni17D21793      gid209=0        advanceWidth: 0         offset x:-296 y:-26
uni17C6          gid137=0        advanceWidth: 0         offset x:47 y:-29

Which verifies the shifted location of the two additional glyphs seen in the combined glyph when rendered as ឆ្នាំ in two posts above.
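To make the numbers concrete, here is a small sketch of how a PDF writer could turn this output into drawing positions. It assumes the usual harfbuzz convention for horizontal text: each glyph is drawn at the current pen position plus its offset, and the pen then moves on by the advance. `ShapedGlyph` and `place_glyphs` are hypothetical names; nothing here is reportlab code, and y advances are ignored.

```python
from typing import NamedTuple


class ShapedGlyph(NamedTuple):
    name: str
    x_advance: int
    x_offset: int
    y_offset: int


def place_glyphs(glyphs, origin_x=0, origin_y=0):
    """Return ([(name, x, y), ...], final pen x) in font units."""
    pen_x = origin_x
    placed = []
    for g in glyphs:
        # draw at pen position plus offset, then advance the pen
        placed.append((g.name, pen_x + g.x_offset, origin_y + g.y_offset))
        pen_x += g.x_advance
    return placed, pen_x


# Values copied from the harfbuzz output above
shaped = [
    ShapedGlyph("uni178617B6", 923, 0, 0),
    ShapedGlyph("uni17D21793", 0, -296, -26),
    ShapedGlyph("uni17C6", 0, 47, -29),
]

positions, end_x = place_glyphs(shaped)
for name, x, y in positions:
    print(f"{name:12} x={x:5} y={y:4}")
print("pen ends at", end_x)
```

With these values the subscript lands at x=627 and the nikahit at x=970, both measured from the start of the base glyph, while the pen ends at 923. To get points, multiply by font size divided by the font's units per em (which I have not checked for NotoSansKhmer).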

kreier commented 1 month ago

This problem is addressed in the official reportlab forum: https://groups.google.com/g/reportlab-users/c/WHuatWlUUpE

For me this is solved now after switching to fpdf2. I might return to reportlab in the future when the font shape engine is implemented.