Issue on Khmer Unicode Font Subscripts

kuth-chi commented 1 month ago

Describe the bug

Error details: I have an issue rendering Khmer Unicode text to a PDF output file. Instead of the expected Khmer text "សួស្តី ពីភពលោក", I get output in which the subscript characters are rendered incorrectly.

Below is a typing test of the fonts on Google Fonts:

Minimal code

from fpdf import FPDF
import os 

CURRENT_PATH = os.path.dirname(os.path.abspath(__file__))
fonts_dir = os.path.join(CURRENT_PATH, '..', 'test', 'fonts')

pdf = FPDF()
pdf.add_page()
pdf.set_text_shaping(True)

# NOTE: `text` is not defined in the snippet as posted, and no font is set yet at this point
for txt in text.split("\n"):
    pdf.write(8, txt)
    pdf.ln(8)

# KhmerOS is the font most commonly used for official Khmer documents (regular style, scalable)
pdf.add_font(fname=os.path.join(fonts_dir,"KhmerOS.ttf"))
pdf.set_font("KhmerOS", size=14)
pdf.write(8, "Khmer: សួស្តី ពិភពលោក")
pdf.ln(20)

# KhmerMoul is the font most commonly used for official Khmer documents (Moul style, scalable)
pdf.add_font(fname=os.path.join(fonts_dir,"KhmerMoul.ttf"))
pdf.set_font("KhmerMoul", size=14)
pdf.write(8, "Khmer: សួស្តី ពិភពលោក")
pdf.ln(20)

fn = "unicode.pdf"
pdf.output(fn)
...

Environment

Operating System: Windows,
Python version: 3.11.3
fpdf2 version used: 2.7.9

kuth-chi commented 1 month ago

The issue is solved, please close this. Thank you.

gmischler commented 1 month ago

Good to hear! Can you explain what the problem was, and how you solved it? Knowing that might be helpful for others, if they encounter a similar issue.

kreier commented 4 weeks ago

I guess the text shaping in line 9 needs to be specific regarding script and language. This worked for me with similar problems. The shape engine does not automatically detect the font and language used. It might be implemented for one of the 173 scripts, but certainly not for one of the 634 possible languages used by these scripts. Therefore in most cases (if it's more than ligatures in English documents) this information needs to be added. My solution was

pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")

But let's see what @kuth-chi answers.

kuth-chi commented 4 weeks ago

Yes, I just fixed it that way for urgent usage, but it is still a problem for general usage. We need to implement or integrate this properly in fpdf2.

kreier commented 4 weeks ago

I've seen text shaping working with pdf.cell() and pdf.write() (as in the example here). But I wanted to position each string exactly on the page, so I used pdf.text(). The shaping engine does not seem to work with pdf.text(). Here is the example code:

# example rendering Khmer
from fpdf import FPDF
pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")
pdf.set_font("Helvetica", size=12)
pdf.cell(h=20, text="Now using __text_shaping__ with **uharfbuzz** in pdf.cell() and pdf.write():", markdown=True, new_x="LMARGIN", new_y="NEXT")
pdf.set_font("noto", size=32)
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King        - ស្តេច", new_x="LMARGIN", new_y="NEXT")
pdf.cell(text="Prophet - ហោរា",     new_x="LMARGIN", new_y="NEXT")
pdf.write(text="King - ស្តេច, Prophet - ហោរា")
pdf.text(10, 110, "King        - ស្តេច")
pdf.text(10, 121, "Prophet - ហោរា")
pdf.set_font("Helvetica", size=12)
pdf.text(10, 95, "Does not work if you use pdf.text() instead of pdf.cell()")
pdf.output("example_fpdf.pdf")

Output:

kreier commented 4 weeks ago

I used to position text exactly with pdf.text(x_value, y_value, textstring), but text shaping does not work with pdf.text(). The solution with pdf.cell() needs a few additional steps:

pdf.set_margin(0)             # avoid overflow to the next page if the text is too close to any border
pdf.c_margin = 0              # removes the 2.83pt (1 mm) cell margin before the start of the string
pdf.set_xy(x_value, y_value)
pdf.cell(text=textstring)     # pass the string as text=; the first positional argument of cell() is the width

I could not find the .c_margin = 0 in the documentation. Might be a possible update. Text shaping works now, but I discovered two challenges. One is related to pdf.get_string_width(text) for the above Khmer examples. I use it to render some strings with right align, but the results are inconsistent. See below:

If I deactivate the shape engine with pdf.set_text_shaping(use_shaping_engine=False), the returned values are correct:

The second problem is that the Unicode sequence originally representing the Khmer text is replaced with the character sequence that gives the correct result for the rendering engine - but no longer represents a Khmer Unicode text. If I highlight the text in my PDF viewer, the copied sequence contains unintelligible codepoints. Can this be separated in the source code - rendering instructions and what it should represent in Unicode?

andersonhc commented 4 weeks ago

When you don't specify the script and language, fpdf2 calls harfbuzz with the option to guess the text properties. According to the documentation of the guess function, it applies the script of the first character it finds that has a valid Unicode script.

In this example the string contains English + Khmer, so Khmer shaping is not applied: the text is guessed to be Latin script, because the English text comes first.
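
The "first character with a valid script" behaviour described above can be sketched in a few lines. This is a simplified illustration only, not HarfBuzz's actual code: the function name guess_script is made up for this example, and two hand-picked ranges (basic Latin letters and the Khmer block U+1780 to U+17FF) stand in for the full Unicode script database.

```python
def guess_script(text):
    """Return a script tag for the first character that has a definite script.

    Simplified stand-in for a shaping engine's guess: only basic Latin
    letters and the Khmer block are recognised. Digits, spaces and
    punctuation carry no definite script and are skipped.
    """
    for ch in text:
        cp = ord(ch)
        if 0x1780 <= cp <= 0x17FF:          # Khmer block
            return "Khmr"
        if ch.isascii() and ch.isalpha():   # basic Latin letters
            return "Latn"
    return None                             # nothing with a definite script


# The label "Khmer:" comes first, so the whole string is guessed as Latin.
print(guess_script("Khmer: សួស្តី ពិភពលោក"))
# Without the Latin prefix, the first definite script is Khmer.
print(guess_script("សួស្តី ពិភពលោក"))
```

Under this model, either passing script and language explicitly or keeping Latin labels and Khmer text in separate calls avoids the wrong guess.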

andersonhc commented 4 weeks ago

fpdf.text() is deprecated and only kept in the codebase for backwards compatibility. It doesn't support text shaping, markdown and many other features.

andersonhc commented 4 weeks ago

Thank you for reporting these issues. I will take a look as soon as possible.

kuth-chi commented 4 weeks ago

I have worked around it by abstracting the shaping with uharfbuzz, but that is not a good way to do it.

We would like to solve the issue in a way that can be used generally in fpdf2.

kreier commented 4 weeks ago

I created an example script to demonstrate the two problems. With some selected examples it's easier to see. The measurements from pdf.get_string_width(text) and the updated x-position from pdf.get_x() seem to match, but they are actually not correct. I added another symbol after the rendered cell, and in some cases it overlaps with the rendered string. See pictures below.

The example also demonstrates the second problem with the replaced copy/paste values of the rendered string. The three example strings in the pdf are initially all the same. One could easily compare and test them with Google Translate and copy/paste:

# example rendering Khmer
from fpdf import FPDF
pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.c_margin = 0
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
# teststring = ["ពុម្ពអក្សរស្មុគស្មាញ", "គឺជាលក្ខណៈ", "នៃភាសាខ្មែរ។"]
# teststring = ["អំរី (ម្នាក់ឯង)", "យេរ៉ូបោម", "យ៉ូសៀស"]
teststring = ["ស៊ីមរី (៧ ថ្ងៃ)", "អេឡា (២ ឆ្នាំ)", "អំរី ( ម្នាក់ ឯង )  (៨ ឆ្នាំ)"]
pdf.set_font("noto", size=12)
pdf.cell(text="The following text consists of three cells. We determine the width before and after rendering.")
pdf.ln()

pdf.set_font('noto', size=24)
teststringlength = []
teststringlength_measured = []
for i in range(len(teststring)):
    teststringlength.append(pdf.get_string_width(teststring[i]))
    start = pdf.get_x()
    pdf.cell(h=17, text=teststring[i])
    end = pdf.get_x()
    pdf.cell(h=17, text="—")
    teststringlength_measured.append(end - start)
pdf.ln()
pdf.set_font("noto", size=12)
for i in range(len(teststringlength)):
    pdf.cell(text=f"Before: {teststringlength[i]}  -  after: {teststringlength_measured[i]}")
    pdf.ln()
pdf.ln()

pdf.cell(text="Now activating the shape engine and try this again:")
pdf.ln()
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")

pdf.set_font('noto', size=24)
teststringlength = []
teststringlength_measured = []
for i in range(len(teststring)):
    teststringlength.append(pdf.get_string_width(teststring[i]))
    start = pdf.get_x()
    pdf.cell(h=17, text=teststring[i])
    end = pdf.get_x()
    pdf.cell(h=17, text="—")
    teststringlength_measured.append(end - start)
pdf.ln()
pdf.set_font("noto", size=12)
for i in range(len(teststringlength)):
    pdf.cell(text=f"Before: {teststringlength[i]}  -  after: {teststringlength_measured[i]}")
    pdf.ln()
pdf.ln()

pdf.c_margin = 1
pdf.set_font('noto', size=24)
for i in range(len(teststring)):
    pdf.cell(text=teststring[i])
    pdf.cell(text="—")
pdf.ln()
pdf.cell(h=17, text=teststring[0]+"—"+teststring[1]+"—"+teststring[2]+"—")

pdf.output("example_fpdf2_stringwidth.pdf")

The strange overlap in the middle rendering is a result of setting the cell margin to zero (pdf.c_margin = 0). With the expected value of 1 millimeter it looks better, but only the combined string rendered in a single run (seen in the last line) is correct:

kreier commented 3 weeks ago

I updated my comment with other strings and an additional dash after the three example strings to visualize that both the result of pdf.get_string_width(text) and the updated x-position from pdf.get_x() are incorrect after applying the shape engine.

gmischler commented 3 weeks ago

Thanks for researching this, @kreier !

It would be interesting to figure out in more detail which glyph combinations can cause unexpected width determinations, and why.

Unfortunately, it is quite possible that we will be unable to create a fix within fpdf2, since we are dependent on the harfbuzz library to do the actual work, and harfbuzz is again dependent on the information it finds in the font file.

1. fpdf2 accepts a sequence of characters, and passes it to uharfbuzz.
2. uharfbuzz converts the Python string to a C structure and passes it to harfbuzz.
3. harfbuzz consults the font file, combines character sequences into glyph clusters, and adds the width information given in the font file to each cluster.
4. uharfbuzz converts the result back into Python data.
5. fpdf2 uses the returned width information for line wrapping, and adds the resulting line data into the PDF stream.
6. A PDF viewer reads that stream, and needs to figure out where to place the glyphs on the page.

Theoretically, at any of the above steps something could go wrong. In practise, the font file, harfbuzz' interpretation of its contents, and the rendering by the viewer are most error prone. The python components primarily just pass on the information they receive.

The Khmer script uses many ligatures, conjuncts, and diacritics, so harfbuzz needs to perform many substitutions. If a font file contains slightly incorrect width information for just one of the combined glyphs, then that may result in the effect you are observing.
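
To make the width argument concrete, here is a minimal sketch of how the width of a shaped run follows from per-glyph advances. The function name run_width and all numbers are assumptions chosen for illustration; they are not taken from any real font or from fpdf2's code. The only thing relied on is the usual convention that advances are given in font units and scaled by font size over units per em.

```python
def run_width(advances, units_per_em, font_size):
    """Width of a shaped run: sum of glyph advances scaled to the font size."""
    return sum(advances) * font_size / units_per_em


# Hypothetical run of two base glyphs plus one attached mark, 24pt, 1000 units per em.
expected = run_width([600, 0, 550], 1000, 24)    # mark contributes no advance
shifted = run_width([600, 40, 550], 1000, 24)    # mark wrongly carries 40 units

print(expected, shifted, shifted - expected)
```

A single glyph with an unexpected advance changes the measured width of the whole run, and every string placed after it inherits the offset.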

Did you try the same experiment with a different font? Do you think you can figure out which specific character combination(s) is/are causing the positional shift?
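
One way to start narrowing down the character combinations, without knowing the script, is to list the code points of a problematic string together with their Unicode names and general categories. The sketch below uses only the standard library; the helper name describe is made up for this example. Categories starting with "M" are combining marks, and KHMER SIGN COENG is the sign that introduces a subscript consonant.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# One of the words used in the examples in this thread.
for code_point, category, name in describe("ស្តេច"):
    print(code_point, category, name)
```

Comparing such listings for strings that measure correctly and strings that do not may show which marks or conjuncts are involved.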

kreier commented 3 weeks ago

Hi @gmischler,

Thanks for the detailed breakdown of the steps involved! I'll certainly have to come back to these steps at some point in the future. As for Khmer, I know neither the script nor the language, so it's difficult to see which Unicode code points combine to give a letter, consonant, conjunct or ligature, or add diacritics, to form a word, sentence or idiom. I'm planning to visit Cambodia at the end of the month to talk to some native speakers.

But I just found the solution for problem No. 1: it's the font (or typeface)! Both Noto Khmer sans and serif produce the problem (see below). Replacing the font with the Google font for Khmer solved the problem immediately. Here is my new test code and the correct result:

from fpdf import FPDF
pdf = FPDF(orientation="P", unit="mm", format="A4")
pdf.add_page()
pdf.c_margin = 0
# pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.add_font("noto", style="", fname="../../fonts/Khmer-Regular.ttf")
# strings:    years,  alone,    1st year, sword,  fork, objects
teststrings = ["ឆ្នាំ", "ម្នាក់ ឯង", "ឆ្នាំទី១", "ដង្កាវ", "ង្គ្រា", "វត្ថុ"]

def render_strings(teststrings):
    pdf.set_font('noto', size=24)
    pdf.set_draw_color(160)
    pdf.set_line_width(0.3)
    for string in teststrings:
        pdf.rect(pdf.get_x(), pdf.get_y()+2, pdf.get_string_width(string), 13, style="D")
        pdf.cell(h=17, text=string + " ")
    pdf.ln()

def info(text):
    pdf.set_font("Helvetica", size=12)
    pdf.cell(text=text)
    pdf.ln()

info("Rendering without shape engine:")
render_strings(teststrings)

info("Now activating the shape engine and try this again:")
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
render_strings(teststrings)

render_strings([''.join(teststrings)])
pdf.output("fpdf2_stringwidth.pdf")

For comparison, this is the result with NotoSansKhmer-Regular.ttf, where pdf.get_string_width() under-measures the strings:

kreier commented 3 weeks ago

The error is now reported at https://github.com/notofonts/notofonts.github.io/issues/46

andersonhc commented 3 weeks ago

I just merged #1193. It should solve:

Automatically detecting the script, so you don't have to force the script and language when you enable text shaping.
Error in the text width calculation.

I still don't have a final solution for the third problem: an extra character being added when you copy/paste from your PDF. The issue arises in scenarios where the text shaper resolves a single Unicode code point into two glyphs (normally it is the other way round: a ligature is one glyph for two code points). fpdf2 creates a mapping associating the first glyph with the Unicode code point and doesn't associate anything with the second glyph. However, the reader doesn't handle this well and copies the glyph ID when no associated character is found.

I will probably have to implement Tagged PDF /ActualText to handle this case.
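
The copy/paste behaviour described above can be simulated with a toy model. Everything here is hypothetical: the glyph IDs 101 and 102, the mapping, and the two function names are made up, and the fallback to the raw glyph ID only mimics what the viewer is reported to do. It is not fpdf2 or PDF viewer code.

```python
def extract_text(glyph_ids, to_unicode):
    """Mimic a viewer that copies the raw glyph ID when a glyph has no mapping."""
    return "".join(to_unicode.get(gid, chr(gid)) for gid in glyph_ids)


def extract_with_actual_text(glyph_ids, to_unicode, actual_text=None):
    """Mimic /ActualText: if replacement text is given, it wins over the glyph mapping."""
    if actual_text is not None:
        return actual_text
    return extract_text(glyph_ids, to_unicode)


# One code point (KHMER VOWEL SIGN OO) assumed to be shaped into two glyphs.
glyphs = [101, 102]
mapping = {101: "\u17c4"}           # only the first glyph gets a mapping

print(repr(extract_text(glyphs, mapping)))                        # vowel sign plus a stray character
print(repr(extract_with_actual_text(glyphs, mapping, "\u17c4")))  # just the vowel sign
```

In this model the stray character is simply the unmapped glyph ID interpreted as a code point, which is why the copied text looks unintelligible.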

andersonhc commented 3 weeks ago

You can install the most recent version with

pip install git+https://github.com/py-pdf/fpdf2.git@master

kreier commented 3 weeks ago

I updated with pip install git+https://github.com/py-pdf/fpdf2.git@master and still get the same result. pip show fpdf2 just shows the version number 2.7.9. I hope that the updated library is used (no error message from installation) but is there a way to know since the version number is still the same?

Simon Cozens replied to the get_string_width issue at https://github.com/notofonts/notofonts.github.io/issues/46 and mentions the advance width of mark-attached glyphs, which should be taken into account. I don't have a glyph editor or viewer yet to look for these parameters. Maybe I'll try glyphsapp with the 30-day trial. And he was right: all the other fonts I used for Khmer were created by Dan Hong!

simoncozens commented 3 weeks ago

I updated with pip install git+https://github.com/py-pdf/fpdf2.git@master and still get the same result.

Works for me:

andersonhc commented 3 weeks ago

Try uninstalling fpdf2 with pip uninstall fpdf2, check with pip freeze that it is really gone, and then install from GitHub.
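
Since the version number alone does not distinguish a release from a build of master, one possible check is to look at the installation metadata. The sketch below uses only the standard library; the helper name install_info is made up for this example. It relies on pip recording a direct_url.json file (PEP 610) for packages installed from a URL or a VCS checkout, which is absent for a normal install from PyPI.

```python
from importlib import metadata


def install_info(dist_name):
    """Return version and direct-URL metadata for a distribution, or None if not installed."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return {
        "version": dist.version,
        # JSON text with the source URL (and commit for VCS installs), or None for index installs
        "direct_url": dist.read_text("direct_url.json"),
    }


print(install_info("fpdf2"))
```

If direct_url is None after installing from GitHub, the older copy from PyPI is probably still the one being imported.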

kreier commented 3 weeks ago

Thank you @andersonhc, that was it! The fix worked. Here is my newly rendered test line with the same NotoSansKhmer-Regular.ttf:

And my initial problem is solved, too. Right align looks good:

gmischler commented 3 weeks ago

Theoretically, at any of the above steps something could go wrong. In practise, the font file, harfbuzz' interpretation of its contents, and the rendering by the viewer are most error prone. The python components primarily just pass on the information they receive.

As it turns out, passing on information can also go wrong... 😉 Thanks @simoncozens for chiming in, not surprisingly your diagnosis was much more accurate than my own. 👍

py-pdf / fpdf2

Issue on Khmer Unicode Font Subscripts #1187