seanghay / awesome-khmer-language

A large collection of Khmer language resources. Khmer is a language used by Cambodia.

Question about support for rendering Khmer text correctly in various environments #1

Open mbert opened 1 month ago

mbert commented 1 month ago

Of course this is not an "issue", merely a bunch of questions hoping that someone here may be able to help.

While learning Khmer I have also got into Khmer script. I have found that while most word processors can render Khmer texts well, this does not seem to be the case for other environments:

On the Linux command line, with Unicode Khmer fonts installed of course, rendering of consonants with subscript forms ("feet") and vowels fails: e.g. something like រ្ធី looks like រ with a "+" underneath and the vowel missing altogether. On my Mac this works quite well, even on the command line.

The same effect seems to occur when trying to render PDF files using some PDF rendering library from program code.

Just wondering: do terminal emulators or PDF renderers need some particular functionality for rendering Khmer text correctly?
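
One way to see where the "+" might come from is to list the code points in the cluster. The sketch below assumes the example above is the sequence U+179A U+17D2 U+1792 U+17B8 and uses only Python's standard `unicodedata` module; `describe` is a helper name made up for this illustration.

```python
import unicodedata

# Assumption: the cluster quoted in the question is these four code points.
CLUSTER = "\u179a\u17d2\u1792\u17b8"


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for code, name, category in describe(CLUSTER):
        print(code, name, category)
```

The second code point, KHMER SIGN COENG, marks the following consonant as a subscript. A renderer that does no shaping will probably draw it as a standalone mark (the "+") instead of merging the pair, which would match the symptom described.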

seanghay commented 1 month ago
  1. Terminal emulators require a specific kind of font to work (monospace). Currently, there's no monospace font for the Khmer language. (I got the answer from a professional Khmer typeface designer, Mr. Sovichet Tep.)
  2. In order to render Khmer text correctly, a rendering engine needs a text shaping engine that supports shaping Khmer glyphs, such as HarfBuzz (https://github.com/harfbuzz/harfbuzz), and a text layout engine that understands how to break Khmer text correctly. ICU supports breaking Khmer text via a fixed dictionary (https://unicode-org.github.io/icu/userguide/boundaryanalysis/#dictionary-based-breakiterator).
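
To illustrate the dictionary idea in point 2, here is a minimal sketch of greedy longest-match segmentation in plain Python. This is not ICU's implementation and does not use its API; the two-word dictionary and the names `segment` and `WORDS` are assumptions made up for the example. A real layout engine would use a full dictionary and would fall back to whole clusters rather than single code points.

```python
# Toy word list (hypothetical); a real engine ships a large dictionary.
WORDS = ("អរគុណ", "ច្រើន")


def segment(text, dictionary):
    """Split text into words by greedy longest-match against a word set."""
    longest = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit one code point (simplification).
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


if __name__ == "__main__":
    unspaced = "".join(WORDS)  # Khmer is normally written without spaces
    print(segment(unspaced, set(WORDS)))
```

The word boundaries found this way are the places where a layout engine may wrap a line, which is why Khmer line breaking needs a dictionary (or a similar model) at all.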
seanghay commented 1 month ago

Also check out https://sile-typesetter.org/examples/ and https://github.com/HOST-Oman/libraqm and https://pango.gnome.org/

mbert commented 1 month ago

Thank you very much for taking the time to reply! Also thank you very much for the links. As a long-time LaTeX user I will definitely take a closer look at sile!

Regarding the terminal emulator: is this a requirement specific to terminal emulators on Linux (or other Unices using Xorg or Wayland underneath)?

On MacOS my observation is that Khmer text is rendered correctly, but the font seems to be substituted whenever Khmer script occurs: actually the default font I use is a monospace font, but what I get for Khmer words does not look like one (because then each glyph's width and the distance to others would be identical which is clearly not the case in words like ថ្វីបើ).

When looking at Linux and other classical Unix systems I would expect things to be mostly similar to command line tools on MacOS, even though they use graphical rendering engines. But obviously there is a difference.

If I may, I also have a follow-up on the question regarding PDF rendering: when I want to create PDF documents programmatically (say, from Java or C# code), there are libraries for doing this, e.g. iText. With iText, for instance, it has turned out that one needs the pdfCalligraph extension (which is not free to use) to support Khmer text. I just cannot imagine that there are no open libraries capable of this (so that developers would all require commercial products). Are there any free equivalents supporting Khmer?

Apologies for bothering you with this, you don't need to reply, but being located in Europe I have struggled a lot to find a person with a technical background and experience in these things. The reply you have given already helped a lot, អរគុណ​ច្រើន!

seanghay commented 1 month ago
  1. I personally don't have a Linux machine. I've been working on macOS all the time. The terminal emulator I am using now is Kitty (switched from Alacritty) and it does render Khmer text better than most terminals.

  2. PDF rendering or image rendering is still a problem for the Khmer language. Luckily, we have sile now. For me, I use headless browsers (puppeteer) to render PDFs with line breaking programmatically. If I don't need line breaking or paragraphs, I use node-canvas, which uses cairo underneath. I just asked C# dev here, they used PDFSharp and for PHP they used FPDF.

There's no complete solution for now; however, the necessary libraries to achieve it are already here.
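
The headless-browser route can be sketched without puppeteer by using Chrome's own `--headless` and `--print-to-pdf` command-line flags. This is a rough sketch under assumptions, not the setup described above: the browser binary name (`chromium`), the font name, and the helper names `build_html` and `chrome_pdf_command` are all placeholders to adjust for your system.

```python
import html
from pathlib import Path


def build_html(text, font_family="Noto Sans Khmer"):
    """Wrap text in a minimal HTML page tagged as Khmer (lang="km")."""
    # font_family is an assumption: use any Khmer font installed locally.
    return (
        '<!DOCTYPE html>\n'
        '<html lang="km">\n'
        '<head><meta charset="utf-8">\n'
        f"<style>body {{ font-family: '{font_family}'; }}</style>\n"
        '</head>\n'
        f'<body><p>{html.escape(text)}</p></body>\n'
        '</html>\n'
    )


def chrome_pdf_command(html_path, pdf_path, browser="chromium"):
    """Build the argument list for a headless Chrome/Chromium PDF export."""
    # The binary name differs per platform (assumption: "chromium" is on PATH).
    return [
        browser,
        "--headless",
        f"--print-to-pdf={pdf_path}",
        Path(html_path).resolve().as_uri(),
    ]


if __name__ == "__main__":
    page = Path("khmer.html")
    page.write_text(build_html("អរគុណច្រើន"), encoding="utf-8")
    # Pass this list to subprocess.run(...) to actually produce the PDF.
    print(chrome_pdf_command(page, "khmer.pdf"))
```

The point of this design is that the browser does the shaping and line breaking, so the program only has to produce HTML.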

Projects worth checking out


Most engines are able to shape Khmer glyphs, except OpenType.js, so I think the remaining issues are related to line breaks, which can be solved by using ICU BreakIterator.

I've always been interested in text rendering, and I've attempted multiple times to build a working Khmer text layout engine but ended up abandoning the project.

mbert commented 1 month ago

Thank you so much for your help!

lukasf commented 3 weeks ago

I just asked C# dev here, they used PDFSharp and for PHP they used FPDF.

I've just stumbled over this post. I tried to generate a Khmer PDF with PDFSharp and it does not render correctly. Can you confirm that your colleagues use PDFSharp for rendering Khmer texts?

This is one example output I got using PDFSharp:

[Image: Khmer_Wrong]

This is how it should look (exact same string and same font, but rendered with a different, commercial lib):

[Image: Khmer_Correct]

It does not matter which Khmer font I try. All fonts work fine with Khmer in other applications (e.g. Word). Errors always seem to occur where multiple signs should be combined into a composite sign. Then the second one gets a "+" below and the first one stays where it is, instead of really combining. This sounds very much like the issue the OP has in the console.

seanghay commented 3 weeks ago

@lukasf I've confirmed with them. They created the PDF file by creating an image bitmap first and wrapped the image in the PDF file using PDFSharp.

I'm not familiar with PDFSharp so I don't know what text shaping engine they are using. I can tell it's not HarfBuzz.

mbert commented 3 weeks ago

That's interesting. So if developers go down the route of rendering images and embedding them in PDF, this seems to indicate that there is no free solution to this problem available (as we've seen there are commercial solutions, but not every organisation is able or willing to go along with them)?

seanghay commented 3 weeks ago

I'm not sure about the C# ecosystem. I think that because big companies use C# to generate reports and invoices, these library maintainers can benefit from it, and that's totally fine.

PDFSharp would be able to shape Khmer text correctly if it integrated HarfBuzz, as SkiaSharp does.