Burmese script marking character placement problem with Harfbuzz / Speedata

Nainoia-Inc-Admin commented 1 month ago

This discussion, https://github.com/harfbuzz/harfbuzz/discussions/4784, has now moved here and continues below.

Last relevant comments copied below...

Thanks for working on this. I see that the Unicode marks no longer clobber each other, though they are not stacked properly yet either. I also see that other Unicode marks are affected, or should be affected, by this new rule as well. This might be a much bigger deal.

Here is a picture of the culprit word using Padauk50002 with marks not stacked and rolling into the margin.

Padauk50002 sample

Here it is properly displayed with mmrtext.ttf

mmrtext sample

I am also noticing other errors in words in Matthew 4 with problems. Another word from Matthew 4 with Padauk50002

Padauk50002 other bad

Yet here it is displayed in mmrtext.ttf

mmrtext other good

A summary of the problem is that when there are multiple marking characters, the mmrtext.ttf font stacks them so they are all there without clobbering each other. However, Padauk50001 allowed marks to clobber each other, and now Padauk50002 prevents clobbering but does not stack the marks with the associated letter.

What info can I provide to help?

Nainoia-Inc-Admin commented 1 month ago

The question was asked...

Not all of this is directly relevant for Padauk and I don't think we have anyone that has enough permissions on both repos to migrate wholesale. I'd just open a new issue picking up the topic and link back to this for context for how we got here.

What is that screen shot of typeset text from? A browser or something else?

The screen shots are from both a browser and from a PDF file created by Speedata / Harfbuzz. But I can assure you that the Padauk font results in the same error in both the Speedata produced PDF and in the FF browser. And that the Microsoft mmrtext.ttf font correctly displays in both the Speedata PDF and the FF browser.

Nainoia-Inc-Admin commented 1 month ago

The Speedata developer also commented...

Meanwhile wouldn't the issue of "rolling into the margin" as you put it be a problem with the typesetting engine not taking the shape reported by Harfbuzz into account when justifying the line, not an issue with the font at all?

The speedata Publisher takes the advance width into account, not any glyph widths. And the advance width reported by harfbuzz is 0. So I am not sure how this should be handled.

See original comment here https://github.com/harfbuzz/harfbuzz/discussions/4784#discussioncomment-10010331

devosb commented 1 month ago

It looks like with this GitHub issue, we don't have threads like with the previous discussion. I assume this is a difference between issues and discussions, but if not, I am happy to enable a setting to have threaded issues.

With OpenType, for marks that are positioned (which includes the two smaller glyphs under a base glyph in all the examples above, AFAICT), the advance width of the mark glyph is set to zero, even if the mark glyph originally has width. The solution is to add an advance width to the cluster. We do this in other parts of the font (like the solution for another medial form); we just missed this situation. Does that clarify the situation for @alerque and @pgundlach?

Adding an advance width might be needed regardless of whether stacking marks (which is what I think mmrtext.ttf is doing) or a ligature is used. If the base glyph is wide enough that the stacking marks or ligature do not extend to the right of the advance width of the base glyph, then no advance width needs to be added.
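As a rough illustration of the arithmetic described above (not how Padauk or HarfBuzz actually implement it), the extra advance a cluster needs is the amount by which the rightmost ink extends past the final pen position. The `PositionedGlyph` fields and the `extra_advance_needed` helper are hypothetical names, the numbers are made up, and the sketch assumes each glyph's ink starts exactly at its offset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PositionedGlyph:
    name: str
    x_advance: int  # advance reported by the shaper (0 for an attached mark)
    x_offset: int   # horizontal offset from the current pen position
    ink_width: int  # width of the drawn outline (hypothetical value)


def extra_advance_needed(cluster: List[PositionedGlyph]) -> int:
    """Return how far the cluster's ink sticks out past its total advance."""
    pen = 0
    right_edge = 0
    for glyph in cluster:
        # Right edge of this glyph's ink, measured from the cluster origin.
        right_edge = max(right_edge, pen + glyph.x_offset + glyph.ink_width)
        pen += glyph.x_advance
    return max(0, right_edge - pen)


# A base glyph followed by a wide stacked mark with zero advance.
wide = [
    PositionedGlyph("base", x_advance=600, x_offset=0, ink_width=580),
    PositionedGlyph("stacked_mark", x_advance=0, x_offset=-200, ink_width=400),
]
# The same base with a mark narrow enough to stay under it.
narrow = [
    PositionedGlyph("base", x_advance=600, x_offset=0, ink_width=580),
    PositionedGlyph("stacked_mark", x_advance=0, x_offset=-500, ink_width=300),
]
print(extra_advance_needed(wide), extra_advance_needed(narrow))
```

A layout engine that only sums advances, as speedata Publisher is described as doing, would see 600 units for both clusters; the first value is what would have to be added in the font for the mark to stay out of the margin.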

Nainoia-Inc-Admin commented 1 month ago

Would it be simplest to just look at how mmrtext.ttf handles it and do the same? It should be in your Win11 fonts.

devosb commented 1 month ago

I know how mmrtext.ttf does it; the question is how much work it would be to replicate that behaviour in Padauk.

You can help by providing examples (you can save them up, you don't need to make a post each time you find one) of where stacking needs to be. I need the codepoints (including the base character) of the characters used, so either copy the text into your post or use something like Sploot! to see the codepoints in the text. I suspect all the examples you will find will be of the form CHCHC where C is a consonant and H is U+1039 MYANMAR SIGN VIRAMA.
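If a codepoint viewer is not at hand, a few lines of Python can list the codepoints of a pasted word. This is only a sketch using the standard `unicodedata` module; the `codepoints` helper is a hypothetical name, and the sample string is a CHCHC sequence written with escapes so it does not depend on font fallback.

```python
import unicodedata


def codepoints(text: str) -> list:
    """Return one 'U+XXXX NAME' entry per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# RA, VIRAMA, DA, VIRAMA, DHA: a consonant with two stacked consonants below.
sample = "\u101b\u1039\u1012\u1039\u1013"
for line in codepoints(sample):
    print(line)
```

Pasting the printed lines into a post alongside the word itself gives both the text and its codepoints, including the base character.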

In some of the posts you mention using a browser, and sometimes that gets clarified as Firefox. It would help if you always specify Firefox (or Chrome, or Edge, or Safari) and the OS (such as Windows 11, macOS, Android, etc).

devosb commented 1 month ago

Your comments about different shaping depending on where in the line the word is make sense. I just cannot replicate that behaviour, which means I am unable to test whether any fix I make resolves the original issue. In my testing, the difference in shaping depends on whether Graphite or OpenType shaping is used. Using what is essentially the Padauk 5.001 font (before the recent fix to Graphite) with XeTeX from TeX Live 2023 on Ubuntu 24.04:

Graphite

gr1

OpenType

ot1

With the Padauk 5.002 (only two zeros, just like two zeros in 5.001 above)

Graphite

gr2

OpenType

ot2

The Graphite and OpenType shaping are still a bit different. The OpenType has more space around the second medial. This is why I suggest testing with different browsers.

If you use Padauk 5.001 from Google Fonts, that font has had the Graphite tables removed, so you should get the OpenType shaping everywhere (even with Firefox). I understand the OpenType shaping is still not ideal, but it is more readable than either Graphite shaping example.

In this case, the solution seems clear (stack the medials, which are marks in OpenType), so maybe I don't need to understand the position-in-line part of the issue. But it is still a puzzle to me.

Nainoia-Inc-Admin commented 1 month ago

I think you are correct that if we focus on stacking all the needed medials under the associated letter, we are on track to the solution. As for Speedata PDF versus Firefox versus any other browser, I frankly do not see any consequential difference in the rendering when using the same font. The software does not make the difference; the font I use does, whether Padauk5.001, Padauk5.002, or Microsoft mmrtext.ttf. The Google fonts are out of the equation for me. I know they don't work properly and I have much greater hope talking with you to get Padauk working. Microsoft mmrtext.ttf is useful because it does work in all the browsers and Speedata, and so maybe it can help us get Padauk working.

Concerning the medial that overflows into the margin. Do not make too big a deal about that. That only happens in Speedata/Harfbuzz PDFs because only that application has a rigid hard line margin and Speedata does not have enough information from the font to be aware that the medial is in the margin. Browser displays do not have full justification on the right margin and so the problem is not as big a problem in browser display. But it is a problem in the PDF.

I will try to begin building a list of problem medials, though that seems a challenge. I will have to tediously compare the text rendering using mmrtext.ttf versus Padauk5.002 to locate the problems. I guess it would not be possible to study the mmrtext.ttf font itself to read its tables to locate all the exceptions we need? Maybe that kind of reverse engineering is not allowed from a copyrighted font?

Alternatively maybe there is a Myanmar / Burmese speaker who can simply list out all the cases that we need to know about. I am in touch with the Sanskrit Bible maintainer and will direct him to this issue page. Maybe he can help us.

I know you want just one comprehensive list of medials and I will see what I can do. However, until then here are a few more cases to learn from, all from Matthew 4.

The github.com comment in FF falls back to mmrtext.ttf from my Win11 box. This website http://www.sanskritbible.in/assets/txt/burmese/40004.html in FF falls back to mmrtext.ttf from my Win11 box. This website https://www.aionianbible.org/Bibles/Sanskrit---Burmese-Script/Matthew/4 uses Padauk5.001. This website https://stage.aionianbible.org/Bibles/Sanskrit---Burmese-Script/Matthew/4 uses Padauk5.002.

This word in Matthew 4:3 is displayed differently by all three web pages above. ဘဝေသ္တရှျာဇ္ဉယာ
The mmrtext.ttf appears to be the correct rendering and when I view the medials right now in this comment it appears correct because Firefox is falling back to using mmrtext.ttf from my Win11 computer.

This word from Matthew 4:12 likewise is different in all three webpages above. တဒွါရ္တ္တာံ Again the mmrtext.ttf is sensible both on the Sanskrit website and in my comment as I type this comment because both fall back to mmrtext.ttf from my Win11 box. However, on both AionianBible pages the medials are not positioned right with Padauk 5.001 or 5.002.

This word from Matthew 4:13 same story... သီမ္နောရ္မဓျဝရ္တ္တီ

I think you are saying we need a comprehensive list of all the letters that have these double, triple, and quadruple medials associated with them. That could be a short list or perhaps very long.

devosb commented 1 month ago

Looking at mmrtext.ttf to see what medials it handles might not be allowed, as you mention. But I can make a page of all possible medials and use that font and see what it handles.

For now, the data you have found is helpful. No need to look at lots of data to visually compare. I was thinking that you might search the text with a program to find all the examples of CHCHC where C is a consonant and H is U+1039 MYANMAR SIGN VIRAMA. Well, I forgot about characters that have names starting with MYANMAR CONSONANT SIGN MEDIAL. I will have to think about that.

What do you mean by Myanmar / Burmese? Myanmar is a script and a country. Burmese is a people group and a language. Since other Myanmar script fonts do not handle two medials (except for mmrtext.ttf), I would guess the Burmese language does not need two medials, but I guess the Sanskrit language in Myanmar script does. So a Burmese speaker might not know the answer we need.

How are you counting the medials? For CHCHC, that results in a base character and two medials below. I would call that a double medial. Would you call that a double? For each of the examples above, plus the original example, how many medials do you count?

Nainoia-Inc-Admin commented 1 month ago

Oh maybe I can help with that and a little regex. Let me work on getting the unique consonant characters. I did find the sequence counts...

If you know regex...
0 counts of 4 medials: "\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}"
3 counts of 3 medials: "\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}"
2,386 counts of 2 medials: "\x{1039}[^\s\x{1039}]{1,1}\x{1039}[^\s\x{1039}]{1,1}"
44,994 counts of 1 medial: "\x{1039}[^\s\x{1039}]{1,1}"
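For anyone who wants to repeat these counts, here is a minimal Python sketch of the same patterns. Python's `re` module does not accept the `\x{1039}` form, so the virama is inserted as a literal character; the `count_runs` helper and the sample text are hypothetical. As with the patterns above, a run of three medials also contains a match for the two-medial pattern, so the counts for smaller sizes include the longer runs.

```python
import re

VIRAMA = "\u1039"
# One "medial": a virama followed by one character that is neither
# whitespace nor another virama.
PAIR = rf"{VIRAMA}[^\s{VIRAMA}]"


def count_runs(text: str, n: int) -> int:
    """Count non-overlapping runs of n consecutive virama+character pairs."""
    return len(re.findall(f"(?:{PAIR}){{{n}}}", text))


# Made-up sample: one word with two stacked consonants, one with a single one.
sample = "\u101b\u1039\u1012\u1039\u1013 \u1000\u1039\u1000"
for n in (3, 2, 1):
    print(n, "medials:", count_runs(sample, n))
```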

I was using Myanmar and Burmese as synonyms. Sorry for the confusion.

Yes you are right a Burmese speaker may not understand these medials nor the Sanskrit language. I have just contacted a Burmese fluent friend and a Sanskrit fluent associate to see what we can learn.

Yes I would call CHCHC a double also. Though some of the medial characters look like 4 little accents, so I thought maybe there were as many as four. But my regex showed that three is the max.

Is there any other "glue" unicode that I need to search for beyond \x{1039} ?

mhosken commented 1 month ago

My bad on the scope of things that move when paired under a 101B. I've reduced it now and it should also look better in graphite at least. I've also added the sequence to our sanskrit test.

Nainoia-Inc-Admin commented 1 month ago

I built a comprehensive medial sequence checker for the Sanskrit Burmese script NT. All medial sequences, defined as a character preceded by \x{1039}, are listed here https://stage.aionianbible.org/Debug/Sanskrit-Burmese. Let me know if I can do anything to improve the tool. You might need to clear your browser cache.

mhosken commented 1 month ago

Thanks for this. I wonder if I might be so bold as to ask you for a simpler form of this data: one string per line that I can run through a rendering test. I only need the 3 and 2 medials lists. Each line is a single string which is the test string. TIA

Nainoia-Inc-Admin commented 1 month ago

Okay that is added. Visit the same page and refresh, https://stage.aionianbible.org/Debug/Sanskrit-Burmese. There is a list of all the unique sequences only and also a list of the unique sequences in a sample context, meaning the sequence plus one character on either side. This helps the font to display the medials better, though even the extra character doesn't always result in the proper display because more of the word is needed for the font to render it properly in some cases.

Also note that the Microsoft font still proves to be the best. However, I noticed that in some cases when I display the sequence, even the Microsoft font does not display the medials properly because it needs the whole context of the word. My debug page only displays the sequence and the sequence plus one character on either side. Let me know if you need to see the context of the entire word and I can see what I can do. That would be much harder though.

Nainoia-Inc-Admin commented 1 month ago

My debug script has shown that the Sanskrit Burmese text uses the medial unicodes...

1000, 1001, 1002, 1003, 1005, 1006, 1007, 1009, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 1036, 1038, 1050, 1051, 100b, 100c, 100d, 100f, 101c, 101e

Note that there are holes in the numeric sequence. I don't know about medials and what characters are included, but it may be that my Sanskrit text does not include all the medials that the font should handle.
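To make the list easier to read, the hexadecimal values can be turned into Unicode character names, and the holes listed explicitly. This is a sketch using the standard `unicodedata` module; `describe` and `holes` are hypothetical helper names, and treating U+1000 through U+1020 as the consonant range to check is an assumption of this example.

```python
import unicodedata

# Hexadecimal code points copied from the list above.
REPORTED = (
    "1000, 1001, 1002, 1003, 1005, 1006, 1007, 1009, 1010, 1011, 1012, "
    "1013, 1014, 1015, 1016, 1017, 1018, 1019, 1036, 1038, 1050, 1051, "
    "100b, 100c, 100d, 100f, 101c, 101e"
)


def parse(hex_list: str) -> list:
    """Parse a comma-separated list of hex values into sorted integers."""
    return sorted(int(h, 16) for h in hex_list.replace(",", " ").split())


def describe(hex_list: str) -> list:
    """Return (code point, Unicode name) pairs in numeric order."""
    return [
        (f"U+{c:04X}", unicodedata.name(chr(c), "<unnamed>"))
        for c in parse(hex_list)
    ]


def holes(hex_list: str, first: int = 0x1000, last: int = 0x1020) -> list:
    """List code points in [first, last] that do not appear in hex_list."""
    seen = set(parse(hex_list))
    return [f"U+{c:04X}" for c in range(first, last + 1) if c not in seen]


for code, name in describe(REPORTED):
    print(code, name)
print("Not seen in the text:", ", ".join(holes(REPORTED)))
```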

devosb commented 1 month ago

Thank you for the fantastic test data, that is very helpful. I don't think we need the whole word at this point.

Different languages might have different medials, just like different languages in Latin script vary as to what diacritics exist and go with what letter.

A medial is part of a consonant cluster. Consonants have an inherent vowel sound; that is why the name (and roughly the sound) of the first character is KA, not K. To get K, you have two characters, KA, VIRAMA. So the character sequence ရ္ဒ္ဓ (RA-VIRAMA-DA-VIRAMA-DHA) sounds roughly like R-D-DHA. More details are in the Unicode standard.

So your first example of a triple medial ends with U+1039 MYANMAR SIGN VIRAMA, U+1036 MYANMAR SIGN ANUSVARA. I would have expected a consonant in place of the ANUSVARA, so I suspect that sequence is a typo.

The stacking medials have now been improved, you can download the latest build of the font. It will still say version 5.002.

Nainoia-Inc-Admin commented 1 month ago

Okay, your latest build is loaded in my test page, https://stage.aionianbible.org/Debug/Sanskrit-Burmese. It seems like the original problem sequence is corrected. Though maybe there are others from the unabridged list that still need work?

Nainoia-Inc-Admin commented 1 month ago

For further complication here is what my Burmese speaking friend says...

===

What I understood from the link is that the Padauk font is incorrectly stacking. To me the one indicated "Incorrect" and "correct" are both bad. I have never seen triple stacked consonants in Burmese language even though the link you sent indicates that it is stacking correctly by mmrtext.ttf. Someone more proficient in Burmese language than I might say otherwise. My Burmese is more of a high school level. I have a Burmese Bible and checked there and not finding the words above in Matthew 4. I also looked up other versions of the Burmese Bibles online and not finding those words in any of them. So, I am wondering if examples in the link could be from the Bible for a people group called Karen in Myanmar. Their language looks similar to Burmese.
Not sure what you meant by accent marks but assuming you meant what is known as vowels and medials used for tones in Burmese. Burmese language in general uses and needs vowels and medials and not just for a particular font type.

A word is typically structured with one or more consonants with one or more vowels and medials. A missing vowel and or medial can change the meaning of a word. It is possible for some words to have two consonants stacked.

The space between words are not critical but it is not good to add a space between a consonant, vowels and medials. Also, you would not want to break consonants from vowels for wrapping to the next line.

===

I am currently corresponding with her for more explanation about better placement of the medials.

Nainoia-Inc-Admin commented 1 month ago

This document may also help, https://www.loc.gov/catdir/cpso/romanization/burmese.pdf

burmese.pdf

devosb commented 3 weeks ago

All the needed medials for Sanskrit were added. The positioning can be improved, but you can test the font as is.

Nainoia-Inc-Admin commented 3 weeks ago

In my initial tests of build #613, the medial placement is excellent with the Speedata / Harfbuzz rendering engine. Medials are stacked nicely without clobbering one another and none overflow into the margin. There are a few cases where the medials could be nudged a bit to better align with each other and with their letter, and one case where medials did clash. For example,

Screenshot 2024-07-30 232700 Screenshot 2024-07-30 232749 Screenshot 2024-07-30 232831 Screenshot 2024-07-30 233010

Here is the whole document with Padauk5.002 https://stageresources.aionianbible.org/Holy-Bible---Sanskrit---Burmese-Script---Aionian-Edition.pdf

Now strangely the HTML was not as good. When tested in FF, Edge, and Chrome the medials were better than before, but still not as good as the Microsoft font or as good as the rendering in Speedata / Harfbuzz. You can see examples yourself online. The middle word of Matthew 4:12 is a good example. When there are multiple medials, some slide too far to the right.

https://stage.aionianbible.org/Bibles/Sanskrit---Burmese-Script/Matthew/4

The debug tool is also still available, though I removed Padauk5.001 from the display and show Padauk5.002 only. https://stage.aionianbible.org/Debug/Sanskrit-Burmese

Thanks for all the good work and let me know how I can help further.

Nainoia-Inc-Admin commented 2 days ago

Any progress with further repair to the font?

silnrsi / font-padauk

Burmese script marking character placement problem with Harfbuzz / Speedata #52