Extended Latin and Viet subsets missing many characters

jvgaultney commented 1 year ago

This is the fourth place I've submitted this issue in the last few months, as there is still no progress. See also https://github.com/google/fonts/issues/5385 https://github.com/google/fonts/issues/3756 https://github.com/googlefonts/lang/issues/30

A large number of extended Latin and Vietnamese characters are not displaying properly. These characters are being displayed with fallback fonts even though the fonts support them.

In the following screenshots LPR = local path-referenced font, GF = Google Font with subset=latin-ext,cyrillic-ext,vietnamese, FLO = our own internal font server. Screenshots are from current Chrome on Windows 10.

Three specific examples:

1) Vietnamese text properly renders the Vietnamese diacritic forms when lang='vi' is set. However, certain combinations with dot below are using fallback fonts. Character string in example: Ấấ Ầầ Ẩẩ Ẫẫ Ắắ Ằằ Ẳẳ Ẵẵ Ếế Ềề Ểể Ễễ Ốố Ồồ Ổổ Ỗỗ Phải áp dụng chế độ giáo dục miễn phí, ít nhất là ở bậc tiểu học và giáo dục cơ sở

[Screenshot]

2) Extended Latin does not seem to include some important diacritics, such as U+0329, and again fallback fonts are used. Example from Yoruba language UDHR. Character string in example: E̩nì kò̩ò̩kan ló ní è̩tó̩ láti kó̩ è̩kó̩. Ó kéré tán, è̩kó̩ gbo̩dò̩ jé̩ ò̩fé̩ ní àwo̩n ilé‐è̩kó̩ alákò̩ó̩bè̩rè̩. E̩kó̩ ní ilé‐è̩kó̩ alákò̩ó̩bè̩rè̩ yìí sì gbo̩dò̩ jé̩ dandan. A gbo̩dò̩ pèsè è̩kó̩ is̩é̩‐o̩wó̩, àti ti ìmò̩‐è̩ro̩ fún àwo̩n ènìyàn lápapò̩. Àn fàní tó dó̩gba ní ilé‐è̩kó̩ gíga gbo̩dò̩ wà ní àró̩wó̩tó gbogbo e̩ni tó bá tó̩ sí.

[Screenshot]

3) Many common diacritics, like ogonek, are not displaying properly. Character string in example: ọ o̧ ǫ ô o˞ o̝̠̣ ô͑ n f i fi f l fl ˥ ˦ ˧ ˨ ˩ ˥˥ ˥˦ ˥˧ ˥˨ ˥˩ ˥˨˥ ˥˨˦ ˥˨˧ ˥˨˨ ˥˨˩

[Screenshot]
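
A minimal sketch, assuming Python's standard `unicodedata` module, of how to list the combining marks that a sample string needs once it is fully decomposed. The helper name `combining_marks` is hypothetical, and `sample` is only the first two words of the Yoruba string from example 2, written with escapes so the code points are unambiguous.

```python
import unicodedata


def combining_marks(text):
    """Return the distinct combining marks in text (after NFD), in first-seen order."""
    seen = []
    for ch in unicodedata.normalize("NFD", text):
        # General categories Mn, Mc and Me are the Unicode "mark" categories.
        if unicodedata.category(ch).startswith("M") and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in seen]


# First two words of the Yoruba sample (hypothetical variable name).
sample = "E\u0329n\u00ec k\u00f2\u0329\u00f2\u0329kan"

for codepoint, name in combining_marks(sample):
    print(codepoint, name)
```

Any mark in this list that a served subset does not contain would have to come from a fallback font.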

simoncozens commented 1 year ago

Very few of the U+03XX combining marks appear in any of the Google Fonts glyphsets, so they will all be stripped out of fonts served via GF. We could make piecemeal PRs adding combining marks into the Latin and Vietnamese and extended Latin and whatever various other script sets use them, but it feels really yucky; it's clearly symptomatic of a larger problem. However, the engineering team sees a lot of benefit in subsetting fonts, so I'm not sure how to solve that larger problem.

simoncozens commented 1 year ago

(See also googlefonts/nam-files#7. There are a huge number of fonts on GF which offer these combining marks, but they can't be used.)

jvgaultney commented 1 year ago

Well that's a non-answer. We know it's not working, and that the combining marks are not getting included, and that it's one symptom of a larger systemic problem with GF.

However, we just need something that works, even if it feels yucky to you. Even if only the more common combining diacritics were added it would make GF useful for many more languages. The lack of basic Vietnamese support is really embarrassing, when the fix is trivial.

thlinard commented 1 year ago

This is a screenshot of Roboto on https://fonts.google.com/specimen/Roboto?subset=vietnamese&noto.script=Latn (sample in Vietnamese):

Same situation for every font with Vietnamese support (.notdef displayed for ịửỡ in standard sample text).

garretrieger commented 1 year ago

FYI I made an update for this issue in googlefonts/glyphsets#102. Since this affects many families it may take a bit to get the fix rolled out to each family. For now I've already updated Noto Sans, Andika, Charis SIL, and Gentium Plus with the fixed subset definitions.

thlinard commented 1 year ago

Hi @garretrieger

The fix is incomplete:

Example with Andika, from the API:

Andika downloaded and displayed on desktop:

Displaying other fonts is still problematic:

garretrieger commented 1 year ago

We had to partially roll back some of the fixes due to https://github.com/google/fonts/issues/6245. The problem is that the combining marks are present in the latin, latin extended, and vietnamese subsets. Selecting the subset to load/use for a particular occurrence of a combining mark is up to the browser, and sometimes it doesn't use the right one.

We're experimenting with different subset definitions + unicode range setups to try and find something that works for all cases, but this is difficult. You end up fixing one case, but causing another to break.

I'm currently working on assembling a test suite that tries to cover as many of the different cases as possible, so that we can evaluate potential fixes and make sure we don't regress anything.

Could you provide the specific codepoint sequences you used for the above iuo case? I'll add it to the test suite.

For Roboto, we haven't pushed updated subset definitions yet and likely won't until it's upgraded to the variable version. Unfortunately, the way the layout rules are set up on the static version of Roboto causes its subset sizes to increase massively when the additional combining marks are introduced. This issue has been fixed in the upcoming variable version of the font.

thlinard commented 1 year ago

Thanks for the information.

For the sequences, I simply copied the problematic characters in the sample text from "Select preview text > Asia > Vietnamese", i.e.:

ị (0069 LATIN SMALL LETTER I + 0323 COMBINING DOT BELOW)
ĩ (0069 LATIN SMALL LETTER I + 0303 COMBINING TILDE)
ỉ (0069 LATIN SMALL LETTER I + 0309 COMBINING HOOK ABOVE)
ắ (0103 LATIN SMALL LETTER A WITH BREVE + 0301 COMBINING ACUTE ACCENT)
ẫ (00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX + 0303 COMBINING TILDE)
ụ (0075 LATIN SMALL LETTER U + 0323 COMBINING DOT BELOW)
ử (01B0 LATIN SMALL LETTER U WITH HORN + 0309 COMBINING HOOK ABOVE)
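
The sequences listed above can be reproduced with Python's standard `unicodedata` module. This is a sketch that assumes the sample text uses the precomposed characters. The helper name `describe` is hypothetical. It reports both the one-step canonical decomposition and the full NFD sequence, since a font or subset may need the marks from either form.

```python
import unicodedata


def describe(ch):
    """Return (one-step canonical decomposition, full NFD sequence) as hex strings."""
    one_step = unicodedata.decomposition(ch)
    full = " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch))
    return one_step, full


# Precomposed forms of the seven characters from the list above.
for ch in "\u1ecb\u0129\u1ec9\u1eaf\u1eab\u1ee5\u1eed":
    one_step, full = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: one step = {one_step}; NFD = {full}")
```

For ử, the one-step form is 01B0 0309, while the full decomposition also exposes the combining horn U+031B.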

Results vary from font to font. For example, on Lora, a VF, the results are good in Italic, bad in Roman:

garretrieger commented 1 year ago

I've been trying to reproduce your Andika example and haven't been able to: https://codepen.io/garretrieger/pen/XWyKaZq

What browser are you using?

garretrieger commented 1 year ago

This is what I get for that example: [Screenshot 2023-06-19 at 6 04 01 PM]

moyogo commented 1 year ago

@garretrieger U+031B is used in ử (0075 031B 0309) but it is not in the vietnamese set in https://fonts.googleapis.com/css?family=Andika. Chrome shows the example correctly but Safari and Firefox do not.

Firefox:

Safari:

There also seem to be others missing: https://github.com/googlefonts/glyphsets/pull/110#issuecomment-1598123546
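
The kind of check described in this comment (does a given code point fall inside a subset's `unicode-range`?) can be sketched in Python. Everything in `SAMPLE_CSS` is hypothetical: the family name, the file names, and the ranges are made up for illustration and are not the real Andika CSS. The sketch also assumes that each `@font-face` block is preceded by a comment naming its subset; the function names are likewise illustrative.

```python
import re

# Hypothetical CSS; the ranges are NOT the real Google Fonts subset definitions.
SAMPLE_CSS = """
/* vietnamese */
@font-face {
  font-family: 'Example';
  src: url(example-vietnamese.woff2) format('woff2');
  unicode-range: U+0102-0103, U+0300-0301, U+0303, U+0309, U+0323, U+1EA0-1EF9, U+20AB;
}
/* latin-ext */
@font-face {
  font-family: 'Example';
  src: url(example-latin-ext.woff2) format('woff2');
  unicode-range: U+0100-024F, U+0300-0301, U+0329, U+1E00-1EFF;
}
"""


def parse_unicode_range(value):
    """Parse a CSS unicode-range value into a list of (start, end) tuples."""
    ranges = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        body = token[2:]  # drop the leading "U+"
        if "-" in body:
            start, end = body.split("-")
            ranges.append((int(start, 16), int(end, 16)))
        elif "?" in body:
            # Wildcard form such as U+4?? covers U+0400 to U+04FF.
            ranges.append((int(body.replace("?", "0"), 16),
                           int(body.replace("?", "F"), 16)))
        else:
            ranges.append((int(body, 16), int(body, 16)))
    return ranges


def subsets_covering(css, codepoint):
    """Return the labels of the @font-face blocks whose unicode-range includes codepoint."""
    hits = []
    pattern = r"/\*\s*([\w-]+)\s*\*/\s*@font-face\s*\{(.*?)\}"
    for label, block in re.findall(pattern, css, re.S):
        match = re.search(r"unicode-range:\s*([^;]+);", block)
        if match and any(lo <= codepoint <= hi
                         for lo, hi in parse_unicode_range(match.group(1))):
            hits.append(label)
    return hits


for cp in (0x0301, 0x031B, 0x0329):
    print(f"U+{cp:04X}:", subsets_covering(SAMPLE_CSS, cp))
```

A code point that maps to no subset has to come from a fallback font. A code point that maps to several subsets leaves the choice of subset to the browser.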

thlinard commented 1 year ago

What browser are you using?

Firefox 114.0.1 on macOS 13.4.

googlefonts / nam-files

Extended Latin and Viet subsets missing many characters #6