Language pickers shows tofu, needs font support

imnasnainaec commented 11 months ago

The MuiLanguagePicker has some characters that aren't supported by our default UI font. For example, see the results in a search for ~"yan"~ "kyu":

[x] Identify what fonts are needed to cover all characters in the language picker
[ ] Impose fonts on MLP from the outside (with MUI themes?)
[ ] Load necessary fonts when the MLP opens and discard them when it closes

imnasnainaec commented 6 months ago

Used ChatGPT to slap together a Python script to extract every character used in the localname and localnames fields of https://raw.githubusercontent.com/sillsdev/mui-language-picker/master/src/data/langtags.json (whose content is from https://github.com/silnrsi/langtags/blob/master/pub/langtags.json):

" ' , - : A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z ² À Á Ã Å È É Ê Ì Ð Ñ Ò Ó Ö à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý ā ă ą ć Č č đ ē ĕ ė ę ě ħ Ĩ ĩ Ī ī ĭ ı ļ Ł ł Ŋ ŋ Ō ō ś ŝ ş š ũ ū ŭ ů ų ŵ Ž ž Ɓ Ɔ Ɗ Ə Ɨ ơ ǀ ǎ ǝ ǩ ǫ ȟ ȯ Ɂ ɐ ɓ ɔ ɗ ə ɛ ɣ ɨ ɩ ɬ ɵ ɽ ʉ ʋ ʌ ʔ ʷ ʹ ʻ ʼ ʾ ˀ ˊ ˯ ̀ ́ ̂ ̃ ̄ ̇ ̈ ̌ ̢ ̣ ̧ ̨ ̰ ̱ ̲ ̶ ́ Ε Ν Π ά έ ή ί α ε η ι κ λ ν ο ρ σ τ ό ϯ А Б Г Д З К М Н О С Т У Х Ц Ч Ш Ю Я а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш ъ ы ь э я ѓ і ї ј ў ѣ ғ ҕ Қ қ ҡ ң ҧ ү ҷ ҹ Ӏ ӄ Ӈ ӈ ӏ ӑ ӗ ә ӣ ӧ ӹ Ԓ ԓ ա ե է հ մ յ ն տ ր ւ ְ ֲ ִ ַ ָ ֹ ּ ־ ׁ א ב ג ד ה ו ח י ל מ ן נ ס ע פ ק ר ש ת آ ؤ ئ ا ب ة ت ج ح خ د ذ ر ز س ش ص ط ع غ ف ق ك ل م ن ه و ى ي َ ُ ِ ْ ٛ ٜ ٲ ٽ پ چ ڈ ڌ ڍ ڑ ښ ڢ ک ڪ گ ھ ہ ۆ ۇ ۊ ی ێ ې ە ܐ ܘ ܝ ܠ ܢ ܣ ܪ ܫ ܬ ހ ބ ވ ދ ސ ަ ި ެ ް ँ ं ः अ आ इ ई ऊ क ख ग घ ङ च छ ज झ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ऱ ल ळ व श ष स ह ़ ा ि ी ु ू ृ े ै ॉ ो ौ ् ड़ ॱ ঁ ং অ ই উ ক গ চ ছ জ ট ড ণ ত দ ন প ব ভ ম য র ল শ স হ ় া ি ী ু ৃ ে ৈ ো ্ ৰ ਜ ਪ ਬ ਭ ਸ ਼ ਾ ੀ ੰ ં આ ક ગ ચ છ જ ડ ત દ મ ય ર વ સ ા િ ી ુ ્ ଆ ଇ ଉ ଓ କ ଙ ଜ ଟ ଡ ଣ ଦ ପ ବ ମ ର ଳ ଶ ସ ଼ ା ି ୀ ୁ େ ୋ ୍ இ க ச ட ண த ன ப ம ய ர ற ள ழ ா ி ு ெ ொ ௌ ் ం ఆ ఎ ఒ క గ జ డ త ద బ య ర ఱ ల వ స ా ి ు ె ొ ో ్ ಕ ಗ ಡ ತ ನ ಬ ಭ ಳ ವ ಷ ಾ ು ೆ ೊ ್ ം ക ഗ ഡ ണ പ മ യ റ ല ള ാ ി ു ് ං ල ස හ ි ก ข ค ง ช ซ ญ ต ถ ท น บ ป ผ พ ภ ม ย ร ล ว ษ ส อ ฮ ะ ั า ำ ิ ี ื ุ ู เ ใ ไ ่ ้ ์ ກ ຍ ຕ ບ ພ ມ ຣ ລ ວ ສ ຫ ອ າ ຶ ຸ ູ ໂ ້ ໌ ་ ། ཀ ཁ ག ང ཆ ད བ མ འ ཡ ར ལ ས ི ོ ྐ ྗ ྟ ྫ ྭ ྱ ྲ က ခ င စ ဆ ည တ ဒ န ပ ဖ ဘ မ ယ ရ လ ဝ သ အ ဢ ါ ာ ိ ီ ု ေ ဲ ံ ့ း ် ြ ွ ှ ၠ ၤ ၵ ၸ ႃ ႆ ႈ ႏ ႝ ა ე თ ი ლ ნ რ უ ქ შ ሃ ሊ ላ ል ሙ ማ ም ረ ሪ ራ ር ሮ ሰ ስ ሶ ቡ ባ ቤ ብ ቱ ታ ት ኃ ነ ን ኖ ኛ አ ኡ ኢ ኣ ኬ ኾ ወ ዋ ው ዓ ዕ ዝ የ ይ ዳ ዴ ጉ ጊ ጌ ግ ጎ ጚ ጛ ጤ ፈ ፋ ፟ ፡ Ꭶ Ꭹ Ꭿ Ꮃ Ꮒ Ꮝ Ꮧ Ꮳ Ꮼ ᐃ ᐄ ᐅ ᐊ ᐍ ᐏ ᐣ ᐤ ᐦ ᐧ ᐱ ᑎ ᑐ ᑦ ᑲ ᒃ ᒧ ᒨ ᓀ ᓂ ᓃ ᓄ ᓅ ᓇ ᓐ ᓕ ᓖ ᔅ ᔑ ᔨ ᔪ ᔫ ᔭ ᖅ ᖬ ខ គ ង ត ន ព ម រ ឞ ឹ ូ ួ ែ ំ ្ ៝ ᥑ ᥒ ᥖ ᥨ ᥬ ᥭ ᥰ ᥳ ᦅ ᦑ ᦟ ᦹ ᦺ ᧄ ᧉ ᱛ ᱟ ᱤ ᱥ ᱱ ᱲ ᴐ ᶉ ḇ ḍ Ḏ ḓ ḥ ṇ ṣ ṭ ṯ ṹ ạ ẹ Ẽ ẽ ế ệ Ị ị Ọ ọ ụ ỹ Ἑ ‌ ‍ ‑ ‘ ’ ” ‧ ‬ ⁴ ↄ ⲉ ⲏ ⲓ ⲙ ⲛ ⲣ ⲧ ⲭ ⴰ ⴳ ⴼ ⵃ ⵆ ⵉ ⵌ ⵍ ⵎ ⵏ ⵓ ⵔ ⵖ ⵛ ⵜ ⵡ ⵢ ⵣ ⵥ アイウグタチナヌーㇰ中佒壮壯徳文日本粤粵繁語语靖體 ꆈ ꉙ ꌠ ꓡ ꓢ ꓲ ꓴ ꔤ ꕙ ꞌ ꤊ ꤛ ꤜ ꤟ ꤢ ꤤ ꤬ ꤭ 
ꩫ ꩱ ꬃ 국 어 한 ﬞ ﯣ 𑃐 𑃚 𑃝 𑄋 𑄌 𑄟 𑄦 𑄳 𑄴 𞤆 𞤢 𞤤 𞤪 𞤵
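
The script itself was not posted. A minimal sketch of that kind of extraction might look like the following. It assumes langtags.json is a list of objects in which `localname` is a string and `localnames` is a list of strings; the function name `collect_chars` is hypothetical.

```python
import json
from typing import Any, Iterable


def collect_chars(entries: Iterable[dict[str, Any]]) -> list[str]:
    """Return the sorted, de-duplicated characters found in autonym fields.

    Assumes each entry may carry a "localname" string and/or a
    "localnames" list of strings; missing or null fields are skipped.
    """
    chars: set[str] = set()
    for entry in entries:
        names = [entry.get("localname") or ""]
        names.extend(entry.get("localnames") or [])
        for name in names:
            chars.update(name)
    chars.discard(" ")  # the space only separates items in the printed list
    return sorted(chars)


if __name__ == "__main__":
    # Assumes a local copy of the file has been downloaded beforehand.
    with open("langtags.json", encoding="utf-8") as f:
        print(" ".join(collect_chars(json.load(f))))
```

Sorting by code point is what puts the output in block order (Latin first, Adlam last).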

imnasnainaec commented 6 months ago

The above characters are from the following Unicode ranges:

20-2d, 3a-7a Basic_Latin
b2, c0-d6, e0-ff Latin-1_Supplement
100-11b, 127-131, 13c-14d, 15b-17f Latin_Extended-A
180-197, 1a1, 1c0, 1ce, 1dd, 1e9-1eb, 21f, 22f, 241 Latin_Extended-B
250-26c, 275-27d, 289-294 IPA_Extensions
2b7-2c0, 2ca, 2ef Spacing_Modifier_Letters
300-30c, 322-336, 341 Combining_Diacritical_Marks
395-3a0, 3ac-3cc, 3ef Greek_and_Coptic
410-463, 493-4d9, 4e3-4e7, 4f9 Cyrillic
512-513 Cyrillic_Supplement
561-567, 570-576, 57f-582 Armenian
5b0-5c1, 5d0-5ea Hebrew
622-652, 65b-65c, 672, 67d-691, 69a-6af, 6be-6d5 Arabic
710-72c Syriac
780-790, 7a6-7b0 Thaana
901-90a, 915-94d, 95c, 971 Devanagari
981-989, 995-9cd, 9f0 Bengali
a1c, a2a-a2d, a38-a40, a70 Gurmukhi
a82-a86, a95-ac1, acd Gujarati
b06-b09, b13-b4d Oriya
b87, b95-bb4, bbe-bcd Tamil
c02-c4d Telugu
c95-c97, ca1-ccd Kannada
d02, d15-d17, d21-d33, d3e-d41, d4d Malayalam
d82, dbd-dc4, dd2 Sinhala
e01-e4c Thai
e81, e8d-eb9, ec2-ecc Lao
f0b-f0d, f40-f46, f51-f66, f7c, f90-f9f, fab-fb2 Tibetan
1000-1022, 102b-103e, 1060-1064, 1075-1078, 1083-108f, 109d Myanmar
10d0-10e8 Georgian
1203-120d, 1219-121d, 1228-1236, 1261-1265, 1271-1275, 1283, 1290-12a3, 12ac, 12be, 12c8-12dd, 12e8-12f4, 1309-130e, 131a-131b, 1324, 1348-134b, 135f-1361 Ethiopic
13a6-13b3, 13c2, 13d7, 13e3, 13ec Cherokee
1403-140f, 1423-1427, 1431, 144e-1450, 1466, 1472, 1483, 14a7-14a8, 14c0-14c7, 14d0-14d6, 1505, 1511, 1528-152d, 1585, 15ac Unified_Canadian_Aboriginal_Syllabics
1781-1784, 178f-179e, 17b9-17c6, 17d2, 17dd Khmer
1951-1956, 1968-1973 Tai_Le
1985, 1991, 199f, 19b9-19ba, 19c4-19c9 New_Tai_Lue
1c5b-1c65, 1c71-1c72 Ol_Chiki
1d10 Phonetic_Extensions
1d89 Phonetic_Extensions_Supplement
1e07-1e13, 1e25, 1e47, 1e63, 1e6d-1e6f, 1e79, 1ea1, 1eb9-1ecd, 1ee5, 1ef9 Latin_Extended_Additional
1f19 Greek_Extended
200c-201d, 2027-202c General_Punctuation
2074 Superscripts_and_Subscripts
2184 Number_Forms
2c89-2cad Coptic
2d30-2d33, 2d3c-2d65 Tifinagh
30a2-30a6, 30b0, 30bf-30c1, 30ca-30cc, 30fc, 31f0 Katakana
4e2d, 4f52, 58ee-58ef, 5fb3, 6587, 65e5, 672c, 7ca4, 7cb5, 7e41, 8a9e, 8bed, 9756, 9ad4 CJK_Unified_Ideographs
a188, a259, a320 Yi_Syllables
a4e1-a4e2, a4f2-a4f4 Lisu
a524, a559 Vai
a78c Latin_Extended-D
a90a, a91b-a92d Kayah_Li
aa6b-aa71 Myanmar_Extended-A
ab03 Ethiopic_Extended-A
ad6d, c5b4, d55c Hangul_Syllables
fb1e Alphabetic_Presentation_Forms
fbe3 Arabic_Presentation_Forms-A
110d0, 110da-110dd Sora_Sompeng
1110b-1110c, 1111f-11126, 11133-11134 Chakma
1e906, 1e922-1e92a, 1e935 Adlam
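
For reference, here is a rough sketch of how a set of characters can be collapsed into hex ranges like those above. It is not the script that produced the list. The `max_gap` parameter is an assumption: the list above evidently merges nearby code points into one range, but the exact threshold used is not stated. Grouping by block name is omitted because Python's standard library does not expose Unicode block names.

```python
from typing import Iterable


def to_hex_ranges(chars: Iterable[str], max_gap: int = 1) -> str:
    """Collapse characters into comma-separated lowercase hex ranges.

    Two code points land in the same range when they differ by at most
    max_gap (hypothetical knob; 1 means strictly consecutive).
    """
    code_points = sorted({ord(c) for c in chars})
    ranges: list[list[int]] = []
    for cp in code_points:
        if ranges and cp - ranges[-1][1] <= max_gap:
            ranges[-1][1] = cp  # extend the current range
        else:
            ranges.append([cp, cp])  # start a new range
    return ", ".join(
        f"{lo:x}" if lo == hi else f"{lo:x}-{hi:x}" for lo, hi in ranges
    )
```

For example, `to_hex_ranges("국어한")` gives `ad6d, c5b4, d55c`, matching the Hangul_Syllables line.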

imnasnainaec commented 6 months ago

Below are maximal Unicode ranges for each script that uses something outside of the shared blocks, abbreviated (+): Basic_Latin, Latin-1_Supplement, Latin_Extended-A, Latin_Extended-B, IPA_Extensions, Spacing_Modifier_Letters, Combining_Diacritical_Marks, Phonetic_Extensions, Phonetic_Extensions_Supplement, Latin_Extended_Additional, General_Punctuation, Superscripts_and_Subscripts, Number_Forms, Latin_Extended-D.

Greek_and_Coptic, Greek_Extended, Coptic, (+): 395-3ef, 1f19, 2c89-2cad, (41-74, 300-341, 1d10)

Cyrillic, Cyrillic_Supplement, (+): 410-513, (42-eb, 181-304, 2019, 201d)

Armenian: 561-582

Hebrew, Alphabetic_Presentation_Forms: 5b0-5ea, fb1e

Arabic, Arabic_Presentation_Forms-A, (+): 622-6d5, fbe3, (43-75, 202c)

Syriac: 710-72c

Thaana: 780-7b0

Devanagari, (General_Punctuation): 901-971, (200d)

Bengali: 981-9f0

Gurmukhi: a1c-a70

Gujarati: a82-acd

Oriya, (General_Punctuation): b06-b4d, (200c)

Tamil: b87-bcd

Telugu: c02-c4d

Kannada: c95-ccd

Malayalam: d02-d4d

Sinhala: d82-dd2

Thai: e01-e4c

Lao: e81-ecc

Tibetan: f0b-fb2

Myanmar, Myanmar_Extended-A: 1000-109d, aa6b, aa71

Georgian: 10d0-10e8

Ethiopic, Ethiopic_Extended-A, (Basic_Latin): 1203-1361, ab03, (44-77)

Cherokee: 13a6-13ec

Unified_Canadian_Aboriginal_Syllabics: 1403-15ac

Khmer, (Basic_Latin): 1781-17dd, (42-75)

Tai_Le: 1951-1973

New_Tai_Lue: 1985-19c9

Ol_Chiki: 1c5b-1c72

Tifinagh: 2d30-2d65

Katakana: 30a2-31f0

CJK_Unified_Ideographs

4e2d, 6587, 7e41, 9ad4
4f52, 58ee-58ef, 5fb3, 7ca4, 8bed, 9756
65e5, 672c, 7cb5, 8a9e

Yi_Syllables: a188-a320

Lisu: a4e1-a4f4

Vai: a524, a559

Kayah_Li: a90a-a92d

Hangul_Syllables: ad6d, c5b4, d55c

Sora_Sompeng: 110d0-110dd

Chakma: 1110b-11134

Adlam: 1e906-1e935
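
If these ranges end up in a stylesheet, one option is the CSS `unicode-range` descriptor of `@font-face`, which tells the browser which characters a font face is meant for. The helper below is only an illustrative sketch (the name `to_css_unicode_range` is hypothetical) that converts a range list in the notation above into that syntax.

```python
def to_css_unicode_range(ranges: str) -> str:
    """Convert e.g. "5b0-5ea, fb1e" into "U+5B0-5EA, U+FB1E"."""
    parts = []
    for token in ranges.split(","):
        token = token.strip()
        if not token:
            continue
        start, _, end = token.partition("-")
        int(start, 16)  # raises ValueError on a malformed hex value
        if end:
            int(end, 16)
        css = f"U+{start.upper()}"
        if end:
            css += f"-{end.upper()}"
        parts.append(css)
    return ", ".join(parts)
```

For the Hebrew line, `to_css_unicode_range("5b0-5ea, fb1e")` yields `U+5B0-5EA, U+FB1E`.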

imnasnainaec commented 6 months ago

Fonts that probably give good coverage, according to https://github.com/silnrsi/langfontfinder/blob/main/data/script2font.csv:

Noto Sans covers: Latin, Greek, Cyrillic, Devanagari.

Covered by Noto Sans JP, Noto Sans KR, Noto Sans SC, ~~Noto Sans TC~~: Katakana, CJK Unified Ideographs, Hangul

Noto Serif Tibetan: Tibetan

imnasnainaec commented 6 months ago

Per https://developers.google.com/fonts/docs/getting_started:

... returns a 377 KB css file.

imnasnainaec commented 6 months ago

Below are the results from testing coverage of Noto Sans JP/KR/SC/TC on Katakana (10 characters), CJK_Unified (15 characters), and Hangul (3 characters):

JP: アイウグタチナヌーㇰ中佒壮壯徳文日本粤 □ 繁語 □ 靖體 □ □ □

KR: アイウグタチナヌー □ 中 □ □ 壯 □ 文日本 □ □ 繁語 □ 靖體 국 어 한

SC: アイウグタチナヌー □ 中 □ 壮壯徳文日本粤粵繁語语靖體 □ □ □

TC: アイウグタチナヌー □ 中佒 □ 壯 □ 文日本 □ 粵繁語 □ 靖體 □ □ □

So TC is redundant and removed from the above link.
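
The redundancy can be double-checked mechanically. The sketch below encodes the missing glyphs (the □ positions) from the test results above and reports any font whose coverage is contained in the union of the other fonts' coverage. The names `ALL_CHARS`, `MISSING`, and `redundant_fonts` are made up for illustration.

```python
# The 28 test characters: 10 Katakana, 15 CJK ideographs, 3 Hangul.
ALL_CHARS = "アイウグタチナヌーㇰ中佒壮壯徳文日本粤粵繁語语靖體국어한"

# Characters each font rendered as tofu in the test above.
MISSING = {
    "JP": "粵语국어한",
    "KR": "ㇰ佒壮徳粤粵语",
    "SC": "ㇰ佒국어한",
    "TC": "ㇰ壮徳粤语국어한",
}


def redundant_fonts(all_chars: str, missing: dict[str, str]) -> list[str]:
    """Return fonts whose covered characters are all covered by other fonts."""
    covered = {font: set(all_chars) - set(m) for font, m in missing.items()}
    result = []
    for font, chars in covered.items():
        others = set().union(
            *(c for other, c in covered.items() if other != font)
        )
        if chars <= others:
            result.append(font)
    return result


print(redundant_fonts(ALL_CHARS, MISSING))
```

JP is needed for ㇰ, KR for the Hangul syllables, and SC for 语, so only TC is reported.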

imnasnainaec commented 5 months ago

WS Tech is working on a font (inspired by https://github.com/santhoshtr/AutonymFont) to support precisely the autonyms present in the langtags.json that they maintain.

imnasnainaec commented 1 month ago

Here's the in-development WSTech script for generating said font: https://github.com/silnrsi/palaso-python/blob/master/scripts/font/autonyms.py

imnasnainaec commented 1 month ago

The "kyu" example doesn't show tofu anymore on QA or on thecombine.app. And more extensive spot tests yield no tofu.

imnasnainaec commented 3 weeks ago

@jmgrady Does this issue appear on the NUC and/or your offline Ubuntu deployments?

jmgrady commented 3 weeks ago

Yes, kyu shows tofu on the NUC. The language fonts installed are:

      localLangList:
        - "ar"
        - "en"
        - "es"
        - "fr"
        - "pt"
        - "zh"

sillsdev / TheCombine

Language pickers shows tofu, needs font support #2644