sillsdev / TheCombine

This is a tool for supporting the rapid word collection workshop and post workshop clean-up
https://sillsdev.github.io/TheCombine/
MIT License

Language picker shows tofu, needs font support #2644

Open imnasnainaec opened 11 months ago

imnasnainaec commented 11 months ago

The MuiLanguagePicker displays some characters that aren't supported by our default UI font, so they render as tofu. For example, see the results of a search for "kyu" (image).

imnasnainaec commented 6 months ago

Used ChatGPT to slap together a Python script to extract every character that appears in the localname and localnames fields of https://raw.githubusercontent.com/sillsdev/mui-language-picker/master/src/data/langtags.json (whose content is from https://github.com/silnrsi/langtags/blob/master/pub/langtags.json):

" ' , - : A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z ² À Á Ã Å È É Ê Ì Ð Ñ Ò Ó Ö à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý ā ă ą ć Č č đ ē ĕ ė ę ě ħ Ĩ ĩ Ī ī ĭ ı ļ Ł ł Ŋ ŋ Ō ō ś ŝ ş š ũ ū ŭ ů ų ŵ Ž ž Ɓ Ɔ Ɗ Ə Ɨ ơ ǀ ǎ ǝ ǩ ǫ ȟ ȯ Ɂ ɐ ɓ ɔ ɗ ə ɛ ɣ ɨ ɩ ɬ ɵ ɽ ʉ ʋ ʌ ʔ ʷ ʹ ʻ ʼ ʾ ˀ ˊ ˯ ̀ ́ ̂ ̃ ̄ ̇ ̈ ̌ ̢ ̣ ̧ ̨ ̰ ̱ ̲ ̶ ́ Ε Ν Π ά έ ή ί α ε η ι κ λ ν ο ρ σ τ ό ϯ А Б Г Д З К М Н О С Т У Х Ц Ч Ш Ю Я а б в г д е ж з и й к л м н о п р с т у ф х ц ч ш ъ ы ь э я ѓ і ї ј ў ѣ ғ ҕ Қ қ ҡ ң ҧ ү ҷ ҹ Ӏ ӄ Ӈ ӈ ӏ ӑ ӗ ә ӣ ӧ ӹ Ԓ ԓ ա ե է հ մ յ ն տ ր ւ ְ ֲ ִ ַ ָ ֹ ּ ־ ׁ א ב ג ד ה ו ח י ל מ ן נ ס ע פ ק ר ש ת آ ؤ ئ ا ب ة ت ج ح خ د ذ ر ز س ش ص ط ع غ ف ق ك ل م ن ه و ى ي َ ُ ِ ْ ٛ ٜ ٲ ٽ پ چ ڈ ڌ ڍ ڑ ښ ڢ ک ڪ گ ھ ہ ۆ ۇ ۊ ی ێ ې ە ܐ ܘ ܝ ܠ ܢ ܣ ܪ ܫ ܬ ހ ބ ވ ދ ސ ަ ި ެ ް ँ ं ः अ आ इ ई ऊ क ख ग घ ङ च छ ज झ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ऱ ल ळ व श ष स ह ़ ा ि ी ु ू ृ े ै ॉ ो ौ ् ड़ ॱ ঁ ং অ ই উ ক গ চ ছ জ ট ড ণ ত দ ন প ব ভ ম য র ল শ স হ ় া ি ী ু ৃ ে ৈ ো ্ ৰ ਜ ਪ ਬ ਭ ਸ ਼ ਾ ੀ ੰ ં આ ક ગ ચ છ જ ડ ત દ મ ય ર વ સ ા િ ી ુ ્ ଆ ଇ ଉ ଓ କ ଙ ଜ ଟ ଡ ଣ ଦ ପ ବ ମ ର ଳ ଶ ସ ଼ ା ି ୀ ୁ େ ୋ ୍ இ க ச ட ண த ன ப ம ய ர ற ள ழ ா ி ு ெ ொ ௌ ் ం ఆ ఎ ఒ క గ జ డ త ద బ య ర ఱ ల వ స ా ి ు ె ొ ో ్ ಕ ಗ ಡ ತ ನ ಬ ಭ ಳ ವ ಷ ಾ ು ೆ ೊ ್ ം ക ഗ ഡ ണ പ മ യ റ ല ള ാ ി ു ് ං ල ස හ ි ก ข ค ง ช ซ ญ ต ถ ท น บ ป ผ พ ภ ม ย ร ล ว ษ ส อ ฮ ะ ั า ำ ิ ี ื ุ ู เ ใ ไ ่ ้ ์ ກ ຍ ຕ ບ ພ ມ ຣ ລ ວ ສ ຫ ອ າ ຶ ຸ ູ ໂ ້ ໌ ་ ། ཀ ཁ ག ང ཆ ད བ མ འ ཡ ར ལ ས ི ོ ྐ ྗ ྟ ྫ ྭ ྱ ྲ က ခ င စ ဆ ည တ ဒ န ပ ဖ ဘ မ ယ ရ လ ဝ သ အ ဢ ါ ာ ိ ီ ု ေ ဲ ံ ့ း ် ြ ွ ှ ၠ ၤ ၵ ၸ ႃ ႆ ႈ ႏ ႝ ა ე თ ი ლ ნ რ უ ქ შ ሃ ሊ ላ ል ሙ ማ ም ረ ሪ ራ ር ሮ ሰ ስ ሶ ቡ ባ ቤ ብ ቱ ታ ት ኃ ነ ን ኖ ኛ አ ኡ ኢ ኣ ኬ ኾ ወ ዋ ው ዓ ዕ ዝ የ ይ ዳ ዴ ጉ ጊ ጌ ግ ጎ ጚ ጛ ጤ ፈ ፋ ፟ ፡ Ꭶ Ꭹ Ꭿ Ꮃ Ꮒ Ꮝ Ꮧ Ꮳ Ꮼ ᐃ ᐄ ᐅ ᐊ ᐍ ᐏ ᐣ ᐤ ᐦ ᐧ ᐱ ᑎ ᑐ ᑦ ᑲ ᒃ ᒧ ᒨ ᓀ ᓂ ᓃ ᓄ ᓅ ᓇ ᓐ ᓕ ᓖ ᔅ ᔑ ᔨ ᔪ ᔫ ᔭ ᖅ ᖬ ខ គ ង ត ន ព ម រ ឞ ឹ ូ ួ ែ ំ ្ ៝ ᥑ ᥒ ᥖ ᥨ ᥬ ᥭ ᥰ ᥳ ᦅ ᦑ ᦟ ᦹ ᦺ ᧄ ᧉ ᱛ ᱟ ᱤ ᱥ ᱱ ᱲ ᴐ ᶉ ḇ ḍ Ḏ ḓ ḥ ṇ ṣ ṭ ṯ ṹ ạ ẹ Ẽ ẽ ế ệ Ị ị Ọ ọ ụ ỹ Ἑ ‌ ‍ ‑ ‘ ’ ” ‧ ‬ ⁴ ↄ ⲉ ⲏ ⲓ ⲙ ⲛ ⲣ ⲧ ⲭ ⴰ ⴳ ⴼ ⵃ ⵆ ⵉ ⵌ ⵍ ⵎ ⵏ ⵓ ⵔ ⵖ ⵛ ⵜ ⵡ ⵢ ⵣ ⵥ ア イ ウ グ タ チ ナ ヌ ー ㇰ 中 佒 壮 壯 徳 文 日 本 粤 粵 繁 語 语 靖 體 ꆈ ꉙ ꌠ ꓡ ꓢ 
ꓲ ꓴ ꔤ ꕙ ꞌ ꤊ ꤛ ꤜ ꤟ ꤢ ꤤ ꤬ ꤭ ꩫ ꩱ ꬃ 국 어 한 ﬞ ﯣ 𑃐 𑃚 𑃝 𑄋 𑄌 𑄟 𑄦 𑄳 𑄴 𞤆 𞤢 𞤤 𞤪 𞤵
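
The script itself isn't posted in this thread, so the following is only a minimal sketch of how such an extraction could look. It assumes the parsed JSON is a list of objects in which `localname` is a string and `localnames` is a list of strings, and that some entries have neither; the function name `local_name_chars` and the sample data are hypothetical.

```python
import json


def local_name_chars(entries):
    """Collect every distinct character used in localname/localnames fields."""
    chars = set()
    for entry in entries:
        # Entries without these fields contribute nothing.
        names = [entry.get("localname", "")]
        names.extend(entry.get("localnames", []))
        for name in names:
            chars.update(name)
    return sorted(chars)


# Tiny stand-in for the parsed langtags.json content (hypothetical sample data).
sample = json.loads(
    '[{"tag": "fr", "localname": "français"},'
    ' {"tag": "de", "localnames": ["Deutsch"]},'
    ' {"tag": "_globalvar"}]'
)
print(" ".join(local_name_chars(sample)))
```

For the real file, `sample` would be replaced by `json.load(...)` on a downloaded copy of langtags.json. Sorting by code point keeps characters from the same script next to each other, which matches the ordering of the list above.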

imnasnainaec commented 6 months ago

The above characters are from the following Unicode ranges:

imnasnainaec commented 6 months ago

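Below are the maximal Unicode ranges (lowest to highest code point in use, in hex) for each script that uses something outside of (+), where (+) stands for the following blocks: Basic_Latin, Latin-1_Supplement, Latin_Extended-A, Latin_Extended-B, IPA_Extensions, Spacing_Modifier_Letters, Combining_Diacritical_Marks, Phonetic_Extensions, Phonetic_Extensions_Supplement, Latin_Extended_Additional, General_Punctuation, Superscripts_and_Subscripts, Number_Forms, Latin_Extended-D. Ranges in parentheses are code points from the parenthesized blocks that appear alongside that script.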

Greek_and_Coptic, Greek_Extended, Coptic, (+): 395-3ef, 1f19, 2c89-2cad, (41-74, 300-341, 1d10)

Cyrillic, Cyrillic_Supplement, (+): 410-513, (42-eb, 181-304, 2019, 201d)

Armenian: 561-582

Hebrew, Alphabetic_Presentation_Forms: 5b0-5ea, fb1e

Arabic, Arabic_Presentation_Forms-A, (+): 622-6d5, fbe3, (43-75, 202c)

Syriac: 710-72c

Thaana: 780-7b0

Devanagari, (General_Punctuation): 901-971, (200d)

Bengali: 981-9f0

Gurmukhi: a1c-a70

Gujarati: a82-acd

Oriya, (General_Punctuation): b06-b4d, (200c)

Tamil: b87-bcd

Telugu: c02-c4d

Kannada: c95-ccd

Malayalam: d02-d4d

Sinhala: d82-dd2

Thai: e01-e4c

Lao: e81-ecc

Tibetan: f0b-fb2

Myanmar, Myanmar_Extended-A: 1000-109d, aa6b, aa71

Georgian: 10d0-10e8

Ethiopic, Ethiopic_Extended-A, (Basic_Latin): 1203-1361, ab03, (44-77)

Cherokee: 13a6-13ec

Unified_Canadian_Aboriginal_Syllabics: 1403-15ac

Khmer, (Basic_Latin): 1781-17dd, (42-75)

Tai_Le: 1951-1973

New_Tai_Lue: 1985-19c9

Ol_Chiki: 1c5b-1c72

Tifinagh: 2d30-2d65

Katakana: 30a2-31f0

CJK_Unified_Ideographs

Yi_Syllables: a188-a320

Lisu: a4e1-a4f4

Vai: a524, a559

Kayah_Li: a90a-a92d

Hangul_Syllables: ad6d, c5b4, d55c

Sora_Sompeng: 110d0-110dd

Chakma: 1110b-11134

Adlam: 1e906-1e935
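
How the ranges above were computed isn't shown, so here is a rough sketch of one way to get "lowest-highest" ranges per block from a character list. The `BLOCKS` table is a small hand-copied excerpt used purely for illustration (a real run would need the full block list, e.g. from Unicode's Blocks.txt), and the names `block_of` and `used_ranges` are hypothetical.

```python
# Partial, hand-copied block table (start, end, name); an assumption for
# illustration only. A real run would load the complete list of blocks.
BLOCKS = [
    (0x0000, 0x007F, "Basic_Latin"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x10A0, 0x10FF, "Georgian"),
]


def block_of(char):
    """Return the block name for a character, or 'Unknown' if not in BLOCKS."""
    code_point = ord(char)
    for start, end, name in BLOCKS:
        if start <= code_point <= end:
            return name
    return "Unknown"


def used_ranges(chars):
    """Map each block name to the 'lowest-highest' hex code points in use."""
    seen = {}
    for char in chars:
        seen.setdefault(block_of(char), []).append(ord(char))
    ranges = {}
    for name, code_points in seen.items():
        low, high = min(code_points), max(code_points)
        # A single code point is written alone, like "fb1e" in the list above.
        ranges[name] = f"{low:x}" if low == high else f"{low:x}-{high:x}"
    return ranges


print(used_ranges([chr(0x561), chr(0x570), chr(0x582)]))
```

Note that this reports only the span between the extreme code points per block, so a range such as 561-582 does not imply that every code point in between is used.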

imnasnainaec commented 6 months ago

Fonts that probably give good coverage, according to https://github.com/silnrsi/langfontfinder/blob/main/data/script2font.csv:

Noto Sans covers: Latin, Greek, Cyrillic, Devanagari.

The following have their own Noto Sans ___: [Coptic](https://en.wikipedia.org/wiki/Coptic_(Unicode_block)), [Armenian](https://en.wikipedia.org/wiki/Armenian_(Unicode_block)), [Hebrew](https://en.wikipedia.org/wiki/Hebrew_(Unicode_block)), [Arabic](https://en.wikipedia.org/wiki/Arabic_(Unicode_block)), [Syriac](https://en.wikipedia.org/wiki/Syriac_(Unicode_block)), [Thaana](https://en.wikipedia.org/wiki/Thaana_(Unicode_block)), [Bengali](https://en.wikipedia.org/wiki/Bengali_(Unicode_block)), [Gurmukhi](https://en.wikipedia.org/wiki/Gurmukhi_(Unicode_block)), [Gujarati](https://en.wikipedia.org/wiki/Gujarati_(Unicode_block)), [Oriya](https://en.wikipedia.org/wiki/Oriya_(Unicode_block)), [Tamil](https://en.wikipedia.org/wiki/Tamil_(Unicode_block)), [Telugu](https://en.wikipedia.org/wiki/Telugu_(Unicode_block)), [Kannada](https://en.wikipedia.org/wiki/Kannada_(Unicode_block)), [Malayalam](https://en.wikipedia.org/wiki/Malayalam_(Unicode_block)), [Sinhala](https://en.wikipedia.org/wiki/Sinhala_(Unicode_block)), [Thai](https://en.wikipedia.org/wiki/Thai_(Unicode_block)), [Lao](https://en.wikipedia.org/wiki/Lao_(Unicode_block)), [Myanmar](https://en.wikipedia.org/wiki/Myanmar_(Unicode_block)), [Georgian](https://en.wikipedia.org/wiki/Georgian_(Unicode_block)), [Ethiopic](https://en.wikipedia.org/wiki/Ethiopic_(Unicode_block)), [Cherokee](https://en.wikipedia.org/wiki/Cherokee_(Unicode_block)), Canadian Aboriginal, Khmer, Tai Le, New Tai Lue, Ol Chiki, Tifinagh, Yi, Lisu, Vai, Kayah Li, Sora Sompeng, Chakma, Adlam

Covered by Noto Sans JP, Noto Sans KR, Noto Sans SC, Noto Sans TC: Katakana, CJK Unified Ideographs, Hangul

Noto Serif Tibetan: Tibetan

imnasnainaec commented 6 months ago

Per https://developers.google.com/fonts/docs/getting_started:

https://fonts.googleapis.com/css?family=Noto+Sans|Noto+Sans+JP|Noto+Sans+KR|Noto+Sans+SC|Noto+Sans+Coptic|Noto+Sans+Armenian|Noto+Sans+Hebrew|Noto+Sans+Arabic|Noto+Sans+Syriac|Noto+Sans+Thaana|Noto+Sans+Bengali|Noto+Sans+Gurmukhi|Noto+Sans+Gujarati|Noto+Sans+Oriya|Noto+Sans+Tamil|Noto+Sans+Telugu|Noto+Sans+Kannada|Noto+Sans+Malayalam|Noto+Sans+Sinhala|Noto+Sans+Thai|Noto+Sans+Lao|Noto+Sans+Myanmar|Noto+Sans+Georgian|Noto+Sans+Ethiopic|Noto+Sans+Cherokee|Noto+Sans+Canadian+Aboriginal|Noto+Sans+Khmer|Noto+Sans+Tai+Le|Noto+Sans+New+Tai+Lue|Noto+Sans+Ol+Chiki|Noto+Sans+Tifinagh|Noto+Sans+Yi|Noto+Sans+Lisu|Noto+Sans+Vai|Noto+Sans+Kayah+Li|Noto+Sans+Sora+Sompeng|Noto+Sans+Chakma|Noto+Sans+Adlam|Noto+Serif+Tibetan

... returns a 377 KB CSS file.
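
The link above follows a simple pattern: family names joined by `|`, with spaces written as `+`. As a small sketch, assuming only that pattern (the helper name `fonts_css_url` is hypothetical and the family list is shortened), the URL could be assembled like this:

```python
BASE_URL = "https://fonts.googleapis.com/css"


def fonts_css_url(families):
    """Build a Google Fonts CSS URL in the family=A|B|C form used above."""
    # Spaces in family names become "+", and families are separated by "|".
    query = "|".join(name.replace(" ", "+") for name in families)
    return f"{BASE_URL}?family={query}"


# Shortened example list; the full list is the one in the link above.
families = ["Noto Sans", "Noto Sans JP", "Noto Sans KR", "Noto Serif Tibetan"]
print(fonts_css_url(families))
```

Keeping the families in a list makes it easy to drop or add one and regenerate the link.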

imnasnainaec commented 6 months ago

Below are the results from testing coverage of Noto Sans JP/KR/SC/TC on Katakana (10 characters), CJK_Unified (15 characters), and Hangul (3 characters):

JP: ア イ ウ グ タ チ ナ ヌ ー ㇰ   中 佒 壮 壯 徳 文 日 本 粤 □ 繁 語 □ 靖 體   □ □ □

KR: ア イ ウ グ タ チ ナ ヌ ー □   中 □ □ 壯 □ 文 日 本 □ □ 繁 語 □ 靖 體   국 어 한

SC: ア イ ウ グ タ チ ナ ヌ ー □   中 □ 壮 壯 徳 文 日 本 粤 粵 繁 語 语 靖 體   □ □ □

TC: ア イ ウ グ タ チ ナ ヌ ー □   中 佒 □ 壯 □ 文 日 本 □ 粵 繁 語 □ 靖 體   □ □ □

So TC is redundant (every test character it covers is also covered by JP, KR, or SC) and has been removed from the link above.
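
The redundancy conclusion can be checked mechanically. The sketch below is an illustration, not the method actually used: it lists, for each font, only the test characters from the results above that are not covered by all four fonts, and flags a font as redundant when everything it covers is also covered by the other fonts combined. The name `redundant_fonts` is hypothetical.

```python
def redundant_fonts(coverage):
    """Return fonts whose covered characters all appear in the other fonts."""
    redundant = []
    for font, chars in coverage.items():
        covered_by_others = set()
        for other, other_chars in coverage.items():
            if other != font:
                covered_by_others |= set(other_chars)
        if set(chars) <= covered_by_others:
            redundant.append(font)
    return redundant


# Only the test characters that at least one font fails to cover,
# transcribed from the results above.
coverage = {
    "JP": "ㇰ佒壮徳粤",
    "KR": "국어한",
    "SC": "壮徳粤粵语",
    "TC": "佒粵",
}
print(redundant_fonts(coverage))  # ['TC']
```

Each font is compared against all the others at once, so two fonts with identical coverage would both be flagged; in that case only one of them could safely be dropped.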

imnasnainaec commented 5 months ago

WS Tech is working on a font (inspired by https://github.com/santhoshtr/AutonymFont) to support precisely the autonyms present in the langtags.json that they maintain.

imnasnainaec commented 1 month ago

Here's the in-development WSTech script for generating said font: https://github.com/silnrsi/palaso-python/blob/master/scripts/font/autonyms.py

imnasnainaec commented 1 month ago

The "kyu" example doesn't show tofu anymore on QA or on thecombine.app, and more extensive spot tests yield no tofu either.

imnasnainaec commented 3 weeks ago

@jmgrady Does this issue appear on the NUC and/or your offline Ubuntu deployments?

jmgrady commented 3 weeks ago

Yes, kyu shows tofu on the NUC. The language fonts installed are:

      localLangList:
        - "ar"
        - "en"
        - "es"
        - "fr"
        - "pt"
        - "zh"