Closed: michaelbromley closed this issue 4 years ago
IMO you should change the enum from LanguageCode to LocaleCode. A locale is a combination of language + country and is usually in the xx-XX format, e.g. en-US or en-GB.
This also serves as a mechanism to showcase different products in your store based on country. Happy to discuss more details on slack.
https://www.w3.org/International/articles/language-tags/
The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
As far as I can tell, Sylius uses the Symfony Intl package, and on https://demo.sylius.com/admin they list 593 locales.
Saleor is built on the Django framework, which comes with support for locales in the format en-gb:
Represents the name of a language. Browsers send the names of the languages they accept in the Accept-Language HTTP header using this format. Examples: it, de-at, es, pt-br. Language codes are generally represented in lowercase, but the HTTP Accept-Language header is case-insensitive. The separator is a dash.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language
<language>
A language tag (which is sometimes referred to as a "locale identifier"). This consists of a 2-3 letter base language tag representing the language, optionally followed by additional subtags separated by '-'. The most common extra information is the country or region variant (like 'en-US' or 'fr-CA') or the type of alphabet to use (like 'sr-Latn'). Other variants like the type of orthography ('de-DE-1996') are usually not used in the context of this header.
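As a small illustration of the header format described above, here is a minimal sketch of parsing Accept-Language into tags ordered by quality value. This is illustrative only; a real server should use a tested library such as the negotiator package.

```javascript
// Minimal sketch: split an Accept-Language header into tags and sort
// them by their "q" (quality) parameter, defaulting to q=1.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, ...params] = part.trim().split(';');
      const qParam = params.find((p) => p.trim().startsWith('q='));
      const q = qParam ? parseFloat(qParam.trim().slice(2)) : 1;
      return { tag: tag.trim(), q };
    })
    .sort((a, b) => b.q - a.q)
    .map((entry) => entry.tag);
}

console.log(parseAcceptLanguage('en-GB,en;q=0.9,de;q=0.8'));
// → [ 'en-GB', 'en', 'de' ]
```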
console.log(Intl.getCanonicalLocales('EN-US'));
// expected output: Array ["en-US"]

console.log(Intl.getCanonicalLocales(['EN-US', 'Fr']));
// expected output: Array ["en-US", "fr"]

try {
  Intl.getCanonicalLocales('EN_US');
} catch (err) {
  console.log(err);
  // expected output: RangeError: invalid language tag: EN_US
}
@nbezalwar
IMO you should change the enum from LanguageCode to LocaleCode. A locale is a combination of language + country and is usually in the xx-XX format, e.g. en-US or en-GB.
That seems to cover most cases, but e.g. with Chinese we have 2 dimensions: the script and the location. So we can have:

zh-Hans-HK = Chinese as spoken in Hong Kong, written in simplified script
zh-Hant-HK = Chinese as spoken in Hong Kong, written in traditional script

so that all the Chinese variations end up as:
Chinese
Chinese (Hong Kong SAR China)
Chinese (Macao SAR China)
Chinese (Simplified, China)
Chinese (Simplified, Hong Kong SAR China)
Chinese (Simplified, Macao SAR China)
Chinese (Simplified, Singapore)
Chinese (Simplified)
Chinese (Singapore)
Chinese (Taiwan)
Chinese (Traditional, Hong Kong SAR China)
Chinese (Traditional, Macao SAR China)
Chinese (Traditional, Taiwan)
Chinese (Traditional)
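Incidentally, English display names like the ones above don't have to be hard-coded: Intl.DisplayNames (available in modern browsers and Node.js 14+ with full ICU) can generate them from the tags. A quick sketch (the exact strings returned depend on the runtime's ICU/CLDR data):

```javascript
// Sketch: derive English display names for language tags at runtime.
const languageNames = new Intl.DisplayNames(['en'], { type: 'language' });

for (const tag of ['zh', 'zh-Hans', 'zh-Hant-HK', 'zh-Hant-TW']) {
  console.log(`${tag} → ${languageNames.of(tag)}`);
}
```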
On top of that, a Chinese speaker in Hong Kong may want to translate into Cantonese (zh-yue), which is then another dimension!
I'm sure there are other important cases where the en-GB model breaks down, given the rich variety of human languages and history!
However, for practical purposes, the en-GB model will probably suffice. Otherwise it becomes a bit of a rabbit hole, trying to support all combinations of the world's cultural diversity, which might not be time well spent compared to getting the core features finished :D
So deciding to use the <language>-<region> system then presents the question of which combinations to include.
Covering every possibility would involve (~175 ISO 639-1 codes) * (~250 ISO 3166 region codes) = over 43,000 combinations. The problem is stated nicely in this StackOverflow answer:
This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
Such a pragmatic list could be obtained from this Unicode CLDR summary chart, taking all rows from the table labeled Locale Display Names | Languages that have a 2-letter language code with an optional region part.
This can be expressed as JavaScript that can be run directly on the linked page:
Array.from(document.querySelectorAll('.body table')[1].querySelectorAll('tr'))
  .filter(row => {
    const pageCell = row.querySelector('td:nth-child(3)');
    return pageCell && pageCell.innerText.startsWith('Languages ');
  })
  .filter(row => {
    const codeCell = row.querySelector('td:nth-child(7)');
    return codeCell && codeCell.innerText.match(/^[a-z]{2}(_[A-Za-z]+)?$/);
  })
  .map(row => {
    const nameCell = row.querySelector('td:nth-child(6)');
    const codeCell = row.querySelector('td:nth-child(7)');
    return `${codeCell.innerText}: ${nameCell.innerText}`;
  })
  .join('\n');
which yields the following list of 157 languages:
af: Afrikaans
ak: Akan
sq: Albanian
am: Amharic
ar: Arabic
hy: Armenian
as: Assamese
az: Azerbaijani
bm: Bambara
bn: Bangla
eu: Basque
be: Belarusian
bs: Bosnian
br: Breton
bg: Bulgarian
my: Burmese
ca: Catalan
ce: Chechen
zh: Chinese
zh_Hans: Simplified Chinese
zh_Hant: Traditional Chinese
cu: Church Slavic
kw: Cornish
co: Corsican
hr: Croatian
cs: Czech
da: Danish
nl: Dutch
nl_BE: Flemish
dz: Dzongkha
en: English
en_AU: Australian English
en_CA: Canadian English
en_GB: British English
en_US: American English
eo: Esperanto
et: Estonian
ee: Ewe
fo: Faroese
fi: Finnish
fr: French
fr_CA: Canadian French
fr_CH: Swiss French
ff: Fulah
gl: Galician
lg: Ganda
ka: Georgian
de: German
de_AT: Austrian German
de_CH: Swiss High German
el: Greek
gu: Gujarati
ht: Haitian Creole
ha: Hausa
he: Hebrew
hi: Hindi
hu: Hungarian
is: Icelandic
ig: Igbo
id: Indonesian
ia: Interlingua
ga: Irish
it: Italian
ja: Japanese
jv: Javanese
kl: Kalaallisut
kn: Kannada
ks: Kashmiri
kk: Kazakh
km: Khmer
ki: Kikuyu
rw: Kinyarwanda
ko: Korean
ku: Kurdish
ky: Kyrgyz
lo: Lao
la: Latin
lv: Latvian
ln: Lingala
lt: Lithuanian
lu: Luba-Katanga
lb: Luxembourgish
mk: Macedonian
mg: Malagasy
ms: Malay
ml: Malayalam
mt: Maltese
gv: Manx
mi: Maori
mr: Marathi
mn: Mongolian
ne: Nepali
nd: North Ndebele
se: Northern Sami
nb: Norwegian Bokmål
nn: Norwegian Nynorsk
ny: Nyanja
or: Odia
om: Oromo
os: Ossetic
ps: Pashto
fa: Persian
fa_AF: Dari
pl: Polish
pt: Portuguese
pt_BR: Brazilian Portuguese
pt_PT: European Portuguese
pa: Punjabi
qu: Quechua
ro: Romanian
ro_MD: Moldavian
rm: Romansh
rn: Rundi
ru: Russian
sm: Samoan
sg: Sango
sa: Sanskrit
gd: Scottish Gaelic
sr: Serbian
sn: Shona
ii: Sichuan Yi
sd: Sindhi
si: Sinhala
sk: Slovak
sl: Slovenian
so: Somali
st: Southern Sotho
es: Spanish
es_ES: European Spanish
es_MX: Mexican Spanish
su: Sundanese
sw: Swahili
sw_CD: Congo Swahili
sv: Swedish
tg: Tajik
ta: Tamil
tt: Tatar
te: Telugu
th: Thai
bo: Tibetan
ti: Tigrinya
to: Tongan
tr: Turkish
tk: Turkmen
uk: Ukrainian
ur: Urdu
ug: Uyghur
uz: Uzbek
vi: Vietnamese
vo: Volapük
cy: Welsh
fy: Western Frisian
wo: Wolof
xh: Xhosa
yi: Yiddish
yo: Yoruba
zu: Zulu
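One wrinkle with the list above: the CLDR chart (like a GraphQL enum, which cannot contain hyphens) writes codes with underscores (pt_BR, zh_Hant), while IETF/BCP 47 tags use hyphens (pt-BR, zh-Hant). A minimal sketch of converting between the two forms, using Intl.getCanonicalLocales for validation (the helper names here are illustrative, not part of any existing API):

```javascript
// Convert an underscore-style enum member to a canonical BCP 47 tag.
// Throws a RangeError if the resulting tag is not a valid locale.
function enumToTag(code) {
  return Intl.getCanonicalLocales(code.replace(/_/g, '-'))[0];
}

// Convert a BCP 47 tag back to an underscore-style enum member.
function tagToEnum(tag) {
  return tag.replace(/-/g, '_');
}

console.log(enumToTag('pt_BR'));   // → "pt-BR"
console.log(tagToEnum('zh-Hant')); // → "zh_Hant"
```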
Great solution! Just to complement the prior art: some projects even aim to support the 100 most popular locales, like Material-UI, and all are published on Crowdin so that everyone can suggest translations.
IMO, this is a good list of 125 languages to support. https://www.andiamo.co.uk/resources/iso-language-codes/
For Chinese, zh-CN and zh-TW are the 2 variants for simplified and traditional Chinese.
A recent pull request #354 adds support for Traditional Chinese (as spoken in HK and Taiwan). We already have Simplified Chinese translations. This exposes a shortcoming in our language handling: we currently only list the ISO 639-1 codes for languages, without distinguishing between regional variations.
So, e.g. we have the LanguageCode enum, which is widely used throughout the core and which only has "zh" for all variations of Chinese.

We could add support for common IETF language tags (which I have used in this PR), which would then allow us to support the zh-CN and zh-TW UI translations, as well as translations in places such as product translations etc.

The question then is "which full IETF tags to support?"
Add all possible IETF tags
Cases like Chinese are interesting because the actual writing system is distinct between Simplified and Traditional. Whereas an American can totally understand anything written in British English and vice-versa, does the same apply to Chinese speakers in Beijing and Taipei?
How about Madrid (es-ES) and Mexico City (es-MX)? I personally cannot answer these questions, since I am only very familiar with English (and to a lesser extent German).
Adding all variations seems like a bad idea, e.g. according to https://datahub.io/core/language-codes#resource-ietf-language-tags there are over 100 variations of English and over 40 variations of French. In practice, a single "English" version of anything would probably suffice (although I say this as an English speaker from England, so perhaps my perspective is too narrow). In any case, I'm pretty sure that listing 100 variations of English in UI menus is not a desirable solution.
Add tags on an ad-hoc basis
Another approach would be to default to the ISO 639-1 codes as we currently have, and only add regional variations as-and-when the need arises (e.g. a pull request like the one that triggered this issue comes in).
In this case, we'd then need to update the LanguageCode enum as well as the language-translation-strings.ts file to allow localization of these new variants.
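To make the ad-hoc approach concrete, here is an illustrative sketch of what such an update might look like. The member names and structure are assumed from the discussion above, not taken from the actual Vendure source:

```javascript
// Hypothetical sketch of the LanguageCode enum after ad-hoc additions.
// In the real (TypeScript) codebase this would be a GraphQL-generated
// enum; underscores are used because enum members cannot contain "-".
const LanguageCode = {
  // ...existing ISO 639-1 codes...
  zh: 'zh',
  // Added ad hoc when the Simplified/Traditional translations arrived:
  zh_Hans: 'zh_Hans',
  zh_Hant: 'zh_Hant',
};

// language-translation-strings.ts would gain matching display names:
const languageTranslationStrings = {
  zh: 'Chinese',
  zh_Hans: 'Simplified Chinese',
  zh_Hant: 'Traditional Chinese',
};

console.log(languageTranslationStrings[LanguageCode.zh_Hant]);
// → "Traditional Chinese"
```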