Closed: michaelbromley closed this issue 4 years ago
IMO you should change the enum from LanguageCode to LocaleCode. A locale is a combination of language + country and is usually in the xx-XX format, e.g. en-US or en-GB.
This also serves as a mechanism to showcase different products in your store based on country. Happy to discuss more details on slack.
https://www.w3.org/International/articles/language-tags/
The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
As far as I can tell, Sylius uses the Symfony Intl package, and on https://demo.sylius.com/admin they list 593 locales.
Saleor is built on the Django framework, which comes with support for locales in the format en-gb:
Represents the name of a language. Browsers send the names of the languages they accept in the Accept-Language HTTP header using this format. Examples: it, de-at, es, pt-br. Language codes are generally represented in lowercase, but the HTTP Accept-Language header is case-insensitive. The separator is a dash.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Language
<language>
A language tag (which is sometimes referred to as a "locale identifier"). This consists of a 2-3 letter base language tag representing the language, optionally followed by additional subtags separated by '-'. The most common extra information is the country or region variant (like 'en-US' or 'fr-CA') or the type of alphabet to use (like 'sr-Latn'). Other variants like the type of orthography ('de-DE-1996') are usually not used in the context of this header.
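As a small illustration of the header format described above, here is a minimal sketch of parsing Accept-Language into tags ordered by quality value. This is illustrative only; a real server should use a tested library such as the negotiator package.

```javascript
// Minimal sketch: split an Accept-Language header into tags and sort
// them by their "q" (quality) parameter, defaulting to q=1.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, ...params] = part.trim().split(';');
      const qParam = params.find((p) => p.trim().startsWith('q='));
      const q = qParam ? parseFloat(qParam.trim().slice(2)) : 1;
      return { tag: tag.trim(), q };
    })
    .sort((a, b) => b.q - a.q)
    .map((entry) => entry.tag);
}

console.log(parseAcceptLanguage('en-GB,en;q=0.9,de;q=0.8'));
// → [ 'en-GB', 'en', 'de' ]
```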
console.log(Intl.getCanonicalLocales('EN-US'));
// expected output: Array ["en-US"]

console.log(Intl.getCanonicalLocales(['EN-US', 'Fr']));
// expected output: Array ["en-US", "fr"]

try {
  Intl.getCanonicalLocales('EN_US');
} catch (err) {
  console.log(err);
  // expected output: RangeError: invalid language tag: EN_US
}
@nbezalwar
IMO you should change the enum from LanguageCode to LocaleCode. A locale is a combination of language + country and is usually in the xx-XX format, e.g. en-US or en-GB.
That seems to cover most cases, but e.g. with Chinese we have 2 dimensions: the script and the location. So we can have:

zh-Hans-HK = Chinese as spoken in Hong Kong, written in simplified script
zh-Hant-HK = Chinese as spoken in Hong Kong, written in traditional script

so that all the Chinese variations end up as:
Chinese
Chinese (Hong Kong SAR China)
Chinese (Macao SAR China)
Chinese (Simplified, China)
Chinese (Simplified, Hong Kong SAR China)
Chinese (Simplified, Macao SAR China)
Chinese (Simplified, Singapore)
Chinese (Simplified)
Chinese (Singapore)
Chinese (Taiwan)
Chinese (Traditional, Hong Kong SAR China)
Chinese (Traditional, Macao SAR China)
Chinese (Traditional, Taiwan)
Chinese (Traditional)
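Incidentally, English display names like the ones above don't have to be hard-coded: Intl.DisplayNames (available in modern browsers and Node.js 14+ with full ICU) can generate them from the tags. A quick sketch (the exact strings returned depend on the runtime's ICU/CLDR data):

```javascript
// Sketch: derive English display names for language tags at runtime.
const languageNames = new Intl.DisplayNames(['en'], { type: 'language' });

for (const tag of ['zh', 'zh-Hans', 'zh-Hant-HK', 'zh-Hant-TW']) {
  console.log(`${tag} → ${languageNames.of(tag)}`);
}
```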
On top of that, a Chinese speaker in Hong Kong may want to translate into Cantonese (zh-yue), which is then another dimension!
I'm sure there are other important cases where the en-GB model breaks down, given the rich variety of human languages and history!
However, for practical purposes, the en-GB model will probably suffice. Otherwise it becomes a bit of a rabbit hole, trying to support all combinations of the world's cultural diversity, which might not be time well spent compared to getting the core features finished :D
So deciding to use the <language>-<region> system then presents the question of which combinations to include.
Covering every possibility would involve (~175 ISO 639-1 codes) * (~250 ISO 3166 region codes) = over 43,000 combinations. The problem is stated nicely in this StackOverflow answer:
This means a very large number of possible combinations, most of which make little sense, like ab-AX, which means Abkhaz as spoken in Åland (I don’t think anyone, still less any community, speaks Abkhaz there, though it is theoretically possible of course).
So any list of language-region combinations would be just a pragmatic list of combinations that are important in some sense, or supported by some software in some special sense.
Such a pragmatic list could be obtained from this Unicode CLDR summary chart, taking all rows from the table labeled Locale Display Names | Languages that have a 2-letter language code with an optional region part.
This can be expressed as JavaScript that can be run directly on the linked page:
Array.from(document.querySelectorAll('.body table')[1].querySelectorAll('tr'))
  .filter(row => {
    const pageCell = row.querySelector('td:nth-child(3)');
    return pageCell && pageCell.innerText.startsWith('Languages ');
  })
  .filter(row => {
    const codeCell = row.querySelector('td:nth-child(7)');
    return codeCell && codeCell.innerText.match(/^[a-z]{2}(_[A-Za-z]+)?$/);
  })
  .map(row => {
    const nameCell = row.querySelector('td:nth-child(6)');
    const codeCell = row.querySelector('td:nth-child(7)');
    return `${codeCell.innerText}: ${nameCell.innerText}`;
  })
  .join('\n');
which yields the following list of 157 languages:
af: Afrikaans
ak: Akan
sq: Albanian
am: Amharic
ar: Arabic
hy: Armenian
as: Assamese
az: Azerbaijani
bm: Bambara
bn: Bangla
eu: Basque
be: Belarusian
bs: Bosnian
br: Breton
bg: Bulgarian
my: Burmese
ca: Catalan
ce: Chechen
zh: Chinese
zh_Hans: Simplified Chinese
zh_Hant: Traditional Chinese
cu: Church Slavic
kw: Cornish
co: Corsican
hr: Croatian
cs: Czech
da: Danish
nl: Dutch
nl_BE: Flemish
dz: Dzongkha
en: English
en_AU: Australian English
en_CA: Canadian English
en_GB: British English
en_US: American English
eo: Esperanto
et: Estonian
ee: Ewe
fo: Faroese
fi: Finnish
fr: French
fr_CA: Canadian French
fr_CH: Swiss French
ff: Fulah
gl: Galician
lg: Ganda
ka: Georgian
de: German
de_AT: Austrian German
de_CH: Swiss High German
el: Greek
gu: Gujarati
ht: Haitian Creole
ha: Hausa
he: Hebrew
hi: Hindi
hu: Hungarian
is: Icelandic
ig: Igbo
id: Indonesian
ia: Interlingua
ga: Irish
it: Italian
ja: Japanese
jv: Javanese
kl: Kalaallisut
kn: Kannada
ks: Kashmiri
kk: Kazakh
km: Khmer
ki: Kikuyu
rw: Kinyarwanda
ko: Korean
ku: Kurdish
ky: Kyrgyz
lo: Lao
la: Latin
lv: Latvian
ln: Lingala
lt: Lithuanian
lu: Luba-Katanga
lb: Luxembourgish
mk: Macedonian
mg: Malagasy
ms: Malay
ml: Malayalam
mt: Maltese
gv: Manx
mi: Maori
mr: Marathi
mn: Mongolian
ne: Nepali
nd: North Ndebele
se: Northern Sami
nb: Norwegian Bokmål
nn: Norwegian Nynorsk
ny: Nyanja
or: Odia
om: Oromo
os: Ossetic
ps: Pashto
fa: Persian
fa_AF: Dari
pl: Polish
pt: Portuguese
pt_BR: Brazilian Portuguese
pt_PT: European Portuguese
pa: Punjabi
qu: Quechua
ro: Romanian
ro_MD: Moldavian
rm: Romansh
rn: Rundi
ru: Russian
sm: Samoan
sg: Sango
sa: Sanskrit
gd: Scottish Gaelic
sr: Serbian
sn: Shona
ii: Sichuan Yi
sd: Sindhi
si: Sinhala
sk: Slovak
sl: Slovenian
so: Somali
st: Southern Sotho
es: Spanish
es_ES: European Spanish
es_MX: Mexican Spanish
su: Sundanese
sw: Swahili
sw_CD: Congo Swahili
sv: Swedish
tg: Tajik
ta: Tamil
tt: Tatar
te: Telugu
th: Thai
bo: Tibetan
ti: Tigrinya
to: Tongan
tr: Turkish
tk: Turkmen
uk: Ukrainian
ur: Urdu
ug: Uyghur
uz: Uzbek
vi: Vietnamese
vo: Volapük
cy: Welsh
fy: Western Frisian
wo: Wolof
xh: Xhosa
yi: Yiddish
yo: Yoruba
zu: Zulu
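One wrinkle with the list above: the CLDR chart (like a GraphQL enum, which cannot contain hyphens) writes codes with underscores (pt_BR, zh_Hant), while IETF/BCP 47 tags use hyphens (pt-BR, zh-Hant). A minimal sketch of converting between the two forms, using Intl.getCanonicalLocales for validation (the helper names here are illustrative, not part of any existing API):

```javascript
// Convert an underscore-style enum member to a canonical BCP 47 tag.
// Throws a RangeError if the resulting tag is not a valid locale.
function enumToTag(code) {
  return Intl.getCanonicalLocales(code.replace(/_/g, '-'))[0];
}

// Convert a BCP 47 tag back to an underscore-style enum member.
function tagToEnum(tag) {
  return tag.replace(/-/g, '_');
}

console.log(enumToTag('pt_BR'));   // → "pt-BR"
console.log(tagToEnum('zh-Hant')); // → "zh_Hant"
```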
Great solution! Just to complement the prior art: some projects even aim to support the 100 most popular locales, like Material-UI, and all are published on Crowdin so that everyone can suggest translations.
IMO, this is a good list of 125 languages to support. https://www.andiamo.co.uk/resources/iso-language-codes/
For Chinese, zh-CN and zh-TW are the 2 variants for simplified and traditional Chinese.
A recent pull request #354 adds support for Traditional Chinese (as spoken in HK and Taiwan). We already have Simplified Chinese translations. This exposes a shortcoming in our language handling: we currently only list the ISO 639-1 codes for languages, without distinguishing between regional variations.
So, e.g. we have the LanguageCode enum, which is widely used throughout the core and which only has "zh" for all variations of Chinese.

We could add support for common IETF language tags (which I have used in this PR), which would then allow us to support the zh-CN and zh-TW UI translations, as well as translations in places such as product translations etc.

The question then is "which full IETF tags to support?"
Add all possible IETF tags
Cases like Chinese are interesting because the actual writing system is distinct between Simplified and Traditional. Whereas an American can totally understand anything written in British English and vice-versa, does the same apply to Chinese speakers in Beijing and Taipei?
How about Madrid (es-ES) and Mexico City (es-MX)? I personally cannot answer these questions, since I am only very familiar with English (and to a lesser extent German).
Adding all variations seems like a bad idea, e.g. according to https://datahub.io/core/language-codes#resource-ietf-language-tags there are over 100 variations of English and over 40 variations of French. In practice, a single "English" version of anything would probably suffice (although I say this as an English speaker from England, so perhaps my perspective is too narrow). In any case, I'm pretty sure that listing 100 variations of English in UI menus is not a desirable solution.
Add tags on an ad-hoc basis
Another approach would be to default to the ISO 639-1 codes as we currently have, and only add regional variations as-and-when the need arises (e.g. a pull request like the one that triggered this issue comes in).
In this case, we'd then need to update the LanguageCode enum as well as the language-translation-strings.ts file to allow localization of these new variants.
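To make the ad-hoc approach concrete, here is an illustrative sketch of what such an update might look like. The member names and structure are assumed from the discussion above, not taken from the actual Vendure source:

```javascript
// Hypothetical sketch of the LanguageCode enum after ad-hoc additions.
// In the real (TypeScript) codebase this would be a GraphQL-generated
// enum; underscores are used because enum members cannot contain "-".
const LanguageCode = {
  // ...existing ISO 639-1 codes...
  zh: 'zh',
  // Added ad hoc when the Simplified/Traditional translations arrived:
  zh_Hans: 'zh_Hans',
  zh_Hant: 'zh_Hant',
};

// language-translation-strings.ts would gain matching display names:
const languageTranslationStrings = {
  zh: 'Chinese',
  zh_Hans: 'Simplified Chinese',
  zh_Hant: 'Traditional Chinese',
};

console.log(languageTranslationStrings[LanguageCode.zh_Hant]);
// → "Traditional Chinese"
```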