google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Multilingual model supported language codes in machine readable format #1003

Open bittlingmayer opened 4 years ago

bittlingmayer commented 4 years ago

It would be ideal to add the list of actual language codes to the multilingual README, so that other systems can do lookups, and to prevent bugs caused by discrepancies between the internal list and the documentation:

af, sq, ar, an, hy, ast, az, ...

Instead of only:

Afrikaans Albanian Arabic Aragonese Armenian Asturian Azerbaijani ...

bittlingmayer commented 4 years ago

I sent a PR https://github.com/google-research/bert/pull/1051

bittlingmayer commented 4 years ago

BERT_LANGS = [
    'af', 'sq', 'ar', 'an', 'hy', 'ast', 'az', 'ba', 'eu', 'bar', 'be', 'bn', 'bpy', 'bs', 'br', 'bg', 'my', 'ca', 'ceb', 'ce', 'zh', 'zh-tw', 'cv', 'hr', 'cs', 'da', 'nl', 'en', 'et', 'fi', 'fr', 'gl', 'ka', 'de', 'el', 'gu', 'ht', 'he', 'hi', 'hu', 'is', 'io', 'id', 'ga', 'it', 'ja', 'jv', 'kn', 'kk', 'ky', 'ko', 'la', 'lv', 'lt', 'lmo', 'nds', 'lb', 'mk', 'mg', 'ms', 'ml', 'mn', 'mr', 'min', 'ne', 'new', 'nb', 'nn', 'oc', 'fa', 'pms', 'pl', 'pt', 'pa', 'ro', 'ru', 'sco', 'sr', 'sh', 'scn', 'sk', 'sl', 'azb', 'es', 'su', 'sw', 'sv', 'tl', 'tg', 'ta', 'tt', 'te', 'th', 'tr', 'uk', 'ur', 'uz', 'vi', 'vo', 'war', 'cy', 'fy', 'lah', 'yo'
]
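
| Wikipedia | Name | Notes |
| --- | --- | --- |
| af | Afrikaans | |
| sq | Albanian | |
| ar | Arabic | |
| an | Aragonese | |
| hy | Armenian | |
| ast | Asturian | |
| az | Azerbaijani | |
| ba | Bashkir | |
| eu | Basque | |
| bar | Bavarian | |
| be | Belarusian | |
| bn | Bengali | |
| bpy | Bishnupriya Manipuri | |
| bs | Bosnian | |
| br | Breton | |
| bg | Bulgarian | |
| my | Burmese | |
| ca | Catalan | |
| ceb | Cebuano | |
| ce | Chechen | |
| zh | Chinese (Simplified) | zh-CN |
| zh-tw | Chinese (Traditional) | zh-HK, zh-MO |
| cv | Chuvash | |
| hr | Croatian | |
| cs | Czech | |
| da | Danish | |
| nl | Dutch | |
| en | English | |
| et | Estonian | |
| fi | Finnish | |
| fr | French | |
| gl | Galician | |
| ka | Georgian | |
| de | German | |
| el | Greek | |
| gu | Gujarati | |
| ht | Haitian | |
| he | Hebrew | Previously iw |
| hi | Hindi | |
| hu | Hungarian | |
| is | Icelandic | |
| io | Ido | |
| id | Indonesian | Previously in |
| ga | Irish | |
| it | Italian | |
| ja | Japanese | |
| jv | Javanese | Previously jw |
| kn | Kannada | |
| kk | Kazakh | |
| ky | Kirghiz | |
| ko | Korean | |
| la | Latin | |
| lv | Latvian | |
| lt | Lithuanian | |
| lmo | Lombard | |
| nds | Low Saxon | |
| lb | Luxembourgish | |
| mk | Macedonian | |
| mg | Malagasy | |
| ms | Malay | |
| ml | Malayalam | |
| mn | Mongolian | In Multilingual Cased (New) only |
| mr | Marathi | |
| min | Minangkabau | |
| ne | Nepali | |
| new | Newar | |
| nb | Norwegian (Bokmal) | Also no |
| nn | Norwegian (Nynorsk) | |
| oc | Occitan | |
| fa | Persian (Farsi) | |
| pms | Piedmontese | |
| pl | Polish | |
| pt | Portuguese | |
| pa | Punjabi | |
| ro | Romanian | |
| ru | Russian | |
| sco | Scots | |
| sr | Serbian | |
| sh | Serbo-Croatian | |
| scn | Sicilian | |
| sk | Slovak | |
| sl | Slovenian | |
| azb | South Azerbaijani | |
| es | Spanish | |
| su | Sundanese | |
| sw | Swahili | |
| sv | Swedish | |
| tl | Tagalog | Macrolanguage fil |
| tg | Tajik | |
| ta | Tamil | |
| tt | Tatar | |
| te | Telugu | |
| th | Thai | In Multilingual Cased (New) only |
| tr | Turkish | |
| uk | Ukrainian | |
| ur | Urdu | |
| uz | Uzbek | |
| vi | Vietnamese | |
| vo | Volapük | |
| war | Waray-Waray | |
| cy | Welsh | |
| fy | West Frisian | Macrolanguage fry |
| lah | Western Punjabi | Macrolanguage pan |
| yo | Yoruba | |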
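
To illustrate the kind of lookup other systems might do with this list, here is a minimal sketch, not anything from the BERT repo itself. The `ALIASES` mapping and the `normalize_lang` helper are hypothetical names; the aliases are taken only from the "Previously", "Also" and regional-variant notes in the table. The "Macrolanguage" notes are deliberately left out, because it is less clear that those codes should map one-to-one.

```python
# The 104 codes from the BERT_LANGS list above, kept as a frozenset
# so that membership checks are cheap.
BERT_LANGS = frozenset("""
af sq ar an hy ast az ba eu bar be bn bpy bs br bg my ca ceb ce zh zh-tw cv hr
cs da nl en et fi fr gl ka de el gu ht he hi hu is io id ga it ja jv kn kk ky
ko la lv lt lmo nds lb mk mg ms ml mn mr min ne new nb nn oc fa pms pl pt pa ro
ru sco sr sh scn sk sl azb es su sw sv tl tg ta tt te th tr uk ur uz vi vo war
cy fy lah yo
""".split())

# Hypothetical alias table built from the Notes column (assumption: these
# alternate codes should resolve to the Wikipedia code in the same row).
# Keys are lowercase so the lookup can be case-insensitive.
ALIASES = {
    "zh-cn": "zh",
    "zh-hk": "zh-tw",
    "zh-mo": "zh-tw",
    "iw": "he",
    "in": "id",
    "jw": "jv",
    "no": "nb",
}


def normalize_lang(code):
    """Return the matching BERT_LANGS code, or None if the code is not listed."""
    key = code.strip().lower().replace("_", "-")
    key = ALIASES.get(key, key)
    return key if key in BERT_LANGS else None
```

With this sketch, `normalize_lang("iw")` returns `"he"`, `normalize_lang("zh_CN")` returns `"zh"`, and an unlisted code such as `"xx"` returns `None`.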