Open ivanistheone opened 7 years ago
related TODO: make sure TED language lookup is mappable to LE language names:
中文 (简体)
Chinese, Simplified
中文 (繁體)
Chinese, Traditional
Hrvatski
Croatian
български
Bulgarian
日本語
Japanese
한국어
Korean
کوردی
Kurdish
فارسى
Persian
Polski
Polish
Português de Portugal
Portuguese
Português brasileiro
Portuguese, Brazilian
Română
Romanian
Русский
Russian
Српски, Srpski
Serbian
Slovenčina
Slovak
Español
Spanish
Türkçe
Turkish
Tiếng Việt
Vietnamese
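A minimal sketch of how that mappability check could look, assuming a lookup keyed by native name. The NATIVE_NAME_TO_CODE table, its codes, and the normalize and unmappable_names helpers are hypothetical stand-ins for illustration; the real lookup would go through le-utils rather than a hand-written dict.

```python
import unicodedata


def normalize(name):
    """Normalize the Unicode form, trim whitespace, and casefold for comparison."""
    return unicodedata.normalize("NFC", name).strip().casefold()


# Hypothetical sample of a native-name lookup; entries and codes are illustrative only.
NATIVE_NAME_TO_CODE = {
    normalize(native_name): code
    for native_name, code in [
        ("Polski", "pl"),
        ("Español", "es"),
        ("日本語", "ja"),
    ]
}


def unmappable_names(ted_names, table=NATIVE_NAME_TO_CODE):
    """Return the TED language names that have no entry in the lookup table."""
    return [name for name in ted_names if normalize(name) not in table]


ted_names = ["Polski", "Español", "日本語", "Português brasileiro"]
missing = unmappable_names(ted_names)
print(missing)
```

Anything left in missing would need either a new entry or an explicit alias before the TED chef can rely on the lookup.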
When choosing between sw and swa as the language code for a new channel, I have to assume one of them. Both language codes are listed on Studio. It's unclear from sushi-chef-khan-academy which language code is used, because it is passed in as a command line argument.
This seems to be because the language list is used in the wrong way... is this the misunderstanding?
1) It was meant to be for looking up a language code from another origin to determine which language it is?
2) Now this list is used to list the selectable languages in Studio, meaning that the ambiguity of language codes has propagated to Studio.
Maybe we should maintain a separate list of unique language codes without the ambiguity?
The sw code is deprecated and shouldn't be used. It is there for compatibility reasons (i.e. so that existing content with that tag doesn't break).
The recommended approach for looking up language codes is to use the getlang_by_name and getlang_by_native_name helpers:
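from le_utils.constants.languages import getlang_by_name, getlang_by_native_name

# Look up by English name
lang_obj = getlang_by_name('Swahili')
lang_obj.code
## 'swa'
# Look up by native name
lang_obj = getlang_by_native_name('Kiswahili')
lang_obj.code
## 'swa'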
There is also getlang_by_alpha2, which is used to map ISO 639-1 codes to the le-utils internal representation.
This seems to be because the language list is used in the wrong way... is this the misunderstanding? It was meant to be for looking up a language code from another origin to determine which language it is?
The language codes in le-utils are referred to as the "internal representation": a far-from-perfect and far-from-consistent convention that is used by Ricecooker, Studio, and Kolibri. All external language codes must be mapped to one of the internal codes upon "entering" the Kolibri ecosystem. We maintain the le-utils language codes partially for compatibility (for channels already out there). You can learn all about the helpers and utils for doing this external-to-internal mapping in the docs, the examples, or the tests, where you'll see the various non-standard external language codes we handle (KA lang codes, YouTube, ISOs, native name variations, etc.).
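To illustrate the "map on entry" rule, here is a rough sketch of a boundary function that either returns an internal code or fails loudly. INTERNAL_CODES, EXTERNAL_ALIASES, and to_internal_code are hypothetical names and data made up for this example; they are not part of le-utils, whose own helpers are the getlang_* functions mentioned above.

```python
# Hypothetical tables for illustration; the real data lives in le-utils.
INTERNAL_CODES = {"swa", "sot", "en", "es"}
EXTERNAL_ALIASES = {
    "sw": "swa",    # the deprecated two-letter code discussed in this thread
    "es-ES": "es",  # assumed example of a regional variant from an external source
}


def to_internal_code(external_code):
    """Map an external language code to an internal one, or raise ValueError."""
    code = external_code.strip()
    if code in INTERNAL_CODES:
        return code
    if code in EXTERNAL_ALIASES:
        return EXTERNAL_ALIASES[code]
    # Failing here keeps unmapped codes from leaking into the ecosystem.
    raise ValueError("No internal language code for %r" % external_code)


print(to_internal_code("sw"))
print(to_internal_code("swa"))
```

Raising instead of passing the input through is a deliberate choice in this sketch: an unknown code gets noticed at import time instead of showing up later as a second spelling of the same language.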
Next steps on this issue:
The sw code has 1745 nodes:
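from contentcuration.models import ContentNode
sw_count = ContentNode.objects.filter(language_id='sw').count()
# Note: exclude() counts the nodes whose language is NOT 'swa', so the second
# figure below covers every other node; use filter() to count only 'swa' nodes.
swa_count = ContentNode.objects.exclude(language_id='swa').count()
print('sw count = ', sw_count)
print('swa count = ', swa_count)
## ('sw count = ', 1745)
## ('swa count = ', 9670630)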
Let's hope none of them have been published; otherwise we'll have to keep sw in.
For st there are very few:
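st_count = ContentNode.objects.filter(language_id='st').count()
# Note: exclude() counts the nodes whose language is NOT 'sot', so the second
# figure below covers every other node; use filter() to count only 'sot' nodes.
sot_count = ContentNode.objects.exclude(language_id='sot').count()
print('st count = ', st_count)
print('sot count = ', sot_count)
## ('st count = ', 296)
## ('sot count = ', 9670875)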
Looking through languagelookup.json, I found some inconsistencies in the language names: "name":"Panjabi, Punjabi" should be "name":"Panjabi; Punjabi". These can be fixed manually (change , to ;).
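A small sketch of how the candidates for that manual fix could be listed. It assumes languagelookup.json is a mapping of code to an object with a "name" field; the inline sample, its codes, and the names_with_commas helper are illustrative assumptions, not the real file contents.

```python
import json

# Assumed shape of languagelookup.json: {code: {"name": ...}}. Sample data only.
sample = json.loads("""
{
  "pa": {"name": "Panjabi, Punjabi"},
  "pl": {"name": "Polish"},
  "zh-CN": {"name": "Chinese, Simplified"}
}
""")


def names_with_commas(lookup):
    """Return sorted (code, name) pairs whose name contains a comma."""
    return sorted(
        (code, entry["name"])
        for code, entry in lookup.items()
        if "," in entry["name"]
    )


candidates = names_with_commas(sample)
for code, name in candidates:
    print(code, name)
```

The sketch only reports candidates and does not rewrite them, since a comma can separate alternative names ("Panjabi, Punjabi") or act as a qualifier ("Chinese, Simplified"), and only the first case should become a semicolon.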
Some languages have two different internal representation codes:
It might be a good idea to remove the duplicates, but check whether they exist in CCServer before removing them.
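One possible way to surface such duplicates is to group the codes by language name. This is a sketch under assumptions: the sample mapping, the language names in it, and the duplicate_codes helper are hypothetical, and the real file may spell names differently for codes that refer to the same language.

```python
from collections import defaultdict

# Hypothetical sample in the assumed {code: {"name": ...}} shape; names are illustrative.
sample = {
    "sw": {"name": "Swahili"},
    "swa": {"name": "Swahili"},
    "st": {"name": "Southern Sotho"},
    "sot": {"name": "Southern Sotho"},
    "pl": {"name": "Polish"},
}


def duplicate_codes(lookup):
    """Group codes by casefolded name and keep names that have more than one code."""
    by_name = defaultdict(list)
    for code, entry in lookup.items():
        by_name[entry["name"].casefold()].append(code)
    return {name: sorted(codes) for name, codes in by_name.items() if len(codes) > 1}


duplicates = duplicate_codes(sample)
print(duplicates)
```

Each group found this way would still need the usage check described above before any code is dropped.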