learningequality / le-utils

Utilities and constants shared across Kolibri, Ricecooker, and Kolibri Studio
MIT License

Duplicate languages and inconsistencies in languagelookup.json #30

Open ivanistheone opened 7 years ago

ivanistheone commented 7 years ago

Looking through languagelookup.json I found some inconsistencies in the languages:

ivanistheone commented 6 years ago

related TODO: make sure TED language lookup is mappable to LE language names:


中文 (简体) → Chinese, Simplified
中文 (繁體) → Chinese, Traditional
Hrvatski → Croatian
български → Bulgarian
日本語 → Japanese
한국어 → Korean
کوردی → Kurdish
فارسى → Persian
Polski → Polish
Português de Portugal → Portuguese
Português brasileiro → Portuguese, Brazilian
Română → Romanian
Русский → Russian
Српски, Srpski → Serbian
Slovenčina → Slovak
Español → Spanish
Türkçe → Turkish
Tiếng Việt → Vietnamese
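The TODO above boils down to a lookup from TED's native-language labels to LE English names. A minimal self-contained sketch follows; the `TED_LABELS` table is a hypothetical subset built from the pairs above, not real le-utils data, and in practice `getlang_by_native_name` would perform this lookup against languagelookup.json:

```python
# Hypothetical subset of the TED native-name -> English-name pairs listed above.
# In le-utils itself, getlang_by_native_name does this against languagelookup.json.
TED_LABELS = {
    "中文 (简体)": "Chinese, Simplified",
    "中文 (繁體)": "Chinese, Traditional",
    "日本語": "Japanese",
    "한국어": "Korean",
    "Tiếng Việt": "Vietnamese",
}

def english_name_for_ted_label(label):
    """Map a TED native-language label to an LE English name, or None if unknown."""
    return TED_LABELS.get(label.strip())

print(english_name_for_ted_label("日本語"))    # Japanese
print(english_name_for_ted_label("Klingon"))  # None
```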
benjaoming commented 4 years ago

In the choice of sw or swa as the language code for a new channel, I have to assume one of them. Both language codes are listed on Studio. It's unclear from sushi-chef-khan-academy which language code is used, because it is passed in as a command-line argument.

This seems to be because the language list is used in the wrong way... is this the misunderstanding?

1) The list was meant for looking up a language code from another origin, to determine which language it is?

2) But now the list is also used to populate the selectable languages in Studio, meaning that the ambiguity of the language codes has propagated to Studio?

Maybe we should maintain a separate list of unique language codes without the ambiguity?
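One way to realize that suggestion: derive a separate, unambiguous list for Studio's language picker by collapsing duplicate entries and preferring non-deprecated codes. A minimal sketch with made-up entries; the `deprecated` flag is hypothetical (le-utils does not currently store one):

```python
# Hypothetical entries: (code, english_name, deprecated).
# The deprecated flag is illustrative; le-utils does not currently store one.
ENTRIES = [
    ("sw",  "Swahili", True),   # kept only for existing content
    ("swa", "Swahili", False),  # recommended code
    ("st",  "Southern Sotho", True),
    ("sot", "Southern Sotho", False),
    ("hr",  "Croatian", False),
]

def selectable_languages(entries):
    """Return one (code, name) per language name, preferring non-deprecated codes."""
    chosen = {}
    for code, name, deprecated in entries:
        # Replace a previously chosen code only if it was deprecated and this one isn't.
        if name not in chosen or (chosen[name][1] and not deprecated):
            chosen[name] = (code, deprecated)
    return sorted((code, name) for name, (code, _) in chosen.items())

print(selectable_languages(ENTRIES))
# [('hr', 'Croatian'), ('sot', 'Southern Sotho'), ('swa', 'Swahili')]
```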

ivanistheone commented 4 years ago

The sw code is deprecated and shouldn't be used. It is there for compatibility reasons (i.e. don't break existing content with that tag).

The recommended approach for looking up language codes is to use the getlang_by_name and getlang_by_native_name helpers:

from le_utils.constants.languages import getlang_by_name, getlang_by_native_name

lang_obj = getlang_by_name('Swahili')
lang_obj.code
## 'swa'

lang_obj = getlang_by_native_name('Kiswahili')
lang_obj.code
## 'swa'

There is also getlang_by_alpha2, which maps ISO 639-1 codes to the le-utils internal representation.
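For illustration, that alpha-2 mapping amounts to a table from ISO 639-1 two-letter codes to internal codes. The sketch below is self-contained with a hypothetical subset of the table; the real mapping lives in languagelookup.json and should be accessed via getlang_by_alpha2:

```python
# Hypothetical subset of the ISO 639-1 -> internal-code mapping.
# The real data lives in languagelookup.json; use getlang_by_alpha2 in practice.
ALPHA2_TO_INTERNAL = {
    "sw": "swa",  # Swahili: alpha-2 'sw' maps to the recommended internal 'swa'
    "st": "sot",  # Southern Sotho
    "en": "en",   # many codes pass through unchanged
}

def internal_code_for_alpha2(alpha2):
    """Map an ISO 639-1 code to the internal code; raise on unknown codes."""
    try:
        return ALPHA2_TO_INTERNAL[alpha2.lower()]
    except KeyError:
        raise ValueError("unknown alpha-2 code: %r" % alpha2)

print(internal_code_for_alpha2("SW"))  # swa
```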

> This seems to be because the language list is used in the wrong way... is this the misunderstanding? It was meant to be for looking up a language code from another origin to determine which language it is?

The language codes in le-utils are referred to as the "internal representation": a far-from-perfect, and far-from-consistent, convention that is used by Ricecooker, Studio, and Kolibri. All external language codes must be mapped to one of the internal codes upon "entering" the Kolibri ecosystem. We maintain the le-utils language codes partly for compatibility (for channels already out there). You can learn all about the different helpers and utils for doing this external-to-internal mapping in the docs, the examples, or the tests, where you'll see the various non-standard external language codes we handle (KA lang codes, YouTube, ISO codes, native-name variations, etc.).


Next steps on this issue:

Some numbers

The sw code has 1745 nodes:


from contentcuration.models import ContentNode

sw_count = ContentNode.objects.filter(language_id='sw').count()
# Note: exclude() counts every node NOT tagged 'swa'
non_swa_count = ContentNode.objects.exclude(language_id='swa').count()

print('sw count = ', sw_count)
print('non-swa count = ', non_swa_count)

## ('sw count = ', 1745)
## ('non-swa count = ', 9670630)

Let's hope none of them have been published; otherwise we'll have to keep sw in.
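If none of those nodes have been published, the cleanup amounts to re-tagging them. A minimal sketch of that logic using plain dicts in place of the Django ORM; the field names mirror the query above, and the `published` flag is an assumption about the model:

```python
# Plain-dict stand-ins for ContentNode rows; 'published' is a hypothetical field.
nodes = [
    {"id": 1, "language_id": "sw", "published": False},
    {"id": 2, "language_id": "sw", "published": True},
    {"id": 3, "language_id": "swa", "published": False},
]

def retag_deprecated(nodes, old="sw", new="swa"):
    """Re-tag unpublished nodes from a deprecated code; return how many changed."""
    changed = 0
    for node in nodes:
        if node["language_id"] == old and not node["published"]:
            node["language_id"] = new
            changed += 1
    return changed

print(retag_deprecated(nodes))  # 1 -- node 2 is published, so its 'sw' tag must stay
```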

For st there are very few nodes:

st_count = ContentNode.objects.filter(language_id='st').count()
# As above, exclude() counts every node NOT tagged 'sot'
non_sot_count = ContentNode.objects.exclude(language_id='sot').count()

print('st count = ', st_count)
print('non-sot count = ', non_sot_count)

## ('st count = ', 296)
## ('non-sot count = ', 9670875)