internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add translated labels for 263 missing languages #8138

Open tfmorris opened 1 year ago

tfmorris commented 1 year ago

There are 263 language codes that are missing translations (i.e., no name_translated object in the JSON language data). At least the high-frequency ones should be translated into the languages that Open Library supports. There are two with more than 10,000 books, another 10 with 1K-10K books, and 71 with 100-1,000 books.
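A minimal sketch of how such a check could be scripted, assuming each language record is a dict parsed from the per-language JSON and that a missing or empty name_translated key means "no translations". The helper names, the sample records, and the exact bucket boundaries are illustrative assumptions, not part of the Open Library codebase.

```python
from typing import Iterable


def missing_translations(records: Iterable[dict]) -> list[str]:
    """Return the codes of records that have no usable name_translated object."""
    missing = []
    for rec in records:
        # Treat both an absent key and an empty object as "missing".
        if not rec.get("name_translated"):
            missing.append(rec.get("code", "?"))
    return missing


def bucket(book_count: int) -> str:
    """Group a code by how many books use it (boundaries are an assumption)."""
    if book_count > 10_000:
        return ">10K"
    if book_count >= 1_000:
        return "1K-10K"
    if book_count >= 100:
        return "100-1K"
    return "<100"


# Hand-written sample records; the translated value is illustrative only.
sample = [
    {"code": "eng", "name": "English", "name_translated": {"de": ["Englisch"]}},
    {"code": "und", "name": "Undetermined"},
    {"code": "grc", "name": "Greek, Ancient", "name_translated": {}},
]

print(missing_translations(sample))
print(bucket(93240), bucket(8372))
```

Sorting the missing codes by book count before bucketing would give the prioritized list described above.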

I'll attach a full list, but the top 11 are below. Note that cmn isn't a valid MARC language code (it's an ISO 639-3 code), so it may be part of a bigger problem. I've filtered out codes that are missing translations but are retired and should be migrated.

| Code | Language | Books |
| --- | --- | --- |
| und | Undetermined | 93240 |
| cmn | Mandarin | 75735 |
| mul | Multiple languages | 37768 |
| grc | Greek, Ancient | 8372 |
| ota | Turkish, Ottoman | 5460 |
| gem | Germanic (Other) | 4339 |
| raj | Rajasthani | 1660 |
| enm | English, Middle (1100-1500) | 1308 |
| mai | Maithili | 1199 |
| kok | Konkani | 1122 |
| new | Newari | 1076 |

Relevant url?

https://openlibrary.org/languages/und.json
https://openlibrary.org/languages/cmn.json

Related files

OpenLibrary-languages-missing-labels.csv

Stakeholders

@mekarpeles

tfmorris commented 1 year ago

As a stopgap for code cmn (which isn't a valid MARC language code), the labels for https://www.wikidata.org/wiki/Q9192 can be used. Spot-checking a few of the other codes that don't correspond to a particular language, such as und, mul, gem, and zxx, shows that they are in Wikidata and have at least some translated labels.
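A rough sketch of how Wikidata labels could be turned into translation entries, assuming the entity JSON served at Special:EntityData and assuming name_translated maps a language code to a list of names. The function names, the User-Agent string, and the sample label values are hypothetical; the sample is hand-written rather than copied from Wikidata.

```python
import json
import urllib.request

ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid: str) -> dict:
    """Download one Wikidata entity (network call, not exercised below)."""
    req = urllib.request.Request(
        ENTITY_URL.format(qid=qid),
        headers={"User-Agent": "language-label-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["entities"][qid]


def labels_to_name_translated(entity: dict, wanted: list[str]) -> dict:
    """Keep only labels in the wanted UI languages, one name per language.

    The {lang: [name]} output shape is an assumption about name_translated.
    """
    labels = entity.get("labels", {})
    return {
        lang: [labels[lang]["value"]]
        for lang in wanted
        if lang in labels
    }


# Hand-written stand-in for an entity; values are illustrative only.
sample_entity = {
    "labels": {
        "en": {"language": "en", "value": "Mandarin Chinese"},
        "de": {"language": "de", "value": "Mandarin"},
    }
}

print(labels_to_name_translated(sample_entity, ["en", "de", "fr"]))
```

Languages with no label on the Wikidata item are simply skipped, so the result may cover only some of the supported UI languages.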