1. Add more languages

e.g. `ltg` for Latgalian.

2. Add names for more languages

We should add names for more languages, especially:
- languages with a 2-letter code
- languages supported by more than one API

But could we just get the names for all the codes programmatically, from Wikipedia?
The additional metadata like language families, scripts and typology could be filled in later - though even some of those are available in a structured way from Wikipedia.
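As a rough sketch of that idea: Wikidata (Wikipedia's structured sibling) exposes ISO 639-3 codes as property P220, so a short script against its public SPARQL endpoint could pull every code with its English name. The query shape, User-Agent, and function name below are illustrative assumptions, not existing project code:

```python
# Sketch: fetch all ISO 639-3 codes and English names from Wikidata.
# Assumes property P220 ("ISO 639-3 code") and the public SPARQL endpoint.
import requests

QUERY = """
SELECT ?code ?langLabel WHERE {
  ?lang wdt:P220 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def fetch_language_names():
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        # Wikimedia asks for a descriptive User-Agent; this one is a placeholder.
        headers={"User-Agent": "machinetranslate-names-sketch/0.1"},
        timeout=60,
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return {row["code"]["value"]: row["langLabel"]["value"] for row in rows}

if __name__ == "__main__":
    names = fetch_language_names()
    print(len(names), "codes, e.g. ltg =", names.get("ltg"))
```
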
3. Check that codes exist and are correct
We added `zaz` for Baidu - https://machinetranslate.org/zaz - but if you click the link to Wikipedia, there is no such code. If you click the link to Baidu's docs, it lists:

> 扎扎其语 zaz

If you translate that, it's actually Zazaki (Zaza Kurdish). The correct code for Zazaki is `zza`, and we have an article on that, but someone going there would miss the fact that Baidu supports it!
So in this case we just need to make an addition to `api_language.json`:

```json
"baidu": {
  "zaz": "zza"
}
```
But what we really need is a way to check for basic errors like this across all the codes.
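A check like that could run in CI. Here is a minimal sketch, assuming `api_language.json` maps each API to a `{api_code: standard_code}` object as in the snippet above, and using the pycountry package (a recent version, where `languages.get()` returns None on a miss) as the reference list of ISO 639-3 codes:

```python
# Sketch: flag entries in api_language.json whose standard code is not
# a known ISO 639-3 code. File layout and function name are assumptions.
import json

import pycountry  # pip install pycountry; bundles the ISO 639-3 registry

def check_codes(path="api_language.json"):
    with open(path) as f:
        overrides = json.load(f)
    errors = []
    for api, mapping in overrides.items():
        for api_code, std_code in mapping.items():
            # languages.get() returns None when the code is unknown.
            if pycountry.languages.get(alpha_3=std_code) is None:
                errors.append(f"{api}: {api_code!r} maps to unknown code {std_code!r}")
    return errors

if __name__ == "__main__":
    for error in check_codes():
        print(error)
```

The same loop over the site's own language list would catch the other half of the problem: a published article whose code isn't in the registry at all.
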
4. Add coverage of pre-trained models
Languages like Latgalian, for example, are not supported by any API yet, but are supported by pre-trained models like NLLB. We could add a `models.json` and display that info in the language's article.
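A hypothetical shape for `models.json`, mirroring `api_language.json`; the model key, fields, and language list below are placeholders, not a real inventory:

```json
{
  "nllb": {
    "name": "NLLB-200",
    "languages": ["ltg"]
  }
}
```
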
In short: just get all language codes and names, and make sure the codes we already have are actually valid.