Better coverage of very low-resource languages

bittlingmayer commented 11 months ago

Just get all language codes and names, and make sure the codes we already have are actually valid.

1. Add more languages

e.g. ltg for Latgalian

https://www.reddit.com/r/machinetranslation/comments/17jdoup/question_latgalian/
https://machinetranslate.org/ltg (404)

Could we just get all the language codes (even if they're not in any API)?

2. Add names for more languages

We should add names for more languages, especially:

- languages with a 2-letter code
- languages supported by more than 1 API

But could we just get the names for all the codes programmatically, from Wikipedia? The additional metadata like language families, scripts and typology could be filled in later - though even some of those are available in a structured way from Wikipedia.
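A rough sketch of what the "fill in names programmatically" step could look like, in Python. It assumes the codes and names have already been exported to a tab-separated table with `code` and `name` columns; that table format, the function names and the sample data are all hypothetical, and fetching the table from Wikipedia or another registry is left out. Names we already have are kept, so curated entries are not overwritten.

```python
import csv
import io


def load_code_names(table_text, code_column="code", name_column="name"):
    """Parse a tab-separated code table into a {code: name} dict.

    The column names are assumptions; adjust them to the real export.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    names = {}
    for row in reader:
        code = (row.get(code_column) or "").strip().lower()
        name = (row.get(name_column) or "").strip()
        if code and name:  # skip incomplete rows
            names[code] = name
    return names


def fill_missing_names(existing, reference):
    """Return a copy of `existing` with names added for codes it lacks."""
    merged = dict(reference)
    merged.update(existing)  # existing names win over reference names
    return merged


# Made-up sample data, only to show the shape of the inputs.
sample_table = "code\tname\nltg\tLatgalian\nzza\tZaza\nlv\tLatvian\n"
reference = load_code_names(sample_table)
site_names = {"lv": "Latvian", "zza": "Zazaki"}

merged = fill_missing_names(site_names, reference)
print(merged["ltg"])  # Latgalian (added from the reference table)
print(merged["zza"])  # Zazaki (existing name kept)
```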

3. Check that codes exist and are correct

We added zaz for Baidu - https://machinetranslate.org/zaz - but if you click the link to Wikipedia, there is no such code. If we click the link to Baidu's docs, it lists:

扎扎其语 zaz

If I translate that, it's actually Zazaki (Zaza Kurdish). The correct code for Zazaki is zza, and we have an article on that, but someone going there would miss the fact that Baidu supports it! So in this case we just need to make an addition to api_language.json:

"baidu": {
  "zaz": "zza"
}
But what we really need is a way to check for basic errors like this across all the codes.
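One way such a check could work, as a minimal Python sketch: for every API, map each code through the per-API overrides (the same shape as the api_language.json entry above) and report anything that is still not in a reference set of valid codes. The function name, the `api_codes` input and the sample data are assumptions for illustration, not the site's actual data model.

```python
def find_unknown_codes(api_codes, overrides, valid_codes):
    """Return {api: [codes]} for codes that are not valid after overrides.

    api_codes:   {api: [codes as listed by that API]}
    overrides:   {api: {api_code: corrected_code}}
    valid_codes: set of codes known to a reference registry
    """
    problems = {}
    for api, codes in api_codes.items():
        api_overrides = overrides.get(api, {})
        unknown = sorted(
            code for code in codes
            if api_overrides.get(code, code) not in valid_codes
        )
        if unknown:
            problems[api] = unknown
    return problems


# Made-up sample data for illustration.
valid_codes = {"zza", "ltg", "en"}
api_codes = {"baidu": ["zaz", "en"]}

# Without an override, zaz is flagged as unknown.
print(find_unknown_codes(api_codes, {}, valid_codes))
# → {'baidu': ['zaz']}

# With the override from this issue, nothing is flagged.
print(find_unknown_codes(api_codes, {"baidu": {"zaz": "zza"}}, valid_codes))
# → {}
```

Running something like this in CI would surface cases like zaz before they end up as dead links.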

4. Add coverage of pre-trained models

e.g. languages like Latgalian are not supported by any API yet, but are supported by pre-trained models like NLLB. We could add models.json, and display that info in the language's article.
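A possible shape for this, sketched in Python. It assumes a hypothetical models.json that maps each model to the list of codes it supports, and inverts it into a per-language index that an article could display. The layout, the function name and the `nllb` key are made up; the `ltg_Latn`-style codes follow NLLB's convention of a language code plus a script suffix, which is stripped here.

```python
import json


def languages_to_models(models):
    """Invert {model: [codes]} into {language_code: [models]}."""
    index = {}
    for model, codes in models.items():
        for code in codes:
            # Drop a script suffix such as "_Latn" if present.
            language = code.split("_", 1)[0].lower()
            index.setdefault(language, set()).add(model)
    return {language: sorted(names) for language, names in index.items()}


# Hypothetical models.json content, inlined as a string for illustration.
models_json = '{"nllb": ["ltg_Latn", "eng_Latn"]}'
models = json.loads(models_json)

index = languages_to_models(models)
print(index["ltg"])  # ['nllb']
```

Codes that differ from the site's own (e.g. eng vs. en) would still need to go through a mapping step, the same way API codes do.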

bittlingmayer commented 11 months ago

NLLB languages and codes: https://www.reddit.com/r/machinetranslation/comments/17jdoup/question_latgalian/k7hc17t/

bittlingmayer commented 11 months ago

The PR https://github.com/machinetranslate/machinetranslate.org/pull/572 for issues https://github.com/machinetranslate/machinetranslate.org/issues/571 and https://github.com/machinetranslate/machinetranslate.org/issues/567 addresses some of this.
