machinetranslate / machinetranslate.org

Open information and community for machine translation
https://machinetranslate.org
Creative Commons Attribution Share Alike 4.0 International

Better coverage of very low-resource languages #566

Open bittlingmayer opened 11 months ago

bittlingmayer commented 11 months ago

Just get all language codes and names, and make sure the codes we already have are actually valid.

1. Add more languages

e.g. ltg for Latgalian

2. Add names for more languages

We should add names for more languages, especially:

3. Check that codes exist and are correct

We added zaz for Baidu - https://machinetranslate.org/zaz - but if you click the link to Wikipedia, there is no such code. If we click the link to Baidu's docs, it lists:

扎扎其语 zaz

If I translate that, it's actually for Zazaki (Zaza Kurdish). The correct code for Zazaki is zza, and we have an article on that, but someone going there would miss the fact that Baidu supports it! So in this case we just need to make an addition to api_language.json:

"baidu": {
  "zaz": "zza"
}

But what we really need is a way to check for basic errors like this across all the codes.
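
A minimal sketch of such a check, under stated assumptions: `KNOWN_CODES` is a tiny hypothetical stand-in for a full reference list of language codes, `api_codes` is a made-up per-API code list, and the mapping is assumed to have the shape API name -> {API's code: canonical code}, as in the snippet above. None of these names come from the repo.

```python
import json

# Hypothetical stand-in for a full reference list of valid language codes.
# A real check would load the complete list from a file instead.
KNOWN_CODES = {"ltg", "zza", "eng"}

# Assumed shape of the mapping: API name -> {API's code: canonical code}.
API_LANGUAGE = json.loads('{"baidu": {"zaz": "zza"}}')


def find_unknown_codes(api_codes, api_language, known_codes):
    """Return (api, code) pairs whose canonical code is not in known_codes."""
    problems = []
    for api, codes in api_codes.items():
        overrides = api_language.get(api, {})
        for code in codes:
            # Use the canonical code if the API's code has a mapping.
            canonical = overrides.get(code, code)
            if canonical not in known_codes:
                problems.append((api, code))
    return problems


# Hypothetical per-API code lists, for illustration only.
api_codes = {"baidu": ["zaz", "eng"], "example_api": ["ltg", "xx1"]}
print(find_unknown_codes(api_codes, API_LANGUAGE, KNOWN_CODES))
```

With the mapping in place, zaz resolves to zza and passes; without it, zaz would be flagged. Note that a membership check like this only catches codes that are missing from the reference list. A code that is valid but belongs to a different language would also need a comparison of language names.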

4. Add coverage of pre-trained models

e.g. languages like Latgalian are not supported by any API yet, but are supported by pre-trained models like NLLB. We could add a models.json file and display that info in the language's article.
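
One possible shape for this, as a rough sketch only: the models.json structure below, the `models_for_language` helper, and the model-specific code `ltg_Latn` are all assumptions for illustration and should be checked against the actual NLLB language list before use.

```python
import json

# Hypothetical models.json content. The structure and the entry are assumptions:
# model key -> display name plus {site language code: model-specific code}.
MODELS_JSON = """
{
  "nllb": {
    "name": "NLLB",
    "languages": {"ltg": "ltg_Latn"}
  }
}
"""


def models_for_language(models, code):
    """Return sorted (model name, model-specific code) pairs for a language code."""
    return sorted(
        (entry["name"], entry["languages"][code])
        for entry in models.values()
        if code in entry["languages"]
    )


models = json.loads(MODELS_JSON)
print(models_for_language(models, "ltg"))
```

A language's article could then call something like `models_for_language` with its own code and list the result next to the supported APIs, showing nothing when the list is empty.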

bittlingmayer commented 11 months ago

NLLB languages and codes: https://www.reddit.com/r/machinetranslation/comments/17jdoup/question_latgalian/k7hc17t/

bittlingmayer commented 11 months ago

The PR https://github.com/machinetranslate/machinetranslate.org/pull/572 for issues https://github.com/machinetranslate/machinetranslate.org/issues/571 and https://github.com/machinetranslate/machinetranslate.org/issues/567 addresses some of this.