scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Create Spanish to all other languages translation process #78

Closed andrewtavis closed 6 months ago

andrewtavis commented 7 months ago

Description

The goal of this issue is to create a process whereby a single file is used to translate all words within Spanish/translations/words_to_translate.json to all other Scribe languages. To achieve this we'll be using m2m100_418M, with the output being a JSON file in which each source string is a key whose value holds the translation for each target language, keyed by language. This can then be transferred to an SQLite database table in which each source string is the row index and each language has its own column.
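As a rough illustration of the JSON-to-SQLite step described above, here is a minimal sketch using only the Python standard library. The sample data, the `load_translations` helper, the `translations` table name and the `word` column are all assumptions made for this example; the actual file layout and schema in Scribe-Data may differ.

```python
import json
import sqlite3

# Hypothetical sample of the translation output: source word -> {language: translation}.
translations_json = """
{
  "libro": {"de": "Buch", "en": "book", "fr": "livre"},
  "casa": {"de": "Haus", "en": "house", "fr": "maison"}
}
"""


def load_translations(conn, translations, table="translations"):
    """Write one row per source word with one column per language (hypothetical schema)."""
    # Collect every language key that appears so that each one gets its own column.
    languages = sorted({lang for entry in translations.values() for lang in entry})
    columns = ", ".join(f'"{lang}" TEXT' for lang in languages)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (word TEXT PRIMARY KEY, {columns})'
    )
    placeholders = ", ".join("?" for _ in range(len(languages) + 1))
    # Missing translations become NULL via dict.get.
    rows = [
        (word, *(entry.get(lang) for lang in languages))
        for word, entry in translations.items()
    ]
    conn.executemany(f'INSERT OR REPLACE INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return languages


conn = sqlite3.connect(":memory:")
languages = load_translations(conn, json.loads(translations_json))
row = conn.execute(
    'SELECT "en", "de" FROM translations WHERE word = ?', ("libro",)
).fetchone()
print(row)
```

Keeping the source word as the primary key means a lookup for any language is a single indexed query on one row.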

Of particular importance is getting a metric for the accuracy of each translation and applying a cutoff so that low-quality translations are not included in Scribe applications :)
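The issue does not say which accuracy metric would be used or how it would be computed, so the following is only a sketch of the cutoff idea. It assumes each translation already comes with some quality score between 0 and 1; the scores, the `apply_cutoff` helper and the 0.5 threshold are all hypothetical.

```python
# Hypothetical scored output: source word -> {language: (translation, score in [0, 1])}.
scored = {
    "libro": {"en": ("book", 0.97), "de": ("Buch", 0.95)},
    "banco": {"en": ("bank", 0.81), "de": ("Bank", 0.42)},
    "vela": {"en": ("sail", 0.35)},
}


def apply_cutoff(scored_translations, cutoff):
    """Keep only translations whose score is at least `cutoff`."""
    kept = {}
    for word, by_lang in scored_translations.items():
        good = {
            lang: text
            for lang, (text, score) in by_lang.items()
            if score >= cutoff
        }
        # Drop the word entirely when no translation passes the cutoff.
        if good:
            kept[word] = good
    return kept


filtered = apply_cutoff(scored, cutoff=0.5)
print(filtered)
```

Filtering per language rather than per word keeps the good translations of a word even when one target language falls below the threshold.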

Contribution

Happy to work on this or support someone with interest in working on it!

henrikth93 commented 6 months ago

I am interested in this.

andrewtavis commented 6 months ago

Thanks @henrikth93! 🥳

andrewtavis commented 6 months ago

Hey @henrikth93 👋 The process has been set up and we're ready to implement here :) It's actually quite streamlined now. If you make a version of scribe_data/extract_transform/languages/English/translations/translate_words.py with SRC_LANG changed to Spanish, we should be good to go here 😊

Give it a test to see if it's working on your end by running the script in the header and letting it run for one batch so we can see what comes out!
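For readers unfamiliar with m2m100_418M, a batched translation loop with the Hugging Face `transformers` library could look roughly like the sketch below. This is not the contents of translate_words.py: the `batched` and `translate_words` helpers, the batch size and the use of ISO codes such as "es" are assumptions for illustration. Only the `M2M100ForConditionalGeneration` and `M2M100Tokenizer` calls are the library's own API.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_words(words, src_lang, target_langs, batch_size=100):
    """Translate `words` from `src_lang` into each language in `target_langs`."""
    # Imported lazily so that `batched` can be used without the model installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    tokenizer.src_lang = src_lang  # e.g. "es" for Spanish

    translations = {word: {} for word in words}
    for batch in batched(words, batch_size):
        encoded = tokenizer(batch, return_tensors="pt", padding=True)
        for lang in target_langs:
            # Forcing the first generated token selects the target language.
            generated = model.generate(
                **encoded, forced_bos_token_id=tokenizer.get_lang_id(lang)
            )
            decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
            for word, text in zip(batch, decoded):
                translations[word][lang] = text
    return translations


# Example call (downloads the model on first use):
# translate_words(["libro", "casa"], src_lang="es", target_langs=["en", "de"])
```

Limiting the word list to a single batch, as suggested above, is a quick way to check the output format before committing to a full run.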

andrewtavis commented 6 months ago

Closed via #118 🥳 Thank you, @henrikth93!