scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Create German to all other languages translation process #74

Closed andrewtavis closed 7 months ago

andrewtavis commented 8 months ago

Description

The goal of this issue is to create a process whereby a single file is used to translate all words within German/translations/words_to_translate.json into all other Scribe languages. To achieve this we'll be using m2m100_418M, with the output being a JSON file that maps each source string to keyed translations for each language. This can then be transferred to an SQLite database table in which each string is a row index with a column value for each language.
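
As a rough illustration of the last step, here is a minimal sketch of loading such a JSON file into an SQLite table with one column per target language. The JSON shape, the table name `translations`, the language codes and the sample words are all assumptions made for this example, not the actual Scribe-Data schema.

```python
import json
import sqlite3

# Hypothetical output shape: each source word maps to {language_code: translation}.
translations_json = """
{
  "Buch": {"en": "book", "fr": "livre", "es": "libro"},
  "Haus": {"en": "house", "fr": "maison", "es": "casa"}
}
"""


def load_translations(conn, translations, languages):
    """Create a table with one row per source word and one column per language."""
    columns = ", ".join(f'"{lang}" TEXT' for lang in languages)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS translations (word TEXT PRIMARY KEY, {columns})"
    )
    placeholders = ", ".join("?" for _ in range(len(languages) + 1))
    rows = [
        # A missing translation becomes NULL in that language's column.
        (word, *(by_lang.get(lang) for lang in languages))
        for word, by_lang in translations.items()
    ]
    conn.executemany(
        f"INSERT OR REPLACE INTO translations VALUES ({placeholders})", rows
    )
    conn.commit()


translations = json.loads(translations_json)
languages = ["en", "fr", "es"]  # assumed target language codes
conn = sqlite3.connect(":memory:")
load_translations(conn, translations, languages)
print(conn.execute('SELECT "fr" FROM translations WHERE word = ?', ("Haus",)).fetchone())
```

Keeping the source word as the primary key means a lookup for one word returns every language in a single row.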

Of particular importance is getting a metric for the accuracy of each translation and applying a cutoff so that low-quality translations are no longer included in Scribe applications :)
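
One possible shape for the cutoff, sketched under assumptions: this issue does not say which accuracy metric will be used, so the scores below are made-up numbers in the range 0 to 1, and `filter_by_score`, `CUTOFF` and the data layout are hypothetical names for illustration only.

```python
def filter_by_score(scored, cutoff):
    """Keep only translations whose score is at least `cutoff`.

    `scored` maps word -> language -> (translation, score).
    Words with no surviving translation are dropped entirely.
    """
    kept = {}
    for word, by_lang in scored.items():
        good = {
            lang: text
            for lang, (text, score) in by_lang.items()
            if score >= cutoff
        }
        if good:
            kept[word] = good
    return kept


# Made-up scores; a real metric would come from the translation step.
scored = {
    "Buch": {"en": ("book", 0.94), "fr": ("livre", 0.91)},
    "Haus": {"en": ("house", 0.88), "fr": ("mason", 0.41)},
}
CUTOFF = 0.6  # assumed threshold, to be tuned against real data

kept = filter_by_score(scored, CUTOFF)
print(kept)
```

Filtering per language rather than per word means one weak translation does not discard the good ones for the same word.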

Contribution

Happy to work on this or support someone with interest in working on it!

mhmohona commented 8 months ago

Can I work on this issue?

andrewtavis commented 8 months ago

You certainly can, @mhmohona. Let me just merge one of the other ones first so that we can use it as a reference for the rest and keep things consistent :) Could you let me know what your Python experience is as well?

andrewtavis commented 8 months ago

No need to fill me in on the Python experience, @mhmohona 😊 Looks great based on your profile! Again let me merge one in, but your feedback on the process would be very welcome!

andrewtavis commented 8 months ago

Hey @mhmohona 👋 The process has been set up and we're ready to implement here :) It's actually quite streamlined now. If you make a version of scribe_data/extract_transform/languages/English/translations/translate_words.py that sets SRC_LANG to "German" we should be good to go here 😊

Give it a test to see if it's working on your end by running the script in the header and letting it run for one batch so we can see what comes out!

mhmohona commented 8 months ago

Thanks for letting me know, @andrewtavis!

mhmohona commented 8 months ago

@andrewtavis, I tried to run the scribe_data/extract_transform/languages/English/translations/translate_words.py file and got the following error on my device: [image]

From Google Colab: [image]

andrewtavis commented 8 months ago

Have you changed the SRC_LANG to "German" in the file?

mhmohona commented 8 months ago

Yes, it's my script: https://github.com/mhmohona/Scribe-Data/blob/german/src/scribe_data/extract_transform/languages/German/translations/translate_words.py

mhmohona commented 8 months ago

OK, I found the problem. After adjusting the parameters it was fixed. Shall I submit the script with my adjustments or the original one (the English script)? Also, adding parallel processing could improve the runtime. Can I update the script accordingly, @andrewtavis?
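
A minimal sketch of what batched parallel processing could look like, assuming the word list can be split into independent batches. `translate_batch` here is a placeholder that only upper-cases words, not the real model call, and `make_batches`, `translate_in_parallel`, the batch size and the worker count are hypothetical. Whether threads actually speed up the model would need to be measured.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(words, batch_size):
    """Split the word list into consecutive batches of at most `batch_size`."""
    return [words[i:i + batch_size] for i in range(0, len(words), batch_size)]


def translate_batch(batch):
    # Placeholder standing in for the real translation call.
    return {word: word.upper() for word in batch}


def translate_in_parallel(words, batch_size=2, max_workers=4):
    """Translate batches concurrently and merge the per-batch results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input batches.
        for partial in pool.map(translate_batch, make_batches(words, batch_size)):
            results.update(partial)
    return results


words = ["Buch", "Haus", "Baum", "Wasser", "Brot"]
result = translate_in_parallel(words)
print(result)
```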

andrewtavis commented 7 months ago

Sounds great, @mhmohona! Let's do a PR for just the German translation script and I'll update the other ones after :) Thank you!