scribe-org / Scribe-Data

Wikidata, Wiktionary and Wikipedia language data extraction
GNU General Public License v3.0

Create English to all other languages translation process #72

andrewtavis closed this issue 8 months ago

andrewtavis commented 8 months ago


Description

The goal of this issue is to create a process whereby a single file is used to translate all words within English/translations/words_to_translate.json into all other Scribe languages. To achieve this we'll be using m2m100_418M, with the output being a JSON file that maps each English string to its translations, keyed by language. This can then be transferred to an SQLite database table in which each string is the row index and each language has its own column.
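A minimal sketch of how such a process could look, assuming the Hugging Face `transformers` package and the `facebook/m2m100_418M` checkpoint. The function names, the `translate_fn` hook, and the `translations` table name are illustrative assumptions, not part of Scribe-Data:

```python
import sqlite3
from typing import Callable, Dict, List


def make_m2m100_translator(
    model_name: str = "facebook/m2m100_418M",
) -> Callable[[str, str], str]:
    """Return translate(text, target_lang) backed by M2M100 (downloads the model)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer.src_lang = "en"  # every source word is English

    def translate(text: str, target_lang: str) -> str:
        encoded = tokenizer(text, return_tensors="pt")
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang)
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate


def translate_words(
    words: List[str],
    target_langs: List[str],
    translate_fn: Callable[[str, str], str],
) -> List[Dict[str, Dict[str, str]]]:
    """Build one {word: {lang: translation}} entry per English word."""
    return [
        {word: {lang: translate_fn(word, lang) for lang in target_langs}}
        for word in words
    ]


def to_sqlite(
    entries: List[Dict[str, Dict[str, str]]],
    target_langs: List[str],
    connection: sqlite3.Connection,
) -> None:
    """Store entries in a hypothetical 'translations' table, one column per language.

    Assumes target_langs are trusted language codes such as "fr" or "de".
    """
    columns = ", ".join(f'"{lang}" TEXT' for lang in target_langs)
    connection.execute(
        f'CREATE TABLE translations ("word" TEXT PRIMARY KEY, {columns})'
    )
    placeholders = ", ".join("?" for _ in range(len(target_langs) + 1))
    for entry in entries:
        for word, by_lang in entry.items():
            connection.execute(
                f"INSERT INTO translations VALUES ({placeholders})",
                [word] + [by_lang.get(lang) for lang in target_langs],
            )
    connection.commit()
```

In practice `translate_fn` would be the result of `make_m2m100_translator()`, and `words` would be loaded from English/translations/words_to_translate.json. Passing the translator in as a parameter keeps the loop testable without downloading the model.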

Of specific importance is getting a metric for the accuracy of each translation and applying a cutoff so that low-quality translations are not included in Scribe applications :)
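The issue does not say which accuracy metric to use, so the following is only a sketch of the cutoff step. It assumes each translation already comes with some score between 0 and 1; the `apply_cutoff` name and the 0.5 threshold are placeholders:

```python
from typing import Dict, Tuple


def apply_cutoff(
    scored: Dict[str, Tuple[str, float]], threshold: float = 0.5
) -> Dict[str, str]:
    """Keep only translations whose score meets the threshold.

    scored maps a language code to (translation, score). How the score is
    computed is an open question here; 0.5 is an arbitrary placeholder.
    """
    return {
        lang: text for lang, (text, score) in scored.items() if score >= threshold
    }
```

For example, `apply_cutoff({"fr": ("livre", 0.93), "de": ("Buch", 0.41)})` keeps only the French entry, so a language with a weak translation is simply missing from that word's output.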

Contribution

Happy to work on this as a first example of the translation process or support someone with interest in working on it, and then others can take over for the other languages!

Linfye commented 8 months ago

Can you give me more information? Where should I put the output data: still in the same directory as the script? And can you please show me a standard output JSON format? @andrewtavis

andrewtavis commented 8 months ago

Hey @Linfye 👋 Thanks for reaching out :) Generally the goal for this would be that the data would go into the English formatted_data directory. As far as the output is concerned, it would be something like this:

[
{
"word": {
"fr": "word_in_french",
"de": "word_in_german",
...
}
},
{
...
}
]

The reason that there's no whitespace is to keep the file size down :) Do you have experience with Hugging Face or this sort of task yet, @Linfye? What about Google Colab? Happy to write up some pseudocode for it if that would help!
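As a small illustration of the no-whitespace point, Python's standard `json` module can write the structure above without any padding. The `dump_compact` helper name is made up for this sketch:

```python
import json
from typing import Dict, List


def dump_compact(entries: List[Dict[str, Dict[str, str]]]) -> str:
    """Serialize entries with no spaces after separators and no indentation."""
    # ensure_ascii=False keeps accented characters readable instead of \u escapes.
    return json.dumps(entries, ensure_ascii=False, separators=(",", ":"))


print(dump_compact([{"word": {"fr": "mot", "de": "Wort"}}]))
# prints [{"word":{"fr":"mot","de":"Wort"}}]
```

The default `json.dumps` separators add a space after every comma and colon, which adds up over a large word list.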

Linfye commented 8 months ago

@andrewtavis No, I've never worked with Hugging Face, but I did use Google Colab for some ML assignments. Are all languages in the directory needed, or just the languages the app currently supports?

andrewtavis commented 8 months ago

Thanks for the PR already, @Linfye! We'll check this out soon. As of now it would be for the other languages in the repo only, and we'll expand it out as needed. Hope the Hugging Face experience was a fun one!

cc @wkyoshida