Closed andrewtavis closed 8 months ago
Can you give me more information? Where should I put the output data: still in the same directory as the script? And can you please show me a standard output JSON format? @andrewtavis
Hey @Linfye 👋 Thanks for reaching out :) Generally the goal for this would be that the data would go into the English formatted_data directory. As far as the output is concerned, it would be something like this:
[
{
"word": {
"fr": "word_in_french",
"de": "word_in_german",
...
}
},
{
...
}
]
The reason that there's no whitespace is to keep the file size down :) Do you have experience with Hugging Face or this sort of task yet, @Linfye? What about Google Colab? Happy to write up some pseudocode for it if that would help!
@andrewtavis No, I've never worked with Hugging Face, but I did use Google Colab for some ML assignments. And I wonder if all languages in the directory are needed, or just the languages the app currently supports?
Thanks for the PR already, @Linfye! We'll check this out soon. As of now it would be for the other languages in the repo only, and we'll expand it out as needed. Hope the Hugging Face experience was a fun one!
cc @wkyoshida
Description
The goal of this issue is to create a process whereby a single file is used to translate all words within English/translations/words_to_translate.json to all other Scribe languages. To achieve this we'll be using m2m100_418M, with the output being a JSON file that maps each string to keyed values for each language. This can then be transferred to an SQLite database table in which each string indexes a row and each language has a column holding its translation.
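A rough sketch of how this process could look, just to make the steps concrete. It assumes that words_to_translate.json holds a plain list of English words and that the Hugging Face model ID is facebook/m2m100_418M. The function names, the table name translations and the list of target languages are hypothetical and not taken from the repo. The translator is passed in as a function so the JSON and SQLite steps can be tried without downloading the model.

```python
import json
import sqlite3


def load_m2m100_translator(model_name="facebook/m2m100_418M"):
    """Return translate(text, src_lang, tgt_lang) backed by M2M100 (assumed model ID)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)

    def translate(text, src_lang, tgt_lang):
        tokenizer.src_lang = src_lang
        encoded = tokenizer(text, return_tensors="pt")
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

    return translate


def translate_words(words, target_langs, translate, src_lang="en"):
    """Build [{word: {lang: translation, ...}}, ...] for every word."""
    return [
        {word: {lang: translate(word, src_lang, lang) for lang in target_langs}}
        for word in words
    ]


def to_compact_json(translations):
    # No whitespace between tokens, to keep the file size down.
    return json.dumps(translations, ensure_ascii=False, separators=(",", ":"))


def to_sqlite(translations, target_langs, connection, table="translations"):
    """Write one row per word and one column per language (hypothetical schema)."""
    columns = ", ".join(f'"{lang}" TEXT' for lang in target_langs)
    connection.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (word TEXT PRIMARY KEY, {columns})'
    )
    placeholders = ", ".join("?" for _ in range(len(target_langs) + 1))
    for entry in translations:
        for word, by_lang in entry.items():
            connection.execute(
                f'INSERT OR REPLACE INTO "{table}" VALUES ({placeholders})',
                [word] + [by_lang.get(lang) for lang in target_langs],
            )
    connection.commit()
```

With the real model this would be called as translate_words(words, ["fr", "de"], load_m2m100_translator()), where the language codes are the ones M2M100 expects.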
Of particular importance is getting a metric for the accuracy of each translation and applying a cutoff so that low-quality translations are not included in Scribe applications :)
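One possible way to approximate such a metric, purely as an assumption since the issue doesn't name one, is a round-trip check: translate the word into the target language, translate it back to English, and compare the result with the original. The sketch below uses difflib string similarity as the score and a hypothetical cutoff of 0.8. The function names are made up for illustration, and translate is any function with the signature translate(text, src_lang, tgt_lang).

```python
from difflib import SequenceMatcher


def round_trip_score(word, lang, translate, src_lang="en"):
    """Translate word -> lang -> src_lang and score how close the result is."""
    forward = translate(word, src_lang, lang)
    back = translate(forward, lang, src_lang)
    # Similarity in [0, 1]; 1.0 means the round trip returned the same word.
    score = SequenceMatcher(None, word.lower(), back.lower()).ratio()
    return forward, score


def filter_translations(words, target_langs, translate, cutoff=0.8):
    """Keep only translations whose round-trip score reaches the cutoff."""
    kept = []
    for word in words:
        by_lang = {}
        for lang in target_langs:
            forward, score = round_trip_score(word, lang, translate)
            if score >= cutoff:  # hypothetical threshold, to be tuned
                by_lang[lang] = forward
        kept.append({word: by_lang})
    return kept
```

A round trip is only a proxy: a synonym coming back would be scored as a miss even if the translation is fine, so the cutoff would need tuning against some manually checked words.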
Contribution
Happy to work on this as a first example of the translation process or support someone with interest in working on it, and then others can take over for the other languages!