Closed by alexandrtovmach 1 year ago
cc @alebourne who has experience working with machine translation.
Thanks Alexandr for the very thorough explanation.
To be clear, we're talking about data which is already public (open https://crowdin.com/project/nodejs/uk in an incognito tab), just not in a convenient format for moving across applications (a single parallel file). Also, the CrowdIn files don't include the original English strings, so we would have to somehow harvest the English from this repo and align the two.
TMX, which CrowdIn supports, is designed exactly for this, TSV or JSON also works.
Hi @zeke Thank you for mentioning me. I took a brief look at the modelfront.com site and it sounds interesting to explore. One of the trickiest things in machine translation is that it is suitable for some but not necessarily for all sentences. So, if one could determine which ones are the ones that would provide the best results with MT, that could be quite useful. I would also be interested in attending the meeting if possible. I'd like to see the demo.
Okay, so I'm going to provide our .tmx file to Adam, so he can prepare a demo
@bittlingmayer nodejs.zip
Thanks, the format looks good. I've flattened it out into parallel data files for each pair.
There are 58K English originals (but roughly 1/3 are repeats, although their translations may be different). These look like only the approved translations.
It's fairly clean; there are a few segments that have multiple sentences or newline chars. (It's not trivial to convert these to multiple segments - sentence splitting is not a solved problem, and the number of newline chars in source and target is not always the same.)
Across all the languages, there are 106K segment pairs total.
1638 en.ar.tsv
18 en.bg.tsv
136 en.ca.tsv
3 en.cs.tsv
22 en.da.tsv
1506 en.de.tsv
1705 en.el.tsv
51892 en.es.tsv
184 en.fa.tsv
137 en.fi.tsv
2071 en.fr.tsv
33 en.he.tsv
178 en.hi.tsv
11 en.hr.tsv
16 en.hu.tsv
252 en.id.tsv
16445 en.it.tsv
917 en.ja.tsv
956 en.ko.tsv
1936 en.nl.tsv
1 en.no.tsv
979 en.pl.tsv
5549 en.pt.tsv
558 en.ro.tsv
7479 en.ru.tsv
7 en.sk.tsv
119 en.sr.tsv
1269 en.sv.tsv
34 en.te.tsv
1 en.th.tsv
1098 en.tr.tsv
433 en.uk.tsv
312 en.vi.tsv
7502 en.zh-cn.tsv
266 en.zh-tw.tsv
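The flattening step described above (TMX in, one `en.xx.tsv` per language pair out) can be sketched roughly as follows. This is an illustrative simplification, not the script actually used: it assumes one English `<tuv>` per `<tu>` and uses regex extraction, where a real implementation should use a proper XML parser.

```javascript
// Sketch: flatten a TMX translation memory into per-language-pair TSV lines.
// Assumes each <tu> holds one English <tuv> plus target-language <tuv>s.
// Regex parsing is a simplification for illustration only.
function flattenTmx(tmx) {
  const pairs = {}; // e.g. { 'en.es': ['source\ttarget', ...] }
  const tuRe = /<tu[\s>][\s\S]*?<\/tu>/g;
  const tuvRe = /<tuv[^>]*xml:lang="([^"]+)"[^>]*>[\s\S]*?<seg>([\s\S]*?)<\/seg>/g;
  for (const tu of tmx.match(tuRe) || []) {
    const segs = {};
    for (const m of tu.matchAll(tuvRe)) segs[m[1].toLowerCase()] = m[2].trim();
    const en = segs.en;
    if (!en) continue; // skip units with no English source
    for (const [lang, seg] of Object.entries(segs)) {
      if (lang === 'en') continue;
      (pairs[`en.${lang}`] ||= []).push(`${en}\t${seg}`);
    }
  }
  return pairs; // caller writes pairs['en.es'] to en.es.tsv, and so on
}
```

Writing each `pairs` entry out to a file then yields the per-pair TSV listing above.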
I'll train a custom translation quality prediction model for you with this now.
We can talk about how to get you quality custom translation even into the languages for which you don't have much data.
For i18n WG Meeting - April 24
We catch bad translations.
en: 'This is not a test.', pt: 'Isto não é uma prova.'
0.011319409
en: 'This is not a test.', pt: 'Isto é uma prova.'
0.9994029
1% risk means the translation is probably good, 99% risk means the translation is probably bad.
Based on deep learning, supports all languages and custom models, machine translation or human translation
API and console (and ML/NLP pipeline) built with NodeJS <3 <3 <3
https://modelfront.com/docs/api/
Support for open projects
Technical content with many tags
Pontoon, machine translation, 100+ languages
Human translations: https://console.modelfront.com/#/evaluations/5e7241fa1e793e0010b4ade5
Machine translations: https://console.modelfront.com/#/evaluations/5e73244aa0d73f001043cc82
https://docs.google.com/spreadsheets/d/1z0Plc7SD8QhjwVo0SK2gJgDoHEWslFCZvziXC7prs7E/view#gid=0
Look at the distribution!
Human translator efficiency
Technical content with many non-translatable segments and tokens
Occasional localisation of code
https://nodejs.org/en/download/ https://nodejs.org/zh-cn/download/
Conflicting translations in existing NodeJS translations
cat en.es.tsv | grep "Child Processes"
Child Processes Procesos Hijos
Child Processes Procesos Secundarios
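Conflicts like the "Child Processes" example above can be found mechanically: group the TSV lines by English source and flag any source with more than one distinct target. A minimal sketch (assuming the `source<TAB>target` layout of the flattened files):

```javascript
// Sketch: find conflicting translations in a parallel en.xx.tsv file —
// the same English source mapped to different target strings.
function findConflicts(tsvLines) {
  const bySource = new Map();
  for (const line of tsvLines) {
    const [src, tgt] = line.split('\t');
    if (!bySource.has(src)) bySource.set(src, new Set());
    bySource.get(src).add(tgt);
  }
  const conflicts = {};
  for (const [src, tgts] of bySource) {
    if (tgts.size > 1) conflicts[src] = [...tgts];
  }
  return conflicts; // { 'Child Processes': ['Procesos Hijos', 'Procesos Secundarios'] }
}
```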
Not many existing NodeJS translations for training data - 100K total, < 1K for most langs
More on my first look at NodeJS translation data in #319
Initial eval of NodeJS Spanish translations with the default translation quality prediction model: https://console.modelfront.com/#/evaluations/5ea2d6577e71f1001041a310
ModelFront will:
Find bad translations and conflicts in existing NodeJS translations
Boost translation quality prediction accuracy for NodeJS content with translations from other open-source projects like Mozilla
Provide you risk predictions for translation into all languages
Avoid putting multiple sentences into one segment
Create a Do Not Translate list (like Node, OpenJS Foundation, ARM, Windows, alloc, Buffer, SlowBuffer)
Use Do Not Translate markup (like <code></code>) consistently
Get custom machine translation (Google, Microsoft, ModernMT... See experiment for Mozilla: https://docs.google.com/spreadsheets/d/1hsxeqxjGhfHkiRT7ce9AUjm2-_q8AH1sPABHhM30ah0/edit)
Consider the null hypothesis - not translating - as an alternative engine
What are your current placeholders?
Bet on machine translation? Or just use machine translation and translation quality prediction as a suggestion?
Integration? Client lib?
(modelfront is reserved on NPM for https://github.com/modelfront/js but unstarted.)
The initial custom translation quality prediction model for NodeJS covering all NodeJS languages is currently training.
Next steps:
Audit existing NodeJS translations (@bittlingmayer / ModelFront)
Improve custom model for NodeJS using additional outside data (@bittlingmayer / ModelFront)
Ask CrowdIn about how to integrate translation quality prediction (@alexandrtovmach / NodeJS)
Start using machine translation (@alexandrtovmach / NodeJS)
@alexandrtovmach
The current CrowdIn Webhooks do not have an event for when a string first appears in the system or is first machine translated. The webhooks also do not include the actual text of the source and translation, only an id.
But it seems like something they should have and could easily add.
By the way, this weekend I got to see a demo of the Workflow manager in the new CrowdIn for Enterprise, which is based on the same webhooks.
@bittlingmayer Thanks for the investigation, but we're using the default Crowdin, not Enterprise. My thought is to use a cron job to check for TM updates on a weekly basis. I haven't had time to check it yet, but I'll play with it next week
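Pending proper webhook events, a weekly cron job could download the TM export and diff it against the previous snapshot to find strings that are new or changed. A minimal diff helper for that idea (assuming the TM is reduced to source → translation maps; fetching and parsing the Crowdin export is out of scope here):

```javascript
// Sketch: diff two TM snapshots (source → translation maps) to find strings
// that are new or whose translation changed since the last scheduled check.
function diffTm(prev, next) {
  const changed = [];
  for (const [src, tgt] of Object.entries(next)) {
    if (!(src in prev)) changed.push({ src, tgt, kind: 'new' });
    else if (prev[src] !== tgt) changed.push({ src, tgt, kind: 'updated' });
  }
  return changed; // only these entries need re-scoring for risk
}
```

Only the returned entries would then need to be sent for quality prediction, keeping the weekly job cheap.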
To be clear, webhooks are available in both versions.
I've unarchived this repo so I can close all PRs and issues before re-archiving.
I had a chat with Adam Bittlingmayer - technical co-founder of modelfront.com. He proposed using their service to help us make machine translations safe.
The core idea of the service is to calculate the accuracy risk of machine translations. Potentially, it reduces the number of strings that require a review from translators, and allows us to use machines to speed up localization.
For example, say we have 10k strings to translate. We check all of them with modelfront, and as a result get the same list of 10k strings, but with a risk value for each. Then we can machine-translate all strings with a low risk level (we can set min/max limits):
8k with risk < 0.4; 1k with risk between 0.4 and 0.8; 1k with risk >= 0.8
In this case we just pre-translate the 8k strings, and ask our translators to check the others. Sounds interesting to me.
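The bucketing described above is straightforward to implement once each string has a risk score. A sketch, where the 0.4 and 0.8 thresholds are the example limits from this comment (not values prescribed by the service) and the bucket names are made up for illustration:

```javascript
// Sketch: split scored strings into action buckets by predicted risk.
// Thresholds 0.4 / 0.8 are example limits, tunable per project.
function bucketByRisk(items, low = 0.4, high = 0.8) {
  const buckets = { autoTranslate: [], review: [], retranslate: [] };
  for (const { text, risk } of items) {
    if (risk < low) buckets.autoTranslate.push(text);      // pre-translate as-is
    else if (risk < high) buckets.review.push(text);        // send to translators
    else buckets.retranslate.push(text);                    // treat as bad
  }
  return buckets;
}
```

With the risk scores from the demo above, 'Isto não é uma prova.' (risk ≈ 0.011) would land in `autoTranslate` and 'Isto é uma prova.' (risk ≈ 0.999) in `retranslate`.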
I've invited Adam to our weekly meeting to talk more and ask questions. He also wants to prepare a demo with our data. For this he needs our Translation Memory (.tmx) from Crowdin. It can be downloaded only by @nodejs/crowdin-managers, and I'm a member of that team, but I want to ask for any objections first.
Is our TM in some way private, such that we cannot share it with third-party services? Thank you @nodejs/i18n @nodejs/i18n-api