nodejs / i18n

The Node.js Internationalization Working Group – A Community Committee initiative.
MIT License

Machine translation risks #319

Closed alexandrtovmach closed 1 year ago

alexandrtovmach commented 4 years ago

I had a chat with Adam Bittlingmayer, technical co-founder of modelfront.com. He proposed using their service to help us make machine translations safe.

The core idea of the service is to estimate the accuracy risk of machine translations. Potentially, this reduces the number of strings that require review by translators and allows us to use machines to speed up localization.

For example, say we have 10k strings to translate. We check all of them with ModelFront and, as a result, get the same 10k strings back, each with a risk value. Then we can machine-translate all strings with a low risk level (we can tune the min/max thresholds):

8k strings with risk < 0.4
1k strings with risk < 0.8
1k strings with risk >= 0.8

In this case we simply pre-translate the 8k low-risk strings and ask our translators to check the rest. Sounds interesting to me.
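The triage above can be sketched as a simple bucketing function. Function and bucket names here are illustrative; the 0.4 / 0.8 thresholds are the suggested ones:

```javascript
// Sketch of risk-based triage: split strings into three buckets by
// a ModelFront-style risk score (0 = probably good, 1 = probably bad).
function bucketByRisk(items, low = 0.4, high = 0.8) {
  const buckets = { autoApprove: [], review: [], retranslate: [] };
  for (const item of items) {
    if (item.risk < low) buckets.autoApprove.push(item);   // pre-translate as-is
    else if (item.risk < high) buckets.review.push(item);  // translator double-checks
    else buckets.retranslate.push(item);                   // translator redoes
  }
  return buckets;
}

const sample = [
  { text: 'Download', risk: 0.02 },
  { text: 'Event loop', risk: 0.55 },
  { text: 'This is not a test.', risk: 0.97 },
];
console.log(bucketByRisk(sample));
```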

I've invited Adam to our weekly meeting to talk more and answer questions. He also wants to prepare a demo for us with our data. For this he needs our Translation Memory (.tmx) from Crowdin. It can be downloaded only by @nodejs/crowdin-managers, and I'm a member of that team, but I want to ask for any objections first.

Is our TM in any way private, such that we cannot share it with third-party services? Thank you @nodejs/i18n @nodejs/i18n-api

zeke commented 4 years ago

cc @alebourne who has experience working with machine translation.

bittlingmayer commented 4 years ago

Thanks Alexandr for the very thorough explanation.

To be clear, we're talking about data which is already public (open https://crowdin.com/project/nodejs/uk in an incognito tab), just not in a convenient format for moving across applications (a single parallel file). Also, the Crowdin files don't include the original English strings, so we would have to somehow harvest the English from this repo and align them.

TMX, which Crowdin supports, is designed exactly for this; TSV or JSON also works.
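For reference, a TMX translation unit pairs the English source and the target segment in one file, which is exactly the single parallel file needed here (example strings invented, header attributes trimmed for brevity):

```xml
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Download the Node.js source code.</seg></tuv>
      <tuv xml:lang="uk"><seg>Завантажте вихідний код Node.js.</seg></tuv>
    </tu>
  </body>
</tmx>
```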

alebourne commented 4 years ago

Hi @zeke Thank you for mentioning me. I took a brief look at the modelfront.com site and it sounds interesting to explore. One of the trickiest things in machine translation is that it is suitable for some but not necessarily for all sentences. So, if one could determine which ones are the ones that would provide the best results with MT, that could be quite useful. I would also be interested in attending the meeting if possible. I'd like to see the demo.

alexandrtovmach commented 4 years ago

Okay, so I'm going to provide our .tmx file to Adam so he can prepare a demo.

alexandrtovmach commented 4 years ago

@bittlingmayer nodejs.zip

bittlingmayer commented 4 years ago

Thanks, the format looks good. I've flattened it out into parallel data files for each pair.

There are 58K English originals (but roughly 1/3 are repeats, although their translations may be different). These look like only the approved translations.

It's fairly clean, though there are a few segments that contain multiple sentences or newline chars. (It's not trivial to convert these to multiple segments: sentence splitting is not a solved problem, and the number of newline chars in source and target is not always the same.)

Across all the languages, there are 106K segment pairs total.

```
 1638 en.ar.tsv
   18 en.bg.tsv
  136 en.ca.tsv
    3 en.cs.tsv
   22 en.da.tsv
 1506 en.de.tsv
 1705 en.el.tsv
51892 en.es.tsv
  184 en.fa.tsv
  137 en.fi.tsv
 2071 en.fr.tsv
   33 en.he.tsv
  178 en.hi.tsv
   11 en.hr.tsv
   16 en.hu.tsv
  252 en.id.tsv
16445 en.it.tsv
  917 en.ja.tsv
  956 en.ko.tsv
 1936 en.nl.tsv
    1 en.no.tsv
  979 en.pl.tsv
 5549 en.pt.tsv
  558 en.ro.tsv
 7479 en.ru.tsv
    7 en.sk.tsv
  119 en.sr.tsv
 1269 en.sv.tsv
   34 en.te.tsv
    1 en.th.tsv
 1098 en.tr.tsv
  433 en.uk.tsv
  312 en.vi.tsv
 7502 en.zh-cn.tsv
  266 en.zh-tw.tsv
```
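The flattening step can be sketched roughly like this. This is a regex-based reconstruction, not the actual script used, and a real pipeline should use a proper XML parser:

```javascript
// Rough sketch: flatten a TMX export into per-language-pair TSV lines
// ("source\ttarget"), keyed by pair name like 'en.uk'.
function tmxToTsv(tmx, srcLang = 'en') {
  const pairs = {};                        // e.g. { 'en.uk': ['Hello\tПривіт'] }
  const tuRe = /<tu[\s>][\s\S]*?<\/tu>/g;  // one <tu>...</tu> per translation unit
  const tuvRe = /<tuv[^>]*xml:lang="([^"]+)"[^>]*>[\s\S]*?<seg>([\s\S]*?)<\/seg>/g;
  for (const tu of tmx.match(tuRe) || []) {
    const segs = {};
    for (const m of tu.matchAll(tuvRe)) segs[m[1].toLowerCase()] = m[2];
    const src = segs[srcLang];
    if (!src) continue;                    // skip units without an English source
    for (const [lang, tgt] of Object.entries(segs)) {
      if (lang === srcLang) continue;
      const key = `${srcLang}.${lang}`;
      (pairs[key] = pairs[key] || []).push(`${src}\t${tgt}`);
    }
  }
  return pairs;
}
```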

I'll train a custom translation quality prediction model for you with this now.

We can talk about how to get you quality custom translations even for the languages where you don't have much data.

bittlingmayer commented 4 years ago

For i18n WG Meeting - April 24

About translation quality prediction and ModelFront

We catch bad translations.

en: 'This is not a test.', pt: 'Isto não é uma prova.' → risk 0.011319409

en: 'This is not a test.', pt: 'Isto é uma prova.' → risk 0.9994029

1% risk means the translation is probably good, 99% risk means the translation is probably bad.

Based on deep learning, supports all languages and custom models, machine translation or human translation

API and console (and ML/NLP pipeline) built with NodeJS <3 <3 <3

https://modelfront.com/docs/api/

Support for open projects

Demo: Mozilla

Technical content with many tags

Pontoon, machine translation, 100+ languages

Human translations: https://console.modelfront.com/#/evaluations/5e7241fa1e793e0010b4ade5

Machine translations: https://console.modelfront.com/#/evaluations/5e73244aa0d73f001043cc82

https://docs.google.com/spreadsheets/d/1z0Plc7SD8QhjwVo0SK2gJgDoHEWslFCZvziXC7prs7E/view#gid=0

Look at the distribution!

Challenges for NodeJS

Human translator efficiency

Technical content with many non-translatable segments and tokens

Occasional localisation of code

https://nodejs.org/en/download/ https://nodejs.org/zh-cn/download/

Conflicting translations in existing NodeJS translations

```
$ cat en.es.tsv | grep "Child Processes"
Child Processes	Procesos Hijos
Child Processes	Procesos Secundarios
```
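Conflicts like this can be listed exhaustively with a small script. A sketch, assuming the "source TAB target" layout of the TSVs above:

```javascript
// Sketch: find English source strings that have more than one distinct
// translation in an en.XX.tsv file (the "Child Processes" case above).
function findConflicts(tsvText) {
  const bySource = new Map();
  for (const line of tsvText.split('\n')) {
    const [src, tgt] = line.split('\t');
    if (!src || !tgt) continue;            // skip blank or malformed lines
    if (!bySource.has(src)) bySource.set(src, new Set());
    bySource.get(src).add(tgt);
  }
  // Return [source, [translation, ...]] for every conflicting source string.
  return [...bySource]
    .filter(([, targets]) => targets.size > 1)
    .map(([src, targets]) => [src, [...targets]]);
}
```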

Not many existing NodeJS translations for training data - 100K total, < 1K for most langs

More on my first look at NodeJS translation data in #319

Initial eval of NodeJS Spanish translations with the default translation quality prediction model: https://console.modelfront.com/#/evaluations/5ea2d6577e71f1001041a310

Next steps

ModelFront will:

Find bad translations and conflicts in existing NodeJS translations

Boost translation quality prediction accuracy for NodeJS content with translations from other open-source projects like Mozilla

Provide you risk predictions for translation into all languages

Recommendations for machine *

Avoid putting multiple sentences into one segment

Create a Do Not Translate list (like Node, OpenJS Foundation, ARM, Windows, alloc, Buffer, SlowBuffer)

Use Do Not Translate markup (like <code></code>) consistently

Get custom machine translation (Google, Microsoft, ModernMT... See experiment for Mozilla: https://docs.google.com/spreadsheets/d/1hsxeqxjGhfHkiRT7ce9AUjm2-_q8AH1sPABHhM30ah0/edit)

Consider the null hypothesis - not translating - as an alternative engine
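The Do Not Translate recommendation can be implemented by masking listed terms with placeholder tokens before sending text to the MT engine and restoring them afterwards. A sketch: the placeholder format is ours, and word-boundary handling is omitted for brevity:

```javascript
// Sketch: protect Do-Not-Translate terms from the MT engine.
// Longer terms come first so e.g. 'SlowBuffer' is masked before 'Buffer'.
const DNT = ['OpenJS Foundation', 'SlowBuffer', 'Node', 'ARM', 'Windows', 'alloc', 'Buffer'];

function mask(text) {
  const slots = [];
  let out = text;
  for (const term of DNT) {
    // Replace every occurrence with a placeholder the engine won't translate.
    out = out.split(term).join(`__DNT${slots.push(term) - 1}__`);
  }
  return { out, slots };
}

function unmask(text, slots) {
  // Restore the original terms after machine translation.
  return slots.reduce((t, term, i) => t.split(`__DNT${i}__`).join(term), text);
}
```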

Questions

What are your current placeholders?

Bet on machine translation? Or just use machine translation and translation quality prediction as a suggestion?

Integration? Client lib? (modelfront is reserved on NPM for https://github.com/modelfront/js but unstarted.)

bittlingmayer commented 4 years ago

The initial custom translation quality prediction model for NodeJS covering all NodeJS languages is currently training.

Next steps:

bittlingmayer commented 4 years ago

@alexandrtovmach

The current CrowdIn Webhooks do not have an event for when a string first appears in the system or is first machine translated. The webhooks also do not include the actual text of the source and translation, only an id.

But it seems like something they should have and could easily add.

By the way, this weekend I got to see a demo of the Workflow manager in the new CrowdIn for Enterprise, which is based on the same webhooks.

(Screenshot: Crowdin Enterprise Workflow manager, 2020-04-27)

alexandrtovmach commented 4 years ago

@bittlingmayer Thanks for the investigation, but we're using the default Crowdin, not Enterprise. My thought is to use a cronjob to check for TM updates on a weekly basis. I haven't had time to look into it yet, but I'll play with it next week.

bittlingmayer commented 4 years ago

To be clear, webhooks are available in both versions.

Trott commented 1 year ago

I've unarchived this repo so I can close all PRs and issues before re-archiving.