mozilla / translate

Translations website utilizing Bergamot proceedings
https://mozilla.github.io/translate
Mozilla Public License 2.0

Issue translating same text from 2 different sources, from DE to EN #12

Open acornestean opened 2 years ago

acornestean commented 2 years ago

[Affected versions]: Firefox Nightly (95.0a1/20211005215418)

[Affected Platforms]: Windows 10 x64

[Prerequisites]: Access https://mozilla.github.io/translate/.

[Steps to reproduce]:

  1. Access https://de.wikipedia.org/wiki/Wikipedia:Hauptseite and copy part of the available “Artikel des Tages” text. (The one I used was an article about the Riparia Bridge, and I copied the following portion: “Die Riparia Bridge war eine eingleisige Eisenbahnbrücke über den Snake River zwischen dem Whitman County und dem Columbia County im Südosten des Bundesstaates Washington. Die von George S. Morison entworfene Fachwerkbrücke war eine der ersten Stahlbrücken in den USA. Sie wurde bis 1889 von der Oregon Railroad and Navigation Company (OR&N) errichtet”.)
  2. Paste the text in the “From” field on the translation website, which was previously configured for translating DE to EN. Wait for the text to be translated.
  3. Notice the translation errors, e.g. untranslated words, extra characters added to some words, etc.
  4. On the “Artikel des Tages” section, click on the “Zum Artikel...” link at the end to go to the full article page.
  5. On the article page, notice that the initial portion of the article is the same as the one available in the “Artikel des Tages” section on the previous page.
  6. Copy the same portion of the article as in the “Artikel des Tages” section and paste it in the “From” field of the translation website. Wait for the text to be translated.
  7. Notice that the text is correctly translated, with no errors in translation, no untranslated words or added characters.

[Expected]: Since I’m translating the same text but from different sources, the translation results should be the same as well.

[Actual]: Different translation results when translating the same text from different sources.

Note: Pasting both texts into a word processor via Special Paste (CTRL+SHIFT+V) → “Unformatted text” reveals some differences between the 2 seemingly identical texts. See attached screenshot: 2021-10-06_12h01_09

Video of reproduction steps: https://user-images.githubusercontent.com/50236075/136172742-a68f03e3-9bb0-46ec-8557-7df348039f77.mp4

kpu commented 2 years ago

We should definitely isolate quality issues here. It may also be the case that there are character encoding issues at work here; can you dump raw UTF-8 bytes going into the translator for both scenarios?
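
For anyone wanting to produce such a dump, here is a minimal Python sketch. It assumes the two pasted texts are available as Python strings; the helper names `dump_text` and `first_difference` are made up for illustration, and the no-break space in the example is a hypothetical difference, not one confirmed in this issue.

```python
import unicodedata
from typing import List, Optional


def dump_text(text: str) -> List[str]:
    """Return one line per code point: code point, UTF-8 bytes (hex), Unicode name."""
    lines = []
    for ch in text:
        utf8_hex = ch.encode("utf-8").hex(" ")
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {utf8_hex:<11}  {name}")
    return lines


def first_difference(a: str, b: str) -> Optional[int]:
    """Index of the first differing code point, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    # Hypothetical inputs: same visible text, but one copy uses a no-break space.
    from_main_page = "Snake\u00a0River"
    from_article = "Snake River"
    idx = first_difference(from_main_page, from_article)
    if idx is not None:
        print("first difference at code point index", idx)
        print(dump_text(from_main_page[idx]))
        print(dump_text(from_article[idx]))
```

Running this on both copies of the paragraph should show whether the inputs differ in invisible characters (no-break spaces, soft hyphens, zero-width characters) before they ever reach the translator.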

The system is non-deterministic. Behind the scenes, sentences are grouped into batches, with no guarantee that the same batches will be built each time. One source of variation is that quantization is applied to the whole activation tensor for the batch, using the range of values in that batch. That happens in the older code @abhi-agg is using but shouldn't happen with intgemmShiftAlphasAll (which we have asked him to pull). Another source is that the lexical shortlist takes the union over the sentences in the batch, so a sentence can choose from more options depending on what happens to be in the shortlist for the other sentences in the batch. We're not controlling for that. We should raise the shortlist size, and there's a long-term plan to change to nearest neighbor.
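
To make the two batch effects concrete, here is a toy Python sketch. It is not Bergamot, Marian, or intgemm code: `quantize_batch` and `batch_shortlist` are hypothetical helpers, the activation values and candidate words are invented, and the symmetric 8-bit scheme is only an assumption chosen to illustrate the idea.

```python
def quantize_batch(batch, bits=8):
    """Quantize and dequantize a batch using ONE scale for the whole batch.

    The scale comes from the largest absolute value anywhere in the batch,
    so every sentence's rounding depends on its batch mates.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for row in batch for v in row)
    scale = max_abs / qmax if max_abs else 1.0
    return [[round(v / scale) * scale for v in row] for row in batch]


def batch_shortlist(per_sentence_candidates):
    """Union of candidate target words over all sentences in the batch."""
    shortlist = set()
    for candidates in per_sentence_candidates:
        shortlist |= set(candidates)
    return shortlist


if __name__ == "__main__":
    # Invented activations for one sentence, batched with two different neighbours.
    sentence = [0.10, -0.25, 0.40]
    with_small = quantize_batch([sentence, [0.30, 0.05, -0.35]])[0]
    with_large = quantize_batch([sentence, [6.0, -3.0, 1.5]])[0]
    print("same sentence, small-range batch:", with_small)
    print("same sentence, large-range batch:", with_large)

    # Invented shortlist candidates: a neighbour adds words this sentence can now pick.
    alone = batch_shortlist([["bridge", "railway"]])
    batched = batch_shortlist([["bridge", "railway"], ["river", "county"]])
    print("extra options from batch mates:", sorted(batched - alone))
```

In this toy model the same sentence gets different quantized activations and a different output vocabulary depending only on what it is batched with, which is the kind of variation described above.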

In any case, determinism has been sacrificed for efficiency and ease of implementation. I am comfortable with that decision.