Open acornestean opened 2 years ago
We should definitely isolate quality issues here. It may also be the case that there are character encoding issues at work here; can you dump raw UTF-8 bytes going into the translator for both scenarios?
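For anyone trying to capture the requested byte dump, here is a minimal sketch in Python. It assumes the two input texts have been copied out as strings; `plain` and `pasted` are hypothetical stand-ins (a regular space versus a no-break space), not the actual texts from this report, and the helper names are made up for illustration.

```python
import unicodedata


def dump_utf8(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return text.encode("utf-8").hex(" ")


def first_difference(a: str, b: str):
    """Return (index, char_a, char_b) at the first differing code point, or None."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One text is a prefix of the other; the shorter side yields "".
        i = min(len(a), len(b))
        return i, a[i:i + 1], b[i:i + 1]
    return None


def describe(ch: str) -> str:
    """Name a single code point, e.g. 'U+00A0 NO-BREAK SPACE'."""
    if not ch:
        return "<end of text>"
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# Hypothetical inputs: visually identical, but the second uses a no-break space.
plain = "Hello world"
pasted = "Hello\u00a0world"

if __name__ == "__main__":
    print(dump_utf8(plain))
    print(dump_utf8(pasted))
    diff = first_difference(plain, pasted)
    if diff is not None:
        index, ca, cb = diff
        print(f"first difference at code point {index}: {describe(ca)} vs {describe(cb)}")
```

If the two dumps differ, the inputs reaching the translator are not the same text, and that should be ruled out before looking at translation quality.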
The system is non-deterministic. Behind the scenes, batches are built with no guarantee that the same batches will be built each time. One source is that quantization happens on the whole activation tensor for the batch, using the range of values in the batch. That happens in the older stuff @abhi-agg is using but shouldn't happen with intgemmShiftAlphasAll (which we have asked him to pull). Another source is that the lexical shortlist takes the union over sentences in the batch, so a sentence can choose from more options depending on what happens to be in the shortlist for other sentences in the batch. We're not controlling for that. We should be raising the shortlist size, and there's a long-term plan to change to nearest neighbor.
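To illustrate the two batch effects described above, here is a toy sketch in Python. It is not the translator's actual code: the symmetric max-abs scaling, the function names, and all the numbers and words are assumptions chosen only to show how one sentence can be treated differently depending on what else lands in its batch.

```python
def quantize_batch(batch, levels=127):
    """Quantize every row with ONE scale taken from the whole batch (assumed scheme)."""
    max_abs = max(abs(v) for row in batch for v in row)
    scale = max_abs / levels if max_abs > 0 else 1.0
    quantized = [[round(v / scale) for v in row] for row in batch]
    return quantized, scale


def dequantize(row, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in row]


def batch_shortlist(per_sentence_candidates):
    """Union of candidate target words over all sentences in the batch."""
    shortlist = set()
    for candidates in per_sentence_candidates:
        shortlist |= set(candidates)
    return shortlist


# Hypothetical activations for one sentence, batched with two different neighbors.
sentence = [0.10, -0.25, 0.40]
batch_a = [sentence, [0.50, 0.30, -0.20]]   # neighbor with a small value range
batch_b = [sentence, [8.00, -3.00, 1.00]]   # neighbor with a large value range

quant_a, scale_a = quantize_batch(batch_a)
quant_b, scale_b = quantize_batch(batch_b)
seen_in_a = dequantize(quant_a[0], scale_a)
seen_in_b = dequantize(quant_b[0], scale_b)

# Hypothetical shortlist candidates for the same sentence and two neighbors.
own_candidates = {"cat", "sits"}
shortlist_a = batch_shortlist([own_candidates, {"dog", "runs"}])
shortlist_b = batch_shortlist([own_candidates, {"bank", "river"}])

if __name__ == "__main__":
    print("sentence as seen in batch A:", seen_in_a)
    print("sentence as seen in batch B:", seen_in_b)
    print("shortlist in batch A:", sorted(shortlist_a))
    print("shortlist in batch B:", sorted(shortlist_b))
```

In this sketch the same sentence is reconstructed with coarser precision when its neighbor has a wide value range, and its set of candidate output words changes with the neighbor's shortlist. Either effect is enough for identical input to produce different output across runs.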
In any case, determinism has been sacrificed for efficiency and ease of implementation. I am comfortable with that decision.
[Affected versions]: Firefox Nightly (95.0a1/20211005215418)
[Affected Platforms]: Windows 10 x64
[Prerequisites]: Access https://mozilla.github.io/translate/.
[Steps to reproduce]:
[Expected]: Since I’m translating the same text but from different sources, the translation results should be the same as well.
[Actual]: Different translation results when translating the same text from different sources.
Note: Pasting both texts into a word editor via Special Paste (CTRL+SHIFT+V) → “Unformatted text” reveals some differences between the 2 seemingly identical texts. See attached screenshot:
Video of reproduction steps: https://user-images.githubusercontent.com/50236075/136172742-a68f03e3-9bb0-46ec-8557-7df348039f77.mp4