sillsdev / serval

A REST API for natural language processing services
MIT License

Very unusual translations for Spanish to English in non-scripture material #443

Closed pmachapman closed 1 month ago

pmachapman commented 1 month ago

When using DHH94 as the source and an English translation as the target, the following translation engine on QA:

{
  "id": "66a2bbd1df779575e75756b6",
  "url": "/api/v1/translation/engines/66a2bbd1df779575e75756b6",
  "name": "66a2ba762b38960066aac33b",
  "sourceLanguage": "es",
  "targetLanguage": "en",
  "type": "nmt",
  "isModelPersisted": false,
  "isBuilding": false,
  "modelRevision": 1,
  "confidence": 0,
  "corpusSize": 0
}

is producing very strange translations when retrieving the drafts for the following build:

[
  {
    "id": "66a2bbd8df779575e75756ba",
    "url": "/api/v1/translation/engines/66a2bbd1df779575e75756b6/builds/66a2bbd8df779575e75756ba",
    "revision": 15,
    "engine": {
      "id": "66a2bbd1df779575e75756b6",
      "url": "/api/v1/translation/engines/66a2bbd1df779575e75756b6"
    },
    "trainOn": [
      {
        "corpus": {
          "id": "66a2bbd8df779575e75756b9",
          "url": "/api/v1/translation/engines/66a2bbd1df779575e75756b6/corpora/66a2bbd8df779575e75756b9"
        },
        "textIds": []
      }
    ],
    "pretranslate": [
      {
        "corpus": {
          "id": "66a2bbd8df779575e75756b9",
          "url": "/api/v1/translation/engines/66a2bbd1df779575e75756b6/corpora/66a2bbd8df779575e75756b9"
        },
        "scriptureRange": "GEN;EXO;LEV;NUM;NEH;JOB;1TI;2TI;TIT;JAS;1PE;2PE;1JN;2JN;3JN;JUD"
      }
    ],
    "step": 0,
    "percentCompleted": 1,
    "message": "Completed",
    "queueDepth": 0,
    "state": "Completed",
    "dateFinished": "2024-07-25T21:12:37.927Z",
    "options": {}
  }
]

(The build did run very quickly!)

For example, https://qa.serval-api.org/api/v1/translation/engines/66a2bbd1df779575e75756b6/corpora/66a2bbd8df779575e75756b9/pretranslations/2PE/usfm?text-origin=OnlyPretranslated&template=Source returns:

\id 2PE - 2024 Project Dev
\ide UTF-8
\rem Copyright Information: For any non-Paratext use of this text, permission must be obtained from the copyright holder.
\rem CAP Information: checked/corrected by AT, GZ, EC, 12.12.2008
\h 2 of the S.P.R.O.
\toc1 Second Letter of St. Peter
\toc2 2 Peter
\toc3 2 P
\mt2 Second letter from
\mt1 - I 'm sorry , I 'm sorry .
\imt1 Second letter of St. Peter
\imt2 This is the first time that the European Parliament has taken a position on this issue.
\ip The Second Letter of St. Peter (2 P) is a rather stern warning to Christians to warn them against certain strange doctrines and reprehensible practices that had been introduced into some churches.
\ip Chapter 2 of this letter presents a very large parallel of ideas and expressions with the Epistle of Jude, which probably predates 2 Peter.
\ip The scheme of the letter is simple:
\ib
\io1 Greetings (1.1-2)
\io1 God's Calling and His Requirements, 3-11
\io1 Authoritative teaching, 12-21
\ib
\io1 The False Teachers (2)
\ib
\io1 The second coming of the Lord (3:1-16)
\io1 The Commission is proposing to extend the deadline for tabling amendments to this Regulation.
\ie
\c 1
\s1 I 'll see you later .
\p
\v 1 Simon Peter, a servant and apostle of Jesus Christ, To those who through the righteousness of our God and Savior Jesus Christ have received a faith as precious as ours:
\v 2 Grace and peace be yours in abundance through the knowledge of God and of Jesus our Lord.

Setting the template to Target gives just the Scripture text, which is much more accurate.
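To compare the two templates side by side, the request URLs can be built from the same parts. This is only a sketch: the helper name `pretranslation_usfm_url` is hypothetical, the path and query parameters are copied from the URL quoted above, and it neither authenticates nor sends a request.

```python
from urllib.parse import urlencode

# Base path copied from the URL quoted in this issue.
BASE = "https://qa.serval-api.org/api/v1/translation/engines"


def pretranslation_usfm_url(engine_id, corpus_id, book, template,
                            text_origin="OnlyPretranslated"):
    """Build the pretranslation USFM URL for one book (hypothetical helper)."""
    query = urlencode({"text-origin": text_origin, "template": template})
    return (f"{BASE}/{engine_id}/corpora/{corpus_id}"
            f"/pretranslations/{book}/usfm?{query}")


engine_id = "66a2bbd1df779575e75756b6"
corpus_id = "66a2bbd8df779575e75756b9"

# The two template values mentioned in this issue.
source_url = pretranslation_usfm_url(engine_id, corpus_id, "2PE", "Source")
target_url = pretranslation_usfm_url(engine_id, corpus_id, "2PE", "Target")
print(source_url)
print(target_url)
```

Fetching both URLs with an authenticated client and diffing the results should show which lines only appear when the source USFM is used as the template.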

@johnml1135 I have emailed you the source and target zip files.

ddaspit commented 1 month ago

Most of the "strangely" translated segments are single words that are all-caps. I'm guessing that NLLB just doesn't do a very good job of translating these kinds of segments. Short, all-caps sentences were probably filtered out of the training corpus for NLLB. In any case, we should still verify that the extracted source segments are correct.
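A rough way to spot the segments in question is a heuristic like the one below. It is a sketch, not anything Serval does: the function name `is_short_all_caps` and the three-word cutoff are assumptions chosen for illustration, and the sample segments are taken from the DHH94 source quoted in this issue.

```python
def is_short_all_caps(segment: str, max_words: int = 3) -> bool:
    """Return True for short segments whose letters are all upper case.

    The max_words cutoff is an arbitrary assumption, not a Serval setting.
    """
    words = segment.split()
    if not words or len(words) > max_words:
        return False
    letters = [ch for ch in segment if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


# Segments taken from the 2PE source text in this issue.
segments = ["SAN PEDRO", "INTRODUCCIÓN", "Segunda carta de SAN PEDRO", "Saludo"]
flagged = [s for s in segments if is_short_all_caps(s)]
print(flagged)
```

Note that "Saludo" is not flagged even though it was also translated oddly, so a check like this would only cover part of the problem.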

Enkidu93 commented 1 month ago

What's left to be done here, @ddaspit? Just pull down the build data and take a look, and then also check the extracts on the bucket?

ddaspit commented 1 month ago

@Enkidu93 We want to verify that the source segments for the pretranslations are correct. We can do this by checking the pretranslations JSON file.

johnml1135 commented 1 month ago

Here is the original:

\id 2PE Spanish: Dios Habla Hoy DC Estándar 1994 [América Latina]
\ide UTF-8
\rem Copyright Information: For any non-Paratext use of this text, permission must be obtained from the copyright holder.
\rem CAP Information: checked/corrected AT, GZ, EC, 12.12.2008
\h 2 SAN PEDRO
\toc1 Segunda carta de san Pedro
\toc2 2 Pedro
\toc3 2~P
\mt2 Segunda carta de
\mt1 SAN PEDRO
\imt1 Segunda carta de SAN PEDRO
\imt2 INTRODUCCIÓN
\ip La \bk Segunda carta de San Pedro\bk* (2~P) es una advertencia bastante severa a los cristianos para ponerlos en guardia contra ciertas doctrinas extrañas y prácticas reprobables que se habían introducido en algunas iglesias. La carta no menciona, sin embargo, ninguna comunidad cristiana en particular.
\ip El capítulo 2 de esta carta presenta un paralelismo muy grande de ideas y expresiones con la \bk Carta de Judas\bk*, que probablemente es anterior a \bk 2~Pedro\bk*. En cambio, no se encuentra una semejanza notable en lenguaje y doctrina con la \bk Primera carta de Pedro\bk*.
\ip El esquema de la carta es sencillo:
\ib
\io1 Saludo \ior (1.1-2)\ior*
\io1 El llamamiento de Dios y sus exigencias \ior (1.3-11)\ior*
\io1 Autoridad de las enseñanzas \ior (1.12-21)\ior*
\ib
\io1 Los falsos maestros \ior (2)\ior*
\ib
\io1 La segunda venida del Señor \ior (3.1-16)\ior*
\io1 Conclusión \ior (3.17-18)\ior*
\ie
\c 1
\s1 Saludo

What gives?

johnml1135 commented 1 month ago

And here is the json file:

  {
    "corpusId": "66a2bbd8df779575e75756b9",
    "textId": "2PE",
    "refs": [
      "2PE 1:0/9:mt1"
    ],
    "translation": "SAN PEDRO"
  },
  {
    "corpusId": "66a2bbd8df779575e75756b9",
    "textId": "2PE",
    "refs": [
      "2PE 1:0/10:imt1"
    ],
    "translation": "Segunda carta de SAN PEDRO"
  },
  {
    "corpusId": "66a2bbd8df779575e75756b9",
    "textId": "2PE",
    "refs": [
      "2PE 1:0/11:imt2"
    ],
    "translation": "INTRODUCCI\u00D3N"
  },

It really appears to be the capitalization along with the preexisting NLLB content that somehow got mixed up. I'm sorry, I'm sorry, I'm sorry, I'm sorry, I'll see you later.
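For checking more than a handful of entries, the refs can be parsed and printed next to their markers so they line up with the original USFM. This is a sketch under assumptions: the ref shape `BOOK chapter:verse/position:marker` is inferred only from the three entries above, and `non_verse_segments` is a hypothetical helper.

```python
import json
import re

# Ref shape inferred from the three entries above (an assumption):
# "2PE 1:0/9:mt1" -> book, chapter, verse, position, marker.
REF_PATTERN = re.compile(
    r"^(?P<book>\w{3}) (?P<chapter>\d+):(?P<verse>\d+)"
    r"(?:/(?P<position>\d+):(?P<marker>\w+))?$"
)

# The three entries quoted above, wrapped in a JSON array.
RAW = r'''
[
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/9:mt1"], "translation": "SAN PEDRO"},
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/10:imt1"], "translation": "Segunda carta de SAN PEDRO"},
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/11:imt2"], "translation": "INTRODUCCI\u00D3N"}
]
'''


def non_verse_segments(entries):
    """Return (marker, text) pairs for refs that carry a USFM marker."""
    rows = []
    for entry in entries:
        for ref in entry["refs"]:
            match = REF_PATTERN.match(ref)
            if match and match["marker"]:
                rows.append((match["marker"], entry["translation"]))
    return rows


rows = non_verse_segments(json.loads(RAW))
for marker, text in rows:
    print(f"\\{marker} {text}")
```

The printed lines can be compared directly with the `\mt1`, `\imt1`, and `\imt2` lines of the original file.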

I don't know if there is any action to take other than warning the user about it.

ddaspit commented 1 month ago

The extracted segments look correct, so I think this is just an artifact of NLLB. This is probably more likely to happen if no training data is specified and the model isn't fine-tuned. I'm going to close the issue.
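A client could check for this case before requesting drafts. The sketch below is hypothetical: `has_training_data` is not part of Serval, and it assumes that a `trainOn` entry with an empty or missing `textIds` and no `scriptureRange` means nothing was selected for training, which may not match how Serval treats an omitted field.

```python
def has_training_data(build: dict) -> bool:
    """Return True if any trainOn entry selects texts or a scripture range.

    Assumption: empty or missing textIds with no scriptureRange means
    no training data was specified for that corpus.
    """
    for item in build.get("trainOn", []):
        if item.get("textIds") or item.get("scriptureRange"):
            return True
    return False


# Trimmed copy of the build from this issue: trainOn has an empty textIds list.
build = {
    "trainOn": [
        {"corpus": {"id": "66a2bbd8df779575e75756b9"}, "textIds": []}
    ],
    "pretranslate": [
        {"corpus": {"id": "66a2bbd8df779575e75756b9"},
         "scriptureRange": "GEN;EXO;LEV;NUM;NEH;JOB;1TI;2TI;TIT;JAS;1PE;2PE;1JN;2JN;3JN;JUD"}
    ],
}

if not has_training_data(build):
    print("Warning: no training data specified; drafts come from the base model.")
```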