Closed pmachapman closed 3 months ago
Most of the "strangely" translated segments are single words that are all-caps. I'm guessing that NLLB just doesn't do a very good job of translating these kinds of segments. Short, all-caps sentences were probably filtered out of the training corpus for NLLB. In any case, we should still verify that the extracted source segments are correct.
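To get a rough count of how many segments fall into this category, a quick filter along the following lines could help. This is only a sketch: the function name `is_short_all_caps` and the three-word cutoff are assumptions made for illustration, not anything taken from Serval or NLLB.

```python
def is_short_all_caps(segment: str, max_words: int = 3) -> bool:
    """Return True for short segments whose letters are all uppercase.

    The max_words cutoff is an arbitrary assumption for illustration.
    """
    words = segment.split()
    letters = [ch for ch in segment if ch.isalpha()]
    # Require at least one letter so that bare numbers are not flagged.
    return (
        bool(letters)
        and len(words) <= max_words
        and all(ch.isupper() for ch in letters)
    )


# Sample segments taken from the 2PE excerpt in this thread.
segments = [
    "SAN PEDRO",
    "Segunda carta de SAN PEDRO",
    "INTRODUCCIÓN",
    "Saludo",
]
flagged = [s for s in segments if is_short_all_caps(s)]
print(flagged)
```

Running the same filter over the segments that were translated "strangely" would show whether the all-caps explanation covers most of them.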
What's yet to be done here, @ddaspit? Just pull down the build data and take a look, and then also check the extracts on the bucket?
@Enkidu93 We want to verify that the source segments for the pretranslations are correct. We can do this by checking the pretranslations JSON file.
Here is the original:
\id 2PE Spanish: Dios Habla Hoy DC Estándar 1994 [América Latina]
\ide UTF-8
\rem Copyright Information: For any non-Paratext use of this text, permission must be obtained from the copyright holder.
\rem CAP Information: checked/corrected AT, GZ, EC, 12.12.2008
\h 2 SAN PEDRO
\toc1 Segunda carta de san Pedro
\toc2 2 Pedro
\toc3 2~P
\mt2 Segunda carta de
\mt1 SAN PEDRO
\imt1 Segunda carta de SAN PEDRO
\imt2 INTRODUCCIÓN
\ip La \bk Segunda carta de San Pedro\bk* (2~P) es una advertencia bastante severa a los cristianos para ponerlos en guardia contra ciertas doctrinas extrañas y prácticas reprobables que se habían introducido en algunas iglesias. La carta no menciona, sin embargo, ninguna comunidad cristiana en particular.
\ip El capítulo 2 de esta carta presenta un paralelismo muy grande de ideas y expresiones con la \bk Carta de Judas\bk*, que probablemente es anterior a \bk 2~Pedro\bk*. En cambio, no se encuentra una semejanza notable en lenguaje y doctrina con la \bk Primera carta de Pedro\bk*.
\ip El esquema de la carta es sencillo:
\ib
\io1 Saludo \ior (1.1-2)\ior*
\io1 El llamamiento de Dios y sus exigencias \ior (1.3-11)\ior*
\io1 Autoridad de las enseñanzas \ior (1.12-21)\ior*
\ib
\io1 Los falsos maestros \ior (2)\ior*
\ib
\io1 La segunda venida del Señor \ior (3.1-16)\ior*
\io1 Conclusión \ior (3.17-18)\ior*
\ie
\c 1
\s1 Saludo
What gives?
And here is the json file:
{
  "corpusId": "66a2bbd8df779575e75756b9",
  "textId": "2PE",
  "refs": [
    "2PE 1:0/9:mt1"
  ],
  "translation": "SAN PEDRO"
},
{
  "corpusId": "66a2bbd8df779575e75756b9",
  "textId": "2PE",
  "refs": [
    "2PE 1:0/10:imt1"
  ],
  "translation": "Segunda carta de SAN PEDRO"
},
{
  "corpusId": "66a2bbd8df779575e75756b9",
  "textId": "2PE",
  "refs": [
    "2PE 1:0/11:imt2"
  ],
  "translation": "INTRODUCCI\u00D3N"
},
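The comparison between the JSON records and the original USFM can also be done mechanically. The sketch below rests on two assumptions that are not confirmed anywhere in this thread: that the part of each ref after the last colon (for example `mt1`) is the USFM marker, and that the `translation` field holds the text to compare against the USFM line, as it does in the excerpt. The helper names are hypothetical.

```python
import json

# A few lines from the original USFM quoted above (raw string keeps backslashes).
USFM_SOURCE = r"""\mt2 Segunda carta de
\mt1 SAN PEDRO
\imt1 Segunda carta de SAN PEDRO
\imt2 INTRODUCCIÓN
"""

# The three records from the JSON excerpt, wrapped in a list so they parse.
RECORDS_JSON = r"""[
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/9:mt1"], "translation": "SAN PEDRO"},
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/10:imt1"], "translation": "Segunda carta de SAN PEDRO"},
  {"corpusId": "66a2bbd8df779575e75756b9", "textId": "2PE",
   "refs": ["2PE 1:0/11:imt2"], "translation": "INTRODUCCI\u00D3N"}
]"""


def usfm_text_by_marker(usfm):
    """Map each USFM marker to the list of texts that follow it."""
    texts = {}
    for line in usfm.splitlines():
        if not line.startswith("\\"):
            continue
        marker, _, text = line[1:].partition(" ")
        texts.setdefault(marker, []).append(text.strip())
    return texts


def unmatched_refs(records, usfm):
    """Return refs whose record text does not appear under the ref's marker."""
    by_marker = usfm_text_by_marker(usfm)
    missing = []
    for record in records:
        for ref in record["refs"]:
            # Assumption: the marker is whatever follows the final colon.
            marker = ref.rsplit(":", 1)[-1]
            if record["translation"] not in by_marker.get(marker, []):
                missing.append(ref)
    return missing


records = json.loads(RECORDS_JSON)
print(unmatched_refs(records, USFM_SOURCE))
```

An empty list means every record matched a line in the USFM with the same marker and text.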
It really appears to be the capitalization along with the preexisting NLLB content that somehow got mixed up. I'm sorry, I'm sorry, I'm sorry, I'm sorry, I'll see you later.
I don't know if there is any action to do other than warn the user about it.
The extracted segments look correct, so I think this is just an artifact of NLLB. This is probably more likely to happen if no training data is specified and the model isn't fine-tuned. I'm going to close the issue.
When using the DHH94 as a source, and an English translation as the target, the following translation engine on QA:
is producing very strange translations when retrieving the drafts for the following build:
(the build did run very quickly!)
For example, https://qa.serval-api.org/api/v1/translation/engines/66a2bbd1df779575e75756b6/corpora/66a2bbd8df779575e75756b9/pretranslations/2PE/usfm?text-origin=OnlyPretranslated&template=Source returns:
Setting the template to Target gives just the Scripture text, which is much more accurate.
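To compare the two templates side by side, the request URLs can be built from the IDs above. This sketch only constructs the URLs and does not send any request. The function name `pretranslation_usfm_url` is made up for illustration, and actually fetching the drafts would presumably also need credentials, which are not shown here.

```python
from urllib.parse import urlencode

BASE_URL = "https://qa.serval-api.org/api/v1"


def pretranslation_usfm_url(engine_id, corpus_id, text_id, template,
                            text_origin="OnlyPretranslated"):
    """Build the USFM pretranslation URL for a given template value."""
    query = urlencode({"text-origin": text_origin, "template": template})
    return (
        f"{BASE_URL}/translation/engines/{engine_id}"
        f"/corpora/{corpus_id}/pretranslations/{text_id}/usfm?{query}"
    )


ENGINE_ID = "66a2bbd1df779575e75756b6"
CORPUS_ID = "66a2bbd8df779575e75756b9"

source_url = pretranslation_usfm_url(ENGINE_ID, CORPUS_ID, "2PE", "Source")
target_url = pretranslation_usfm_url(ENGINE_ID, CORPUS_ID, "2PE", "Target")
print(source_url)
print(target_url)
```

Diffing the two responses would show which segments appear only when the source is used as the template.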
@johnml1135 I have emailed you the source and target zip files.