sillsdev / serval

A REST API for natural language processing services
MIT License

USFM for verse numbers with letters missing text #411

Closed pmachapman closed 4 months ago

pmachapman commented 4 months ago

If the source and target have verse numbers with letters, preceded by a verse with the same number, e.g.

\v 2
\v 2a
\v 2b

Then pre-translations are not generated for verses 2a and 2b.

See the attached Source, Target, and Generated USFM files: Verse Letters USFM.zip

The Brenton Septuagint uses a numbering system like this for verses with letters, particularly in 1 Kings. Other translations likely do too.

A translation engine on QA with this configuration is:

{
  "id": "6657c6cb593d597a09de84f6",
  "url": "/api/v1/translation/engines/6657c6cb593d597a09de84f6",
  "name": "6638216b5749b8b1fd9d184b",
  "sourceLanguage": "mi",
  "targetLanguage": "en",
  "type": "nmt",
  "isModelPersisted": false,
  "isBuilding": false,
  "modelRevision": 3,
  "confidence": 0,
  "corpusSize": 0
}

ddaspit commented 4 months ago

Because all three segments belong to the same verse, they are concatenated together and then translated. Generally, this gives better results. When the new USFM is generated, the full translation is inserted into the first segment of the verse. All of the text has been translated; we just haven't split the translation back up into individual segments. This issue affects other cases as well. At some point, we hope to add better support for this. In the meantime, we could add an exception for this specific case, but it might not give the desired results on other projects.
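For readers following along, the behavior described above can be sketched roughly as follows. This is an illustrative Python sketch, not Serval's actual code: the function names (`base_verse`, `merge_verse_parts`, `place_translation`) and the `(label, text)` tuple layout are assumptions made for the example.

```python
import re
from itertools import groupby


def base_verse(label: str) -> str:
    """Strip a trailing letter from a verse label: '2a' -> '2' (hypothetical helper)."""
    match = re.match(r"\d+", label)
    return match.group(0) if match else label


def merge_verse_parts(segments):
    """Concatenate consecutive segments that share a base verse number.

    `segments` is a list of (label, text) tuples in document order.
    """
    merged = []
    for verse, group in groupby(segments, key=lambda seg: base_verse(seg[0])):
        text = " ".join(part for _, part in group if part)
        merged.append((verse, text))
    return merged


def place_translation(labels, translations):
    """Put each merged translation on the first segment of its verse.

    Later segments of the same verse (e.g. 2a, 2b) are left empty, which is
    the behavior reported in this issue.
    """
    placed = []
    seen = set()
    for label in labels:
        verse = base_verse(label)
        if verse in seen:
            placed.append((label, ""))
        else:
            placed.append((label, translations.get(verse, "")))
            seen.add(verse)
    return placed
```

With labels `2`, `2a`, `2b`, the three source texts are joined into one input for verse 2, and the single translation lands on `\v 2` while `\v 2a` and `\v 2b` stay empty.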

johnml1135 commented 4 months ago

@ddaspit, this functionality (combining the 3 verses) was broken and had no test to check it. In the fix, I made the functionality work and added an explicit test to ensure that it does not break again. Specifically, in AdvanceRows, when compare == 0 the source should not be incremented, because multiple rows may match the same source. That is what happened here.
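The point about not advancing the source on a match can be illustrated with a generic merge-join over two sorted row lists. This is only a sketch of the idea under assumed data shapes (integer verse keys, `(key, text)` tuples); `align_rows` is a hypothetical name and this is not the actual AdvanceRows implementation.

```python
def align_rows(source_rows, target_rows):
    """Pair rows from two lists sorted by key (hypothetical sketch).

    When the keys are equal, only the target index advances, so several
    target rows (e.g. verse parts 2, 2a, 2b that all map to verse 2) can
    match the same source row.
    """
    pairs = []
    i = j = 0
    while i < len(source_rows) and j < len(target_rows):
        src_key, src_text = source_rows[i]
        tgt_key, tgt_text = target_rows[j]
        if src_key < tgt_key:
            i += 1  # source is behind: skip ahead
        elif src_key > tgt_key:
            j += 1  # target is behind: skip ahead
        else:
            pairs.append((src_key, src_text, tgt_text))
            j += 1  # compare == 0: keep the source row in place
    return pairs
```

If the source index were also incremented on a match, the second and third target rows for verse 2 would be compared against verse 3 and silently dropped.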

ddaspit commented 4 months ago

If you look at the data that Peter uploaded, you will see that it is actually working correctly, because the three verse segments are combined into a single segment for verse 2 when generating pretranslations.

johnml1135 commented 4 months ago

This behavior is as expected. Verse parts are always merged for the sake of translation quality (this was researched and found to give better results). If there are verse ranges in the source or target, the pretranslations for them are also merged.