sillsdev / serval

A REST API for natural language processing services
MIT License

Arabic Pre-Translations are being generated in Latin script #199

Closed: pmachapman closed this issue 1 year ago

pmachapman commented 1 year ago

When we generate pre-translations with English as the source and no data in the target text, the Arabic pre-translations look like this:

  {
    "textId": "40_1",
    "refs": [
      "40_1:verse_001_014"
    ],
    "translation": "arbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarbarb"
  },
  {
    "textId": "40_1",
    "refs": [
      "40_1:verse_001_015"
    ],
    "translation": "arbari , Eliud , Eleazar , Matthan , Jacob ,"
  },
  {
    "textId": "40_1",
    "refs": [
      "40_1:verse_001_016"
    ],
    "translation": "arboto , Jacob , Joseph , marié de Marie , de qui est né Jésus , qui est appelé Christ ."
  },
  {
    "textId": "40_1",
    "refs": [
      "40_1:verse_001_017"
    ],
    "translation": "arboboooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo"
  },
  {
    "textId": "40_1",
    "refs": [
      "40_1:verse_001_018"
    ],
    "translation": "arbari: Maria , sa mama , era fiancée à Joseph , avant qu' ils se rencontrassent , elle se trouva enceinte du Saint-Esprit ."
  },

I don't think this is a QA issue, since Kannada is generated in Kannada script.
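One quick way to confirm which script a pre-translation came back in is to count letters by the first word of their Unicode character name. This is only a rough sketch using Python's standard `unicodedata` module; the helper name `dominant_script` is hypothetical and is not part of Serval.

```python
import unicodedata


def dominant_script(text):
    """Return the most common script label among the letters in text.

    The label is the first word of each letter's Unicode name
    (e.g. "LATIN", "ARABIC", "KANNADA"), which is a heuristic,
    not a full Unicode Script property lookup.
    """
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits, and punctuation
        name = unicodedata.name(ch, "")
        label = name.split(" ")[0] if name else "UNKNOWN"
        counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


# The verse 15 output from the excerpt above is made of Latin letters.
latin_result = dominant_script("arbari , Eliud , Eleazar , Matthan , Jacob ,")
# An Arabic-script word, for comparison.
arabic_result = dominant_script("كتاب")
```

Running this over each `translation` value would flag the verses above as Latin rather than Arabic.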

I have attached sample source and target files (en-arb.zip), and details of the test translation engine are:

{
  "id": "6539732e222f10274a961615",
  "url": "/api/v1/translation/engines/6539732e222f10274a961615",
  "name": "653970f27a164a5ff23c9d91",
  "sourceLanguage": "en",
  "targetLanguage": "arb",
  "type": "Nmt",
  "isBuilding": false,
  "modelRevision": 2,
  "confidence": 0,
  "corpusSize": 0
}

ddaspit commented 1 year ago

This occurs because the arb code does not have a default script. Clearly, IetfLanguageTag.TryGetSubtags is not sufficient to get the default script code for a language tag, so we need to write our own mechanism for finding it. The SLDR langtags.json file should contain sufficient information to look up the default script. We will need to parse it directly.
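A minimal sketch of that lookup, in Python for illustration only (Serval itself is not written in Python, and this is not the fix that was implemented). It assumes langtags.json is a JSON array of objects with `tag`, `full`, `script`, and `tags` fields; those field names and the sample entries below are assumptions made for this example, not copied from the real file. The function names are hypothetical.

```python
import json

# Hypothetical sample in the assumed langtags.json shape. The first entry
# stands in for a metadata record that carries no "script" field.
SAMPLE_LANGTAGS = json.loads("""
[
  {"tag": "_version", "api": "0.0.0"},
  {"tag": "en", "full": "en-Latn-US", "script": "Latn",
   "tags": ["en-Latn", "en-US"]},
  {"tag": "arb", "full": "arb-Arab-SA", "script": "Arab",
   "tags": ["arb-Arab", "arb-SA"]},
  {"tag": "kn", "full": "kn-Knda-IN", "script": "Knda",
   "tags": ["kn-IN", "kn-Knda"]}
]
""")


def build_default_script_index(entries):
    """Map every known spelling of a tag (lowercased) to its script code."""
    index = {}
    for entry in entries:
        script = entry.get("script")
        if not script:
            continue  # skip records that do not describe a language
        keys = [entry.get("tag"), entry.get("full"), *entry.get("tags", [])]
        for key in keys:
            if key:
                # First entry wins if two records share a spelling.
                index.setdefault(key.lower(), script)
    return index


def default_script(index, language_tag):
    """Return the default script for language_tag, or None if unknown."""
    return index.get(language_tag.lower())


INDEX = build_default_script_index(SAMPLE_LANGTAGS)
arb_script = default_script(INDEX, "arb")
kn_script = default_script(INDEX, "kn-IN")
unknown_script = default_script(INDEX, "xyz")
```

Building the index once and looking tags up case-insensitively keeps each query a single dictionary access; a tag that is absent returns `None`, so the caller can decide how to fall back.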