sillsdev / serval

A REST API for natural language processing services
MIT License

<Research>: Transliteration support for non-NLLB scripts #353

Open mmartin9684-sil opened 5 months ago

mmartin9684-sil commented 5 months ago

When a source/target language is written in a script that is not known to NLLB, drafting accuracy to/from that language seems to improve if the text is "transliterated" to Latin script using the uroman utility. However, the uroman utility is not bidirectional, so there is no way to convert the resulting predictions/translations back into the original script.

Further research is needed into the options for a bidirectional solution, such as TecKit maps from WSTech. Research would also be useful to identify which scenarios can benefit from bidirectional transliteration (new languages in new scripts; new languages in existing scripts (which scripts?); etc.).

johnml1135 commented 5 months ago

There are likely 3 options available:

  1. Custom rules per language using TecKit maps that are reversible
  2. Using an NLP model for transliteration(ish). Here is what I found with a quick search:
  3. Mapping to Roman characters with a reversible algorithm that preserves uniqueness but may have no bearing on how the words actually sound.
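
To make option 3 concrete, here is a minimal Python sketch of one possible reversible mapping. It is not an existing Serval or uroman feature; the function names (`build_table`, `encode`, `decode`) and the fixed-width token scheme are assumptions made purely for illustration. Every letter or non-ASCII code point found in a corpus is assigned a unique two-letter lowercase token, so the output is pure ASCII and can be decoded exactly, but the tokens carry no phonetic information.

```python
import itertools
import string

WIDTH = 2  # assumed token width; 26**2 = 676 distinct characters at most


def needs_token(ch: str) -> bool:
    """Letters and all non-ASCII code points are tokenized; the rest passes through."""
    return ch.isalpha() or ord(ch) > 127


def build_table(corpus: str, width: int = WIDTH) -> dict[str, str]:
    """Assign each distinct character that needs a token a unique fixed-width token."""
    chars = sorted({ch for ch in corpus if needs_token(ch)})
    if len(chars) > 26 ** width:
        raise ValueError("character inventory too large for this token width")
    tokens = ("".join(p) for p in itertools.product(string.ascii_lowercase, repeat=width))
    return dict(zip(chars, tokens))


def encode(text: str, table: dict[str, str]) -> str:
    """Replace tokenized characters; raises KeyError for characters not in the table."""
    return "".join(table[ch] if needs_token(ch) else ch for ch in text)


def decode(roman: str, table: dict[str, str], width: int = WIDTH) -> str:
    """Invert encode(): lowercase ASCII runs are read back one token at a time."""
    reverse = {token: ch for ch, token in table.items()}
    out = []
    i = 0
    while i < len(roman):
        if roman[i] in string.ascii_lowercase:
            out.append(reverse[roman[i:i + width]])
            i += width
        else:
            out.append(roman[i])
            i += 1
    return "".join(out)


# Example round trip on a short Devanagari string.
sample = "नमस्ते दुनिया 123"
table = build_table(sample)
roman = encode(sample, table)
restored = decode(roman, table)
```

The round trip is exact only at the code-point level, and only for text that is unchanged between `encode` and `decode`. Model output containing letter sequences that are not valid tokens would fail to decode, which is one of the things the research would need to assess.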

This investigation looks like a good candidate for a senior research project for GCC. @Enkidu93 and @ddaspit - what do you think?