When a source/target language is written in a script that is not known to NLLB, drafting accuracy to/from that language seems to improve if the text is "transliterated" to Latin script using the uroman utility. However, uroman is not bidirectional, so there is no way to generate predictions/translations back in the original script.
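For reference, a minimal sketch of the romanization step, assuming the Python port of uroman (`pip install uroman`); the `Uroman`/`romanize_string` names follow that package's README:

```python
# Sketch: romanize non-Latin text with the Python port of uroman.
# Assumes `pip install uroman`; API names follow the isi-nlp/uroman README.
import uroman as ur

romanizer = ur.Uroman()

# Romanization is lossy and one-way: distinct source strings can
# collapse to the same Latin form, so it cannot be inverted directly.
for text in ["Νεπάλ", "नेपाल"]:
    print(text, "->", romanizer.romanize_string(text))
```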
Further research into options for a bidirectional solution is needed, such as TecKit maps from WSTech. It would also be useful to identify which scenarios can benefit from bidirectional transliteration (new languages in new scripts; new languages in existing scripts, and if so, which scripts; etc.).
https://github.com/YerevaNN/translit-rnn
If we train the model on both the original script and the uromanized script, the model may be able to re-insert the information lost during romanization. An RNN or similar may be good enough, though a transformer may be better able to disambiguate cases where two words in the original script map to the same romanized representation. See the data-preparation sketch below.
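As a starting point, a sketch of how training pairs for such a de-romanizer could be assembled; the file names are hypothetical, and the uroman API is assumed as in the sketch above. Each line pairs the romanized form (model input) with the original script (model target):

```python
# Sketch: build (romanized -> original script) pairs for training a
# character-level de-romanizer (RNN or transformer). File names are
# hypothetical; uroman API as in the sketch above.
import uroman as ur

romanizer = ur.Uroman()

with open("corpus.orig.txt", encoding="utf-8") as src, \
     open("deromanize.train.tsv", "w", encoding="utf-8") as out:
    for line in src:
        original = line.strip()
        if not original:
            continue
        romanized = romanizer.romanize_string(original)
        # Input is the lossy romanization; target is the original script,
        # so the model learns to re-insert the information uroman drops.
        out.write(f"{romanized}\t{original}\n")
```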
Another option is mapping to Roman characters with a reversible algorithm that preserves uniqueness but may have no bearing on how the words actually sound; a toy example follows below.
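To make that idea concrete, here is a minimal sketch of one such reversible scheme (not a proposal for the actual mapping): ASCII passes through unchanged, and everything else is escaped as a unique Latin-only sequence, so decoding is exact even though the output ignores pronunciation entirely:

```python
# Sketch: a reversible "romanization" that preserves uniqueness but
# ignores pronunciation. Non-ASCII characters (and the reserved escape
# marker 'q') are escaped as "qx" + hex code point + "q", so every
# string round-trips exactly.
import re

def encode(text: str) -> str:
    out = []
    for ch in text:
        if ch.isascii() and ch != "q":  # 'q' is reserved as the escape marker
            out.append(ch)
        else:
            out.append(f"qx{ord(ch):x}q")
    return "".join(out)

def decode(text: str) -> str:
    # Every literal 'q' was escaped, so any "qx...q" run is an escape.
    return re.sub(r"qx([0-9a-f]+)q", lambda m: chr(int(m.group(1), 16)), text)

s = "नेपाल Nepal"
assert decode(encode(s)) == s
print(encode(s))
```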
This investigation looks like a good candidate for a senior research project for GCC. @Enkidu93 and @ddaspit - what do you think?