When a source/target language is written in a script that is not known to NLLB, drafting accuracy to/from that language seems to improve if the text is "transliterated" to Latin script using the uroman utility. However, uroman is not bidirectional, so there is no way to generate predictions/translations back in the original script.
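For reference, a minimal sketch of the romanization step, assuming the Python port of uroman (`pip install uroman`); the `Uroman`/`romanize_string` names follow that package's README:

```python
# Sketch: romanize non-Latin text with the Python port of uroman.
# Assumes `pip install uroman`; API names follow the isi-nlp/uroman README.
import uroman as ur

romanizer = ur.Uroman()

# Romanization is lossy and one-way: distinct source strings can
# collapse to the same Latin form, so it cannot be inverted directly.
for text in ["Νεπάλ", "नेपाल"]:
    print(text, "->", romanizer.romanize_string(text))
```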
Further research into options for a bidirectional solution is needed, such as TecKit maps from WSTech. It would also be useful to identify which scenarios can benefit from bidirectional transliteration (new languages in new scripts; new languages in existing scripts, and if so, which scripts; etc.).
https://github.com/YerevaNN/translit-rnn
If we train the model on both the original script and the uromanized script, the model may be able to re-insert the information lost during romanization. An RNN or similar may be good enough, though a transformer may be better able to disambiguate cases where two words in the original script map to the same romanized representation. See the data-preparation sketch below.
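As a starting point, a sketch of how training pairs for such a de-romanizer could be assembled; the file names are hypothetical, and the uroman API is assumed as in the sketch above. Each line pairs the romanized form (model input) with the original script (model target):

```python
# Sketch: build (romanized -> original script) pairs for training a
# character-level de-romanizer (RNN or transformer). File names are
# hypothetical; uroman API as in the sketch above.
import uroman as ur

romanizer = ur.Uroman()

with open("corpus.orig.txt", encoding="utf-8") as src, \
     open("deromanize.train.tsv", "w", encoding="utf-8") as out:
    for line in src:
        original = line.strip()
        if not original:
            continue
        romanized = romanizer.romanize_string(original)
        # Input is the lossy romanization; target is the original script,
        # so the model learns to re-insert the information uroman drops.
        out.write(f"{romanized}\t{original}\n")
```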
Another option is mapping to Roman characters with a reversible algorithm that preserves uniqueness but may have no bearing on how the words actually sound; a toy example follows below.
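To make that idea concrete, here is a minimal sketch of one such reversible scheme (not a proposal for the actual mapping): ASCII passes through unchanged, and everything else is escaped as a unique Latin-only sequence, so decoding is exact even though the output ignores pronunciation entirely:

```python
# Sketch: a reversible "romanization" that preserves uniqueness but
# ignores pronunciation. Non-ASCII characters (and the reserved escape
# marker 'q') are escaped as "qx" + hex code point + "q", so every
# string round-trips exactly.
import re

def encode(text: str) -> str:
    out = []
    for ch in text:
        if ch.isascii() and ch != "q":  # 'q' is reserved as the escape marker
            out.append(ch)
        else:
            out.append(f"qx{ord(ch):x}q")
    return "".join(out)

def decode(text: str) -> str:
    # Every literal 'q' was escaped, so any "qx...q" run is an escape.
    return re.sub(r"qx([0-9a-f]+)q", lambda m: chr(int(m.group(1), 16)), text)

s = "नेपाल Nepal"
assert decode(encode(s)) == s
print(encode(s))
```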
This investigation looks like a good candidate for a senior research project for GCC. @Enkidu93 and @ddaspit - what do you think?