With Mahdi, we have identified a number of challenges peculiar to Farsi:
Persian writers often use several different characters for the same letter, so the text needs "normalisation", probably with character maps.
In practice, Persian writers are not strict about spacing: the same Farsi word can appear with a space between its parts, with no space at all, or with a ZWNJ (zero-width non-joiner) character.
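The two normalisation points above can be sketched as follows. This is a minimal illustration, not a finished normaliser: the character map only covers three well-known Arabic/Persian look-alikes and would need to be extended, and the names CHAR_MAP, normalise_chars and spacing_key are hypothetical. The idea is to map variant characters to one canonical form and to compare words on a key that ignores spaces and ZWNJ.

```python
import re

# Assumed (incomplete) map of Arabic code points that are often typed
# in place of their Persian counterparts.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
}
_TRANSLATION = str.maketrans(CHAR_MAP)

# Whitespace and ZWNJ (U+200C) are both treated as optional separators.
_SEPARATORS = re.compile("[\\s\u200C]+")


def normalise_chars(text: str) -> str:
    """Replace variant characters by their canonical Persian form."""
    return text.translate(_TRANSLATION)


def spacing_key(word: str) -> str:
    """Return a comparison key that ignores spaces and ZWNJ inside a word."""
    return _SEPARATORS.sub("", normalise_chars(word))


# The same word written three ways: with ZWNJ, with a space, and joined.
mi, ravam = "\u0645\u06CC", "\u0631\u0648\u0645"
variants = [mi + "\u200C" + ravam, mi + " " + ravam, mi + ravam]
print(len({spacing_key(v) for v in variants}))  # number of distinct keys
```

The key is only meant for matching variants against a dictionary; which spelling to keep as the canonical output (for example the ZWNJ form) is a separate decision.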
Transliteration of single words:
Mahdi has found large dictionaries of Farsi words with their transliteration for each part of speech (N, V, ...).
The above table is quite extensive and could be used.
Research shows that transliteration can be learned better with neural networks (NNets) than with hand-written rules.
The resulting transliteration does not seem to be aligned with the Interscript one, so a mapping is probably required.
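A dictionary keyed by word and part of speech, followed by a map into the target scheme, could look like the sketch below. Everything here is an assumption for illustration: the three entries are hand-picked examples, not taken from the dictionaries Mahdi found, and SCHEME_MAP is a hypothetical one-symbol map, not the actual difference between the dictionary's romanisation and the Interscript one.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lexicon: (Farsi word, PoS tag) -> transliteration.
# The same spelling can be read differently depending on its part of speech.
LEXICON: Dict[Tuple[str, str], str] = {
    ("\u0645\u0631\u062F", "N"): "mard",              # "man" (noun)
    ("\u0645\u0631\u062F", "V"): "mord",              # "died" (verb)
    ("\u06A9\u062A\u0627\u0628", "N"): "ket\u00E2b",  # "book" (noun)
}

# Hypothetical map from the dictionary's symbols to the target scheme:
# here only a-circumflex (U+00E2) -> a-macron (U+0101).
SCHEME_MAP = str.maketrans({"\u00E2": "\u0101"})


def transliterate_word(word: str, pos: str) -> Optional[str]:
    """Look up (word, pos) and convert the result to the target scheme.

    Returns None when the lexicon has no entry, so the caller can fall
    back to another method (rules or a learned model).
    """
    latin = LEXICON.get((word, pos))
    if latin is None:
        return None
    return latin.translate(SCHEME_MAP)


print(transliterate_word("\u0645\u0631\u062F", "N"))
print(transliterate_word("\u0645\u0631\u062F", "V"))
```

Returning None for unknown words keeps the lookup separate from whatever fallback is chosen later.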
Transliteration of several words:
In Farsi, words take prefixes or suffixes depending on their position and role in a sentence.
As a consequence, we are considering using PoS (part-of-speech) tagging.
PoS tagging: algorithms exist that do this for Farsi; we need to research the available software and possibly compare tools or even train our own model.
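To make the role of PoS tags concrete, here is a minimal sketch of sentence-level transliteration driven by tags. It rests on several assumptions: the tags are supplied by hand and stand in for the output of a real tagger that still has to be chosen, the tag names N and ADJ and the lexicon entries are hypothetical, and the only affix handled is the ezafe "-e" linking a noun to a following adjective, which is a deliberately simplified rule chosen here as an example.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: (Farsi word, PoS tag) -> transliteration.
LEXICON: Dict[Tuple[str, str], str] = {
    ("\u06A9\u062A\u0627\u0628", "N"): "ket\u00E2b",  # "book"
    ("\u0628\u0632\u0631\u06AF", "ADJ"): "bozorg",    # "big"
}


def transliterate_tagged(tokens: List[Tuple[str, str]]) -> str:
    """Transliterate a list of (word, tag) pairs into one Latin string.

    Unknown (word, tag) pairs are rendered as "?" so gaps stay visible.
    """
    out = []
    for i, (word, pos) in enumerate(tokens):
        latin = LEXICON.get((word, pos), "?")
        next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else None
        # Simplified assumption: a noun directly followed by an adjective
        # takes the (unwritten) ezafe suffix "-e".
        if pos == "N" and next_pos == "ADJ":
            latin += "-e"
        out.append(latin)
    return " ".join(out)


# Hand-tagged input standing in for real tagger output: "big book".
tagged = [("\u06A9\u062A\u0627\u0628", "N"), ("\u0628\u0632\u0631\u06AF", "ADJ")]
print(transliterate_tagged(tagged))
```

The point of the sketch is the data flow: the tagger output selects both the dictionary reading and the affix, so the quality of the tagger directly bounds the quality of the transliteration.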
Farsi
Transliteration in Farsi
Ideas (good and bad)
Plan