interscript / rababa

Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)

Farsi #41

Open gilgameshjw opened 2 years ago

gilgameshjw commented 2 years ago

Farsi

Transliteration in Farsi

With Mahdi, we have identified a number of challenges peculiar to Farsi:

  1. Persian writers may use several different characters for the same letter, which requires "normalisation" work, probably with maps.
  2. In practice, Persian writers are not strict about spacing: the same Farsi word can appear with or without spaces between its parts, or with a ZWNJ (zero-width non-joiner) character.
  3. Transliteration of single words:
    • Mahdi has found large dictionaries of Farsi words, with transliterations for their various parts of speech (N, V, ...).
    • This dictionary table is quite extensive and could be used.
    • Research suggests that transliteration is learned better by neural networks than by rules.
    • The resulting transliteration does NOT seem to be aligned with the interscript one (probably requiring maps).
  4. Transliteration of several words
    • In Farsi, words take prefixes and suffixes depending on their position and role in a sentence.
    • As a consequence, we are thinking of using PoS tagging.
    • PoS tagging: there are algorithms that do this for Farsi; we need to research the available software and possibly compare tools or even train our own.
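Challenges 1 and 2 above could be handled with a small preprocessing step. The following is only a minimal sketch, not rababa code: `CHAR_MAP`, `normalise` and `spacing_key` are hypothetical names, and the three character pairs are just the commonly seen Arabic/Persian code point confusions, not a complete map.

```python
import re

# Hypothetical normalisation map (assumption, not exhaustive):
# Arabic code points that are often typed in place of the Persian ones.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
}
_TABLE = str.maketrans(CHAR_MAP)


def normalise(text: str) -> str:
    """Map character variants to one canonical form and collapse whitespace."""
    text = text.translate(_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def spacing_key(word: str) -> str:
    """Lookup key that ignores spaces and ZWNJ (U+200C), so that the
    spaced, ZWNJ-joined and fully joined spellings of a word compare equal."""
    return re.sub(r"[\s\u200C]+", "", normalise(word))


# Three spellings of the same word: with ZWNJ, with a space (and an
# Arabic yeh), and fully joined.
with_zwnj = "\u0645\u06CC\u200C\u0631\u0648\u0645"
with_space = "\u0645\u064A \u0631\u0648\u0645"
joined = "\u0645\u06CC\u0631\u0648\u0645"
same_key = spacing_key(with_zwnj) == spacing_key(with_space) == spacing_key(joined)
```

Using a separate key for lookup, rather than deleting ZWNJ from the text itself, keeps the original spelling available for later steps.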

Ideas (good and bad)

Plan

  1. Look for mappings: Farsi $\Rightarrow$ approximate Latin. Done.
  2. Stats of collisions and concept validation: 952 collisions for a 50k-word dictionary, 0.5% at word level. Done, validated.
  3. Create a git branch so that Mahdi and Jair can collaborate. Done.
  4. Run the simplest possible transliteration:
    • Mahdi provides the dataset
    • Jair builds a naive map and transliterates (model 0)
    • Ronald, Mahdi, Jair: feedback
  5. Review NLP libraries, codebases and research in Farsi.
  6. Improve (char normalisation, preprocessing and PoS)
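Steps 2 and 4 of the plan might look roughly like the sketch below. It assumes the dictionary is available as (Farsi word, transliteration) pairs and that a "collision" means one written form with more than one transliteration; neither assumption is confirmed in this issue. All function names are hypothetical and the three entries are toy data, not taken from Mahdi's dataset.

```python
from collections import defaultdict


def build_map(entries):
    """Group transliterations by written form.

    entries: iterable of (farsi_word, transliteration) pairs (assumed format).
    """
    candidates = defaultdict(set)
    for word, latin in entries:
        candidates[word].add(latin)
    return candidates


def collisions(candidates):
    """Written forms that have more than one transliteration (assumed definition)."""
    return {w: sorted(t) for w, t in candidates.items() if len(t) > 1}


def transliterate(sentence, candidates):
    """Naive word-by-word lookup ("model 0").

    Ambiguous words get their alphabetically first candidate,
    unknown tokens are passed through unchanged.
    """
    out = []
    for token in sentence.split():
        options = candidates.get(token)
        out.append(min(options) if options else token)
    return " ".join(out)


# Toy entries for illustration only.
toy_entries = [("گل", "gol"), ("گل", "gel"), ("کتاب", "ketab")]
toy_map = build_map(toy_entries)
toy_collisions = collisions(toy_map)
collision_rate = len(toy_collisions) / len(toy_map)
toy_output = transliterate("کتاب گل xyz", toy_map)
```

Picking the alphabetically first candidate is an arbitrary placeholder; resolving such ambiguities is where the PoS tagging of step 6 would come in.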