interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Implement Arabic transliteration and a "fully-pointed Arabic" form #309

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

Some written scripts do not usually specify all phonemic elements in a word. These include Arabic, Syriac, and Hebrew.

Arabic script as normally written omits the 3 short vowels and the shadda (the consonant-doubling mark), so a transliteration into another script cannot recover them from the text alone. The fully expressed form of Arabic script is called "pointed script", but it still does not necessarily represent all linguistic elements (e.g. shadda). Most importantly, in the GNDB, most Arabic-script entries are written in non-pointed form.

The only trusted mechanism today for filling in the 3 short vowels and the shadda is learned experience of the conventions of the language and culture (or machine learning).

In order to transliterate an Arabic word through different transliteration systems, we first need to extract the missing linguistic information from existing transliterations (from the GNDB) to generate the "fully pointed Arabic" (Arabic script with full linguistic information). Once we have that, we can transliterate these words using any transliteration system.

Therefore the approach is:

  1. For every existing Arabic/transliteration pair, generate the "fully pointed Arabic". E.g. for (Makkah, َمكة) in the BGN system, we need to extract the two short "a" vowels and the "kk" shadda.

  2. Using the "fully pointed Arabic", we can generate transliterations.

  3. Using the "fully pointed Arabic" and the original "unpointed Arabic", we can feed this into machine-learning (per language) to potentially allow a mapping from "unpointed Arabic" to "fully pointed Arabic".
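Step 1 can be sketched as a simple left-to-right alignment of the two strings. This is only a minimal illustration: the `CONSONANTS` table, the `point_arabic` method and the `AlignmentError` class are hypothetical names, the table covers just a handful of letters, and it is not the BGN rule set. Long vowels, digraphs such as "sh" and the definite article are deliberately ignored.

```ruby
# Hypothetical, deliberately tiny tables for illustration (not the BGN rules).
FATHA  = "\u064E"
DAMMA  = "\u064F"
KASRA  = "\u0650"
SHADDA = "\u0651"

CONSONANTS = {
  "\u0645" => "m", # meem
  "\u0643" => "k", # kaf
  "\u0628" => "b", # ba
  "\u062F" => "d", # dal
  "\u0629" => "h", # ta marbuta, assumed to be romanized as "h" here
}.freeze

SHORT_VOWELS = { "a" => FATHA, "u" => DAMMA, "i" => KASRA }.freeze

class AlignmentError < StandardError; end

# Walk the Arabic letters and consume the Latin string in parallel,
# emitting a shadda for a doubled consonant and a mark for a short vowel.
def point_arabic(latin, arabic)
  rest = latin.downcase
  # Drop any marks already present so the input is treated as unpointed.
  letters = arabic.gsub(/[\u064B-\u0652]/, "")
  out = +""
  letters.each_char do |letter|
    cons = CONSONANTS.fetch(letter) do
      raise AlignmentError, "no rule for #{letter.inspect}"
    end
    unless rest.start_with?(cons)
      raise AlignmentError, "expected #{cons.inspect} at #{rest.inspect}"
    end
    rest = rest[cons.length..]
    out << letter
    if rest.start_with?(cons) # doubled consonant in Latin => shadda
      out << SHADDA
      rest = rest[cons.length..]
    end
    mark = SHORT_VOWELS[rest[0]]
    if mark # short vowel in Latin => vowel mark on this letter
      out << mark
      rest = rest[1..]
    end
  end
  raise AlignmentError, "unconsumed Latin: #{rest.inspect}" unless rest.empty?
  out
end

puts point_arabic("Makkah", "\u0645\u0643\u0629")
```

A pair that does not align (for example "Mecca" against the same Arabic) raises `AlignmentError` instead of producing a guess, which lets the caller set such rows aside.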

Some rules:

This is a blocker to:

#244 #236 #219 #33 #32 #26 #25 #12 #11 #7

ronaldtse commented 4 years ago

The goal of this task is to transliterate Arabic place names using different transliteration systems, e.g. some systems write "Mekkah", some "Mecca".

Arabic, in my very simplistic understanding, does not usually write out short vowels or the shadda.

In this screenshot, you can see that the Arabic does not write out all short vowels:

image

One would only be able to fill in those short vowels if one knows the language and context well. The GNDB, as extracted into https://github.com/interscript/geonames-transliteration-data, contains a database of human-transliterated place names.

This place name database therefore already contains the short vowels and shadda information in the transliteration columns.

We wish to reverse-transliterate these Latin-script forms back to a "fully pointed form of Arabic", such as k => ك, kk => كّ.
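One way to picture the reverse mapping: invert a forward rule table and apply the longest Latin keys first, so that "kk" is tried before "k". The `FORWARD` table and the `invert_rules` and `reverse_transliterate` methods below are hypothetical and cover only four rules. They are not part of the `interscript` gem.

```ruby
FATHA  = "\u064E"
SHADDA = "\u0651"

# Hypothetical forward rules (Arabic => Latin), for illustration only.
FORWARD = {
  "\u0645"            => "m",  # meem
  "\u0643"            => "k",  # kaf
  "\u0643#{SHADDA}"   => "kk", # kaf + shadda
  FATHA               => "a",
}.freeze

# Invert Arabic => Latin into Latin => Arabic. If two Arabic sources share a
# Latin target, the rule set cannot be reversed unambiguously, so refuse.
def invert_rules(forward)
  forward.each_with_object({}) do |(arabic, latin), reverse|
    if reverse.key?(latin)
      raise ArgumentError, "#{latin.inspect} is ambiguous in reverse"
    end
    reverse[latin] = arabic
  end
end

# Replace Latin sequences with Arabic, trying longer keys before shorter ones.
# Characters without a rule are passed through unchanged.
def reverse_transliterate(latin, reverse)
  keys = reverse.keys.sort_by { |key| -key.length }
  latin.downcase.gsub(Regexp.union(keys)) { |match| reverse[match] }
end

REVERSE = invert_rules(FORWARD)
puts reverse_transliterate("Makka", REVERSE)
```

The ambiguity check matters in practice: if, say, both ه and ة were romanized as "h", the Latin alone could not decide between them, which is why the original unpointed Arabic is still needed alongside the transliteration.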

With the "fully pointed form", we can then transliterate (forward) this "fully pointed form" using any Arabic transliteration system, such as ALA (https://www.loc.gov/catdir/cpso/romanization/arabic.docx).
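To illustrate why the pointed form is enough for the forward direction, here is a rough sketch that romanizes one pointed string with two different tables. `TOY_A` and `TOY_B` are invented toy systems, not ALA or BGN, and `romanize` is a hypothetical helper. It assumes the shadda immediately follows its consonant in the input.

```ruby
FATHA      = "\u064E"
SHADDA     = "\u0651"
MEEM       = "\u0645"
KAF        = "\u0643"
TA_MARBUTA = "\u0629"

# Two invented toy systems. They differ only in how ta marbuta is rendered.
TOY_A = { MEEM => "m", KAF => "k", TA_MARBUTA => "h", FATHA => "a" }.freeze
TOY_B = TOY_A.merge(TA_MARBUTA => "").freeze

# Romanize a pointed string one character at a time.
# Assumption: a shadda directly follows the consonant it doubles.
def romanize(pointed, table)
  pieces = []
  pointed.each_char do |ch|
    if ch == SHADDA
      # Repeat the romanization of the preceding consonant.
      pieces << pieces.last unless pieces.empty?
    else
      # Characters without a rule are passed through unchanged.
      pieces << table.fetch(ch, ch)
    end
  end
  pieces.join
end

POINTED = "#{MEEM}#{FATHA}#{KAF}#{SHADDA}#{FATHA}#{TA_MARBUTA}"
puts romanize(POINTED, TOY_A)
puts romanize(POINTED, TOY_B)
```

Both outputs come from the same pointed input. Only the table changes, which is the property the plan relies on.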

(In the given database, each row is an Arabic / Latin transliterated pair.)

For some other languages, it is not so complex: https://github.com/interscript/interscript/pull/304

Some are more complex: https://github.com/interscript/interscript/pull/258

This task is to:

  1. make the framework "work" with Arabic;
  2. enable the generation of fully pointed Arabic;
  3. implement the Arabic => Latin transliteration systems.

ronaldtse commented 4 years ago

Ping @AhMohsen46

ronaldtse commented 4 years ago

@AhMohsen46 Feel free to continue on #33, however:

  1. You will want to work on the transliteration system with a backing data file.

For Arabic, you can see that:

Screen Shot 2020-08-16 at 8 21 48 PM

For Persian,

Screen Shot 2020-08-16 at 8 24 08 PM
  2. You will also need to implement reverse transliteration. Right now, the transliteration systems implemented cannot be used in a reversible way. We currently don't have a method of indicating that a rule can be performed in reverse.

  3. Also note that not all transliteration data in geonames-transliteration-data are correct -- there are some mislabeled or wrongly transliterated entries (they were done by humans). So the script you create should take that into consideration (i.e. don't fail!)
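For the "don't fail" requirement, one possible shape is a loop that collects bad rows instead of aborting. The sketch below is hypothetical: `process_pairs` is not an existing method, the sample rows are made up, and the block stands in for whatever per-pair conversion ends up being written.

```ruby
# Run a per-pair conversion over all rows. A row that raises is recorded in
# `rejects` with its index and reason, and processing continues.
def process_pairs(pairs)
  results = []
  rejects = []
  pairs.each_with_index do |(latin, arabic), index|
    begin
      results << [arabic, yield(latin, arabic)]
    rescue StandardError => e
      rejects << { row: index, latin: latin, arabic: arabic, reason: e.message }
    end
  end
  [results, rejects]
end

# Made-up sample rows: one usable, one with an empty Latin field, one with
# a missing Arabic field.
pairs = [
  ["Makkah", "\u0645\u0643\u0629"],
  ["", "\u0645\u0643\u0629"],
  ["Dubayy", nil],
]

results, rejects = process_pairs(pairs) do |latin, arabic|
  raise ArgumentError, "missing field" if latin.to_s.empty? || arabic.to_s.empty?
  "#{arabic} <= #{latin}"
end

puts "converted: #{results.length}, rejected: #{rejects.length}"
rejects.each { |r| puts "row #{r[:row]}: #{r[:reason]}" }
```

Keeping the rejected rows, rather than silently dropping them, also gives a list of entries that may need correcting in the data repository.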

Thanks!