ronaldtse commented 4 years ago

Some written scripts do not usually specify all phonemic elements in a word. These include Arabic, Syriac, and Hebrew.

When Arabic script is transliterated to other scripts, the 3 short vowels and the shadda (the consonant-doubling mark) are usually missing from the Arabic source. The fully expressed form of Arabic script is called "pointed script", but even that does not necessarily represent all linguistic elements (e.g. the shadda). Most importantly, most Arabic names in the GNDB are written in unpointed form.

The only trusted mechanism today for filling in the 3 short vowels and the shadda is learned experience of the conventions of the language and culture (or machine learning).

In order to transliterate an Arabic word with different transliteration systems, we first need to extract the missing linguistic information from existing transliterations (from the GNDB) to generate the "fully pointed Arabic" (Arabic script with full linguistic information). Once we have that, we can transliterate these words using any transliteration system.

Therefore the approach is:

For every existing Arabic/transliteration pair, generate the "fully pointed Arabic". E.g. for (Makkah, َمكة) in the BGN system, we need to extract the two short "a" vowels and the "kk" shadda.
Using the "fully pointed Arabic", we can generate transliterations in any system.
Using the "fully pointed Arabic" and the original "unpointed Arabic", we can train a machine-learning model (per language) that could potentially map "unpointed Arabic" to "fully pointed Arabic".
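For the machine-learning step, the training pairs could be derived directly from the pointed forms by stripping the marks again. Below is a minimal sketch, assuming the marks of interest are the Unicode code points U+064B to U+0652 (tanwin, short vowels, shadda, sukun); the names `strip_points` and `training_pair` are hypothetical and not part of interscript.

```python
# Arabic combining marks U+064B..U+0652:
# fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_points(pointed):
    """Return the unpointed form by dropping vowel marks, shadda and sukun."""
    return "".join(ch for ch in pointed if ch not in HARAKAT)


def training_pair(pointed):
    """Build an (input, target) pair for a hypothetical unpointed -> pointed model."""
    return strip_points(pointed), pointed


# meem + fatha, kaf + shadda + fatha, ta marbuta ("Makkah", fully pointed)
makkah_pointed = "\u0645\u064E\u0643\u0651\u064E\u0629"
unpointed, target = training_pair(makkah_pointed)
print(unpointed, target)
```

The model itself is left out on purpose; this only shows how the two sides of each training example relate.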

Some rules:

If a consonant is written twice in a row in the transliteration, add a shadda over the corresponding letter in the fully pointed Arabic.
If a short vowel follows a consonant in the transliteration, add that short vowel to the corresponding consonant in the fully pointed Arabic.
If a word begins with a short vowel in the transliteration and the hamza is missing in the Arabic, add the hamza in the fully pointed Arabic.
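The three rules above can be sketched as a greedy left-to-right alignment between the unpointed Arabic and its transliteration. This is only an illustration under several assumptions: the `CONSONANTS` table is a tiny hypothetical subset (not the BGN table), long vowels and digraphs are ignored, rule 3 is interpreted as "replace a bare initial alif with a hamza-carrying alif", and `point` is a made-up name.

```python
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"
SHORT_VOWELS = {"a": FATHA, "u": DAMMA, "i": KASRA}

ALIF = "\u0627"         # bare alif
HAMZA_ABOVE = "\u0623"  # alif with hamza above (used for initial a/u)
HAMZA_BELOW = "\u0625"  # alif with hamza below (used for initial i)

# Hypothetical mini table: Arabic letter -> Latin consonant expected for it.
CONSONANTS = {
    "\u0645": "m",  # meem
    "\u0643": "k",  # kaf
    "\u0629": "h",  # ta marbuta, written "h" in "Makkah"
}


def point(unpointed, latin):
    """Add shadda and short vowels to `unpointed`, guided by `latin`."""
    latin = latin.lower()
    letters = list(unpointed)
    out = []
    i = 0  # current position in the Latin string

    # Rule 3: initial short vowel -> make sure the hamza is written.
    if latin[:1] in SHORT_VOWELS:
        if not letters or letters[0] not in (ALIF, HAMZA_ABOVE, HAMZA_BELOW):
            raise ValueError("initial vowel without an alif carrier")
        carrier = HAMZA_BELOW if latin[0] == "i" else HAMZA_ABOVE
        out += [carrier, SHORT_VOWELS[latin[0]]]
        letters = letters[1:]
        i = 1

    for letter in letters:
        expected = CONSONANTS.get(letter)
        if expected is None or not latin.startswith(expected, i):
            raise ValueError(f"cannot align U+{ord(letter):04X} at Latin index {i}")
        out.append(letter)
        i += len(expected)
        # Rule 1: the same consonant again -> shadda.
        if latin.startswith(expected, i):
            out.append(SHADDA)
            i += len(expected)
        # Rule 2: a short vowel after the consonant -> vowel mark.
        if i < len(latin) and latin[i] in SHORT_VOWELS:
            out.append(SHORT_VOWELS[latin[i]])
            i += 1

    if i != len(latin):
        raise ValueError("unconsumed Latin text: " + latin[i:])
    return "".join(out)


print(point("\u0645\u0643\u0629", "Makkah"))  # meem, kaf, ta marbuta
print(point("\u0627\u0645", "Umm"))           # alif, meem
```

A doubled Latin consonant can also stand for two separate Arabic letters, which this greedy version rejects with a `ValueError` rather than guessing.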

This is a blocker to:

#244 #236 #219 #33 #32 #26 #25 #12 #11 #7

ronaldtse commented 4 years ago

The goal of this task is to transliterate Arabic place names using different transliteration systems. For example, some systems write "Mekkah" and others "Mecca".

Arabic, in my very simplistic understanding, does not usually write out the short vowels or the shadda.

In this screenshot, you can see that the Arabic does not write out all short vowels:

One would only be able to fill in those short vowels by knowing the language and context well. The GNDB, as extracted into https://github.com/interscript/geonames-transliteration-data, contains a database of human-transliterated place names.

This place name database therefore already contains the short vowels and shadda information in the transliteration columns.

We wish to reverse-transliterate this Latin script back into a "fully pointed form" of Arabic, e.g. k => ك, kk => كّ.

With the "fully pointed form", we can then (forward) transliterate it using any Arabic transliteration system, such as ALA (https://www.loc.gov/catdir/cpso/romanization/arabic.docx).
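Once the pointed form exists, the forward direction is mostly a table lookup, with the shadda doubling the previous consonant. A rough sketch follows; `TOY_SYSTEM_A` and `TOY_SYSTEM_B` are invented tables for illustration only and do not reproduce the BGN or ALA rules.

```python
FATHA, DAMMA, KASRA, SHADDA, SUKUN = "\u064E", "\u064F", "\u0650", "\u0651", "\u0652"
VOWEL_MARKS = {FATHA, DAMMA, KASRA, SUKUN}

# Invented tables, one per (hypothetical) transliteration system.
TOY_SYSTEM_A = {
    "\u0645": "m",  # meem
    "\u0643": "k",  # kaf
    "\u0629": "h",  # ta marbuta
    FATHA: "a", DAMMA: "u", KASRA: "i", SUKUN: "",
}
# Same as A, except that the final ta marbuta is not written.
TOY_SYSTEM_B = dict(TOY_SYSTEM_A)
TOY_SYSTEM_B["\u0629"] = ""


def transliterate(pointed, table):
    """Transliterate fully pointed Arabic with the given rule table."""
    out = []
    last_consonant = None  # index in `out` of the most recent consonant
    for ch in pointed:
        if ch == SHADDA:
            if last_consonant is None:
                raise ValueError("shadda without a preceding consonant")
            # Double the consonant, whether the shadda comes before or
            # after the vowel mark in the input.
            out.insert(last_consonant, out[last_consonant])
        elif ch not in table:
            raise ValueError(f"no rule for U+{ord(ch):04X}")
        else:
            if ch not in VOWEL_MARKS:
                last_consonant = len(out)
            out.append(table[ch])
    return "".join(out)


makkah = "\u0645\u064E\u0643\u0651\u064E\u0629"
print(transliterate(makkah, TOY_SYSTEM_A))
print(transliterate(makkah, TOY_SYSTEM_B))
```

Keeping each system as plain data means a new system only needs a new table, not new code.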

(In the given database, each row is an Arabic / Latin transliterated pair.)

For some other languages, it is not so complex: https://github.com/interscript/interscript/pull/304

Some are more complex: https://github.com/interscript/interscript/pull/258

This task is to:

make the framework "work" with Arabic,
enable the generation of fully pointed Arabic
implement the Arabic => Latin transliteration systems

ronaldtse commented 4 years ago

Ping @AhMohsen46

ronaldtse commented 4 years ago

@AhMohsen46 Feel free to continue on #33, however:

You will want to work on the transliteration system with a backing data file.

For Arabic, you can see that:

ara_Arab2Latn_BGN_1956 has 26.6 MB of data
ara_Arab2Latn_ALA_1997 has 14 KB of data
the other systems only have one or a few rows, which makes them hard to test

For Persian,

fas_Arab2Latn_BGN_1958 has 31.2 MB of data
fas_Arab2Latn_ALA_1997 has 24 KB of data
fas_Arab2Latn_AMMI_1959 has 4 KB of data
fas_Arab2Latn_NCO_2004 has 5 KB of data

You will also need to implement reverse transliteration. Right now, the transliteration systems implemented cannot be used in a reversible way. We currently don't have a method of indicating that a rule can be performed in reverse.
Also note that not all transliteration data in geonames-transliteration-data are correct -- there are some mislabeled entries or wrongly transliterated entries (they were done by humans). So the script you create should take that into consideration (i.e. don't fail!)
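To make the "don't fail" requirement concrete, one option is to collect bad rows instead of aborting on them. This is a sketch only: `process_rows` and `toy_convert` are hypothetical names, the rows are assumed to be (Arabic, Latin) tuples, and the converter is a stand-in that merely checks the script of the first column.

```python
def toy_convert(arabic, latin):
    """Stand-in converter: rejects rows whose first column is not Arabic script."""
    if not arabic or not latin:
        raise ValueError("empty field")
    if any(not ("\u0600" <= ch <= "\u06FF") for ch in arabic):
        raise ValueError("first column is not Arabic script")
    return latin.lower()


def process_rows(rows, convert):
    """Convert every row; collect failures instead of stopping at the first one."""
    results, rejected = [], []
    for row_number, row in enumerate(rows, start=1):
        try:
            arabic, latin = row  # a malformed row raises ValueError here
            results.append((arabic, latin, convert(arabic, latin)))
        except ValueError as err:
            rejected.append((row_number, row, str(err)))
    return results, rejected


rows = [
    ("\u0645\u0643\u0629", "Makkah"),  # well-formed pair
    ("Mecca", "Mecca"),                # mislabeled: Latin in the Arabic column
    ("\u0645\u0643\u0629",),           # malformed: missing column
]
results, rejected = process_rows(rows, toy_convert)
print(len(results), "converted;", len(rejected), "rejected")
for row_number, row, reason in rejected:
    print("row", row_number, "skipped:", reason)
```

The rejected list doubles as a report of entries that may need fixing in geonames-transliteration-data.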

Thanks!

interscript / interscript-ruby

Implement Arabic transliteration and a "fully-pointed Arabic" form #309
