Open ronaldtse opened 4 years ago
The goal of this task is to transliterate Arabic place names using different transliteration systems. For example, some systems write "Mekkah" and others "Mecca".
Arabic, in my very simplistic understanding, does not usually write out short vowels and the shadda.
In this screenshot, you can see that the Arabic does not write out all short vowels:
One would only be able to fill in those short vowels if one knows the language and context well. The GNDB, as extracted into https://github.com/interscript/geonames-transliteration-data, contains a database of human-transliterated place names.
This place name database therefore already contains the short vowels and shadda information in the transliteration columns.
We wish to reverse transliterate these Latin-script names back to a "fully pointed form of Arabic", such as k => ك, kk => كّ.
With the "fully pointed form", we can then transliterate it (forward) using any Arabic transliteration system, such as ALA (https://www.loc.gov/catdir/cpso/romanization/arabic.docx).
(In the given database, each row is an Arabic / Latin transliterated pair.)
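As a rough illustration of the forward step, here is a minimal Python sketch that transliterates a fully pointed string with two interchangeable tables. The tables `toy-h` and `toy-plain` and the function `transliterate` are made up for this example and cover only the letters of one word. They are not the ALA or BGN rules, and this is not how interscript defines its maps.

```python
SHADDA = "\u0651"

# Hypothetical toy tables, NOT real ALA/BGN rules. Keys are Arabic code points:
# U+0645 meem, U+0643 kaf, U+0629 ta marbuta, U+064E fatha.
TOY_SYSTEMS = {
    "toy-h":     {"\u0645": "m", "\u0643": "k", "\u0629": "h", "\u064E": "a"},
    "toy-plain": {"\u0645": "m", "\u0643": "k", "\u0629": "",  "\u064E": "a"},
}


def transliterate(pointed, table):
    """Map each character through the table; a shadda doubles the previous output.

    Assumes the shadda comes directly after its consonant, before any short vowel.
    """
    out = []
    for ch in pointed:
        if ch == SHADDA:
            if not out:
                raise ValueError("shadda with no preceding consonant")
            out.append(out[-1])
        elif ch in table:
            out.append(table[ch])
        else:
            raise ValueError(f"no rule for U+{ord(ch):04X}")
    return "".join(out)


# meem + fatha, kaf + shadda + fatha, ta marbuta
pointed = "\u0645\u064E\u0643\u0651\u064E\u0629"
for name, table in TOY_SYSTEMS.items():
    print(name, transliterate(pointed, table))
```

The point of the sketch is only that one pointed string can feed any number of tables without consulting the original human transliteration again.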
For some other languages, it is not so complex: https://github.com/interscript/interscript/pull/304
Some are more complex: https://github.com/interscript/interscript/pull/258
This task is to:
Ping @AhMohsen46
@AhMohsen46 Feel free to continue on #33 , however:
For Arabic, you can see that:
For Persian,
You will also need to implement reverse transliteration. Right now, the transliteration systems implemented cannot be used in a reversible way. We currently don't have a method of indicating that a rule can be performed in reverse.
Also note that not all transliteration data in geonames-transliteration-data are correct -- there are some mislabeled entries or wrongly transliterated entries (they were done by humans). So the script you create should take that into consideration (i.e. don't fail!)
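One possible way to keep the script from failing on bad rows is to set them aside and carry on. A minimal Python sketch, where `convert_all` and `toy_convert` are hypothetical names and the row layout (a Latin/Arabic tuple) is an assumption, not the actual layout of geonames-transliteration-data:

```python
def convert_all(rows, convert):
    """Apply convert(latin, arabic) to every row, setting aside rows that fail."""
    converted, rejected = [], []
    for row in rows:
        try:
            latin, arabic = row  # a malformed row fails here
            converted.append((row, convert(latin, arabic)))
        except Exception as err:  # broad on purpose: one bad row must not stop the run
            rejected.append((row, repr(err)))
    return converted, rejected


def toy_convert(latin, arabic):
    # Stand-in for the real pointing step; it only rejects empty fields.
    if not latin or not arabic:
        raise ValueError("empty field")
    return arabic


rows = [
    ("Makkah", "\u0645\u0643\u0629"),  # well-formed pair
    ("", "\u0645\u0643\u0629"),        # missing transliteration
    ("only-one-field",),               # wrong number of columns
]
converted, rejected = convert_all(rows, toy_convert)
print(len(converted), "converted;", len(rejected), "set aside for review")
```

Keeping the rejected rows together with the error text makes it possible to review mislabeled entries later instead of silently dropping them.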
Thanks!
Some written scripts do not usually specify all phonemic elements in a word. These include Arabic, Syriac, and Hebrew.
When Arabic script is transliterated to other scripts, the 3 short vowels and the shadda (consonant doubling) are typically missing from the source text. The fully expressed form of Arabic script is called "pointed script", but it still does not necessarily represent all linguistic elements (e.g. shadda). Most importantly, in the GNDB, most Arabic names are written in non-pointed form.
The only trusted mechanism today to fill in the 3 short vowels and the shadda is learned experience of the conventions of the language and culture (or machine learning).
In order to transliterate an Arabic word using different transliteration systems, we first need to extract the missing linguistic information from existing transliterations (from the GNDB) to generate the "fully pointed Arabic" (Arabic script with full linguistic information). Once we have that, we can transliterate these words using any transliteration system.
Therefore the approach is:
For every existing Arabic/transliteration pair, generate the "fully pointed Arabic". For example, for (Makkah, َمكة) in the BGN system, we need to extract the two short "a" vowels and the "kk" shadda.
Using the "fully pointed Arabic", we can generate transliterations.
Using the "fully pointed Arabic" and the original "unpointed Arabic", we can feed this into machine-learning (per language) to potentially allow a mapping from "unpointed Arabic" to "fully pointed Arabic".
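Before a generated pair is used for transliteration or as training data, it may be worth checking that the pointed form still reduces to the original unpointed spelling. A minimal Python sketch of such a check, where the names `skeleton` and `is_consistent` are hypothetical, and folding hamza-carrying alefs to a bare alef is an assumption made so that a hamza added during pointing does not count as a mismatch:

```python
# Fathatan through sukun (U+064B..U+0652): tanwin, short vowels, shadda, sukun.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}

# Assumption: treat alef with hamza above/below (U+0623, U+0625) like bare alef (U+0627).
ALEF_FOLD = {"\u0623": "\u0627", "\u0625": "\u0627"}


def skeleton(text):
    """Drop the pointing marks and fold hamza-carrying alefs to bare alef."""
    return "".join(ALEF_FOLD.get(ch, ch) for ch in text if ch not in HARAKAT)


def is_consistent(unpointed, pointed):
    """True when the pointed form reduces to the same skeleton as the original."""
    return skeleton(unpointed) == skeleton(pointed)


unpointed = "\u0645\u0643\u0629"                    # meem, kaf, ta marbuta
pointed = "\u0645\u064E\u0643\u0651\u064E\u0629"    # same letters plus fatha, shadda, fatha
print(is_consistent(unpointed, pointed))
```

Pairs that pass the check could be kept as (unpointed, pointed) examples for the machine-learning step; pairs that fail are candidates for the mislabeled or wrongly transliterated entries.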
Some rules:
If a transliteration character is written twice in a row in a word (e.g. "kk"), then add a shadda over the corresponding letter in the fully pointed Arabic.
If there are short vowels following consonants in the transliteration, add those short vowels to those consonants in the fully pointed Arabic.
If there are short vowels at the beginning of words, and the hamza is lacking in the Arabic, we need to add the hamza in the fully pointed Arabic.
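The three rules above can be sketched in Python as a walk over the unpointed Arabic letters that consumes the Latin transliteration as it goes. Everything here is an assumption for illustration: the function name `point`, the tiny `LATIN_FOR` table (a handful of consonants only, no long vowels or digraphs), and the choice of alef with hamza below for an initial "i". It is not the interscript implementation.

```python
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"
SHORT_VOWELS = {"a": FATHA, "u": DAMMA, "i": KASRA}
ALEF, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW = "\u0627", "\u0623", "\u0625"

# Hypothetical toy table: Arabic consonant -> Latin letter.
LATIN_FOR = {
    "\u0645": "m", "\u0643": "k", "\u0629": "h",  # meem, kaf, ta marbuta
    "\u0628": "b", "\u062F": "d", "\u0646": "n",  # ba, dal, noon
}


def point(latin, arabic):
    """Return a pointed form of `arabic`, guided by its Latin transliteration."""
    s, i, out = latin.lower(), 0, []
    letters = list(arabic)
    # Rule 3: initial short vowel written over a bare alef -> add the hamza.
    if s[:1] in SHORT_VOWELS and letters[:1] == [ALEF]:
        out.append(ALEF_HAMZA_BELOW if s[0] == "i" else ALEF_HAMZA_ABOVE)
        out.append(SHORT_VOWELS[s[0]])
        i, letters = 1, letters[1:]
    for letter in letters:
        expected = LATIN_FOR.get(letter)
        if expected is None or not s.startswith(expected, i):
            raise ValueError(f"cannot align U+{ord(letter):04X} at position {i}")
        out.append(letter)
        i += len(expected)
        # Rule 1: the same Latin consonant again means a shadda on this letter.
        if s.startswith(expected, i):
            out.append(SHADDA)
            i += len(expected)
        # Rule 2: a short vowel after the consonant becomes its vowel mark.
        if i < len(s) and s[i] in SHORT_VOWELS:
            out.append(SHORT_VOWELS[s[i]])
            i += 1
    if i != len(s):
        raise ValueError(f"unaligned Latin remainder: {s[i:]!r}")
    return "".join(out)


pointed = point("Makkah", "\u0645\u0643\u0629")
print([f"U+{ord(ch):04X}" for ch in pointed])
```

The sketch raises `ValueError` whenever the two sides cannot be aligned, which is one way to surface entries whose transliteration does not match the Arabic. A real version would also have to handle long vowels, digraphs such as "sh" and "kh", the definite article, and two identical consonants that are written separately in the Arabic.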
This is a blocker to:
#244 #236 #219 #33 #32 #26 #25 #12 #11 #7