interscript / maps

Script conversion maps for Interscript

Problem with Arabic transliteration #7

Open ronaldtse opened 3 years ago

ronaldtse commented 3 years ago

From @gilgameshjw's run using GNDB data.

| | ara | ara_diacri | ara_latinised | index | DEST_FULL_NAME_RO | dist_edit | dist_jaro_winkler |
|---|---|---|---|---|---|---|---|
| 0 | گرجان | گرِجانَ | grijna | 0 | girjān | 0.666667 | 0.177778 |
| 1 | چم كورك | چمَ كُوَرِكَ | chma kūarika | 1 | cham kūrik | 0.400000 | 0.088889 |
| 2 | وادي نوباندي | وَادِي نُوبَانْدِي | wādī nūbāndī | 2 | wādī nūbāndī | 0.000000 | 0.000000 |
| 3 | وادي خازيانلي | وَادِي خَازِيَانْلِيٍّ | wādī khāziyānlīyin | 3 | wādī khāzyānlī | 0.285714 | 0.074074 |
| 4 | وادي ام بطمة | وَادِي امْ بُطْمَةَ | wādī am buṭmata | 4 | wādī umm buţmah | 0.333333 | 0.238384 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 89 | القباقب | القَبَاقِبُ | al-qabāqibu | 89 | al qabāqib | 0.200000 | 0.093939 |
| 90 | العِقلة | العَقْلَةِ | al-‘aqlahi | 90 | al ‘iqlah | 0.333333 | 0.221693 |
| 91 | الظهرور | الظُّهْرُورُ | al-ẓẓuhrūru | 91 | az̧ z̧ahrūr | 0.636364 | 0.363636 |
| 92 | أم الدنانير | أَمْ الدَّنَانِيرَ | am al-ddanānīra | 92 | umm ad danānīr | 0.428571 | 0.220924 |
| 93 | أرض الرجوم | أَرْضِ الرُّجُومِ | arḍi al-rrujūmi | 93 | arḑ ar rujūm | 0.500000 | 0.166667 |

Some entries clearly differ. In rows 91 and 93, for example, the two Latin columns do not appear to follow the same transliteration system.
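The thread does not say how `dist_edit` was computed. As a rough sketch only, assuming it is a Levenshtein distance divided by the length of the reference name (an inference from the table values, not something stated here), a comparable score could be computed as follows. The function names are made up for this illustration.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(candidate: str, reference: str) -> float:
    """Edit distance divided by the reference length (assumed normalization)."""
    # NFC keeps letters such as "ā" as single code points where possible,
    # so that visually identical strings compare equal.
    candidate = unicodedata.normalize("NFC", candidate)
    reference = unicodedata.normalize("NFC", reference)
    if not reference:
        return 0.0 if not candidate else 1.0
    return levenshtein(candidate, reference) / len(reference)


# Row 0 of the table: Interscript output vs. reference name.
print(round(normalized_edit_distance("grijna", "girjān"), 6))
```

Under that assumption, row 0 needs four edits over a six-letter reference, which gives 4/6 ≈ 0.666667 and agrees with the table.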

@gilgameshjw can you help confirm:

Method to easily reproduce this output? 😉 Thanks!

gilgameshjw commented 3 years ago

Analysis and Answers:

analysis_ALA_1997.csv, analysis_BGN_1956.csv, AnalysisGNDB.txt

AhMohsen46 commented 3 years ago

Hello. Here's what I'm thinking: I'll take the file Jair created and comment on every entry that may contain an error, noting whether it's a mapping issue or a pointing issue.

To start with:

1. I have a strong feeling the map used is not the best match for this dataset. In the examples Ronald provided, the sun-letter rules are applied in the results, while the map used doesn't have sun-letter rules.

2. I can see that the output contains the diacritic on the final letter, which may or may not be omitted depending on the sentence, much as final letters in French are sometimes dropped in pronunciation. This can be handled in the maps with a rule that optionally omits the three main diacritics (fatha, damma, kasra) when they fall on the last letter of a word.
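The two points above can be illustrated outside the maps with a small Python sketch. This is not Interscript's map syntax or implementation. Both function names are hypothetical, the sun-letter list uses ALA-LC-style dot-below romanizations, and the input is assumed to be a single word romanized with an `al-` prefix.

```python
import re
import unicodedata

# Romanized onsets of the sun letters; digraphs come first so that
# "sh" is tried before "s", "th" before "t", and "dh" before "d".
SUN_ONSETS = (
    "th", "dh", "sh",
    "t", "d", "r", "z", "s",
    "\u1e63", "\u1e0d", "\u1e6d", "\u1e93",  # ṣ ḍ ṭ ẓ
    "l", "n",
)

# Fatha, damma, or kasra immediately before whitespace or end of string.
FINAL_SHORT_VOWEL = re.compile(r"[\u064e\u064f\u0650](?=\s|$)")


def assimilate_article(word: str) -> str:
    """Turn 'al-' + sun letter into the assimilated form, e.g. 'ar-'."""
    word = unicodedata.normalize("NFC", word)
    if not word.startswith("al-"):
        return word
    stem = word[3:]
    for onset in SUN_ONSETS:
        if stem.startswith(onset):
            return "a" + onset + "-" + stem
    return word  # moon letter: the article stays 'al-'


def drop_final_short_vowel(arabic_text: str) -> str:
    """Remove a fatha, damma, or kasra that sits on the last letter of a word."""
    return FINAL_SHORT_VOWEL.sub("", arabic_text)


print(assimilate_article("al-rujūm"))  # ar-rujūm
print(assimilate_article("al-qamar"))  # al-qamar
```

Whether the final vowel should be dropped depends on the target system and context, so in practice this would be an optional step rather than a default.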

On 1 Aug 2021, at 5:20 PM, gilgameshjw @.***> wrote:

Analysis and Answers:

GNDB dataset? I am using: GNDBdataset/ara_Arab2Latn_ALA_1997.csv & GNDBdataset/ara_Arab2Latn_BGN_1956.csv

transliteration systems:

interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt

interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1

reproduce output: We attach the Jupyter analysis (in txt format because .ipynb and .py are not allowed), which has some of the commands we ran, in part under /python in rababa. We also added analysis files.

analysis_ALA_1997.csv, analysis_BGN_1956.csv, AnalysisGNDB.txt


gilgameshjw commented 3 years ago

@AhMohsen46 @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could be bad.

Maybe we can investigate that?

ronaldtse commented 3 years ago

@AhMohsen46 as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations. For example, most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC.

Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
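As a rough illustration of what a mis-tag check could look like (this is not Interscript's "detect" feature, whose behaviour is not described in this thread), one simple heuristic relies on the fact that BGN/PCGN 1956 writes the emphatic consonants with a cedilla (ţ, ḑ, z̧) while ALA-LC uses a dot below (ṭ, ḍ, ẓ). The table above is consistent with this, for example "buţmah" versus "buṭmata". The function name and return labels below are hypothetical.

```python
import unicodedata

CEDILLA = "\u0327"    # combining cedilla, as in ţ, ḑ, z̧
DOT_BELOW = "\u0323"  # combining dot below, as in ṭ, ḍ, ẓ


def guess_romanization(name: str) -> str:
    """Guess the romanization family of a Latin-script name from its diacritics."""
    # NFD splits precomposed letters into base letter + combining mark,
    # so both "ţ" and "t" + U+0327 are counted the same way.
    decomposed = unicodedata.normalize("NFD", name)
    cedillas = decomposed.count(CEDILLA)
    dots = decomposed.count(DOT_BELOW)
    if cedillas > dots:
        return "bgnpcgn"
    if dots > cedillas:
        return "alalc"
    return "unknown"  # no distinguishing marks, or a tie


for name in ("wādī umm buţmah", "wādī am buṭmata", "wādī nūbāndī"):
    print(name, "->", guess_romanization(name))
```

Names without emphatic consonants carry no signal under this heuristic, so it can only flag a subset of rows. Comparing each reference name against the output of both systems would cover more cases.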