Problem with Arabic transliteration

ronaldtse commented 3 years ago

From @gilgameshjw's run using GNDB data.

ara is the source
ara_diacri is the diacriticized Arabic produced with rababa
DEST_FULL_NAME_RO is the manual transliteration provided in GNDB
ara_latinised is the output of Interscript

	ara	ara_diacri	ara_latinised	index	DEST_FULL_NAME_RO	dist_edit	dist_jaro_winkler
0	گرجان	گرِجانَ	grijna	0	girjān	0.666667	0.177778
1	چم كورك	چمَ كُوَرِكَ	chma kūarika	1	cham kūrik	0.400000	0.088889
2	وادي نوباندي	وَادِي نُوبَانْدِي	wādī nūbāndī	2	wādī nūbāndī	0.000000	0.000000
3	وادي خازيانلي	وَادِي خَازِيَانْلِيٍّ	wādī khāziyānlīyin	3	wādī khāzyānlī	0.285714	0.074074
4	وادي ام بطمة	وَادِي امْ بُطْمَةَ	wādī am buṭmata	4	wādī umm buţmah	0.333333	0.238384
...	...	...	...	...	...	...	...
89	القباقب	القَبَاقِبُ	al-qabāqibu	89	al qabāqib	0.200000	0.093939
90	العِقلة	العَقْلَةِ	al-‘aqlahi	90	al ‘iqlah	0.333333	0.221693
91	الظهرور	الظُّهْرُورُ	al-ẓẓuhrūru	91	az̧ z̧ahrūr	0.636364	0.363636
92	أم الدنانير	أَمْ الدَّنَانِيرَ	am al-ddanānīra	92	umm ad danānīr	0.428571	0.220924
93	أرض الرجوم	أَرْضِ الرُّجُومِ	arḍi al-rrujūmi	93	arḑ ar rujūm	0.500000	0.166667

Clearly some entries differ. In entries 91 and 93, for example, the Interscript output and the GNDB transliteration appear to follow different transliteration systems.
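For readers who want to check the dist_edit column by hand, here is a minimal sketch. It assumes that dist_edit is the Levenshtein distance between ara_latinised and DEST_FULL_NAME_RO, divided by the length of DEST_FULL_NAME_RO. That is inferred from the rows shown above and is not confirmed anywhere in this thread. The function names are hypothetical, and dist_jaro_winkler is not covered.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(candidate: str, reference: str) -> float:
    """Assumption: normalise by the length of the reference string."""
    if not reference:
        return 0.0 if not candidate else 1.0
    return levenshtein(candidate, reference) / len(reference)


# Row 0 of the table: "grijna" vs "girjān" (ā written as an escape)
print(round(normalized_edit_distance("grijna", "girj\u0101n"), 6))  # 0.666667
```

Strings are compared code point by code point, so a letter stored as base character plus combining mark counts as two characters. Normalising both columns the same way (for example with unicodedata.normalize) before comparing avoids spurious differences.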

@gilgameshjw can you help confirm:

which GNDB dataset are you using?
which transliteration system are you using?

Method to easily reproduce this output? 😉 Thanks!

gilgameshjw commented 3 years ago

Analysis and Answers:

GNDB dataset? I am using GNDBdataset/ara_Arab2Latn_ALA_1997.csv and GNDBdataset/ara_Arab2Latn_BGN_1956.csv

transliteration systems:

interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt
interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1

reproduce output: We attach the Jupyter analysis (in .txt format, because .ipynb and .py attachments are not allowed), which contains some of the commands we ran, in part under /python in rababa. We also attach the analysis files.

analysis_ALA_1997.csv analysis_BGN_1956.csv AnalysisGNDB.txt

AhMohsen46 commented 3 years ago

Hello, here's what I'm thinking: I'll take the file Jair created and comment on every entry that may contain an error, noting whether it's a mapping issue or a pointing issue.

To start with:

1. I have a strong feeling that the map used is not the best match for this dataset. In the examples Ronald provided, the sun-letter rules are applied in the GNDB transliterations (for instance "ar rujūm" in entry 93), while the map used doesn't have sun-letter rules.

2. I can see that the output contains the final-letter diacritic, which may or may not be omitted depending on the sentence, much like how final letters in French are sometimes dropped in pronunciation. This can be handled in the maps with a rule that optionally omits the three main diacritics (fatha, damma, kasra) when they fall on the last letter of a word.
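To illustrate the second point, here is a rough Python sketch of the idea as a preprocessing step on the diacritized Arabic, before it is passed to the transliterator. It is not an Interscript map rule, and the function name is hypothetical. It only handles fatha, damma and kasra at the very end of a word. Tanwin, and a shadda stored after the final vowel, are deliberately left alone.

```python
import re

# Fatha (U+064E), damma (U+064F) or kasra (U+0650) that is the last
# character of a word, i.e. followed by whitespace or end of string.
FINAL_SHORT_VOWEL = re.compile(r"[\u064E\u064F\u0650](?=\s|$)")


def drop_final_short_vowels(text: str) -> str:
    """Remove a word-final short-vowel diacritic from each word."""
    return FINAL_SHORT_VOWEL.sub("", text)


# Entry 89 (al-qabāqibu), written with escapes: the final damma is dropped.
word = "\u0627\u0644\u0642\u064E\u0628\u064E\u0627\u0642\u0650\u0628\u064F"
print(drop_final_short_vowels(word) == word[:-1])  # True
```

Whether dropping the ending is right depends on context, as noted above, so a step like this would probably need to be optional.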


gilgameshjw commented 3 years ago

@AhMohsen46, @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could possibly be bad.

Maybe we can investigate that?

ronaldtse commented 3 years ago

@AhMohsen46 as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations. For example, most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC.

Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
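As a rough illustration of what a mis-tagging check could look at (this is a stand-in heuristic, not how Interscript's "detect" feature works), note that in the table above the GNDB values use cedilla letters (arḑ, buţmah) while the alalc-ara-Arab-Latn-1997 output uses dot-below letters (arḍi, buṭmata). The sketch below counts those two combining marks after NFD decomposition. The function name and the returned labels are hypothetical.

```python
import unicodedata

COMBINING_DOT_BELOW = "\u0323"  # as in ḍ, ṭ, ẓ
COMBINING_CEDILLA = "\u0327"    # as in ḑ, ţ, z̧


def guess_romanization_style(text: str) -> str:
    """Guess which diacritic convention a romanized string follows."""
    # NFD splits precomposed letters into base letter + combining mark,
    # so precomposed and decomposed input are counted the same way.
    decomposed = unicodedata.normalize("NFD", text)
    dots = decomposed.count(COMBINING_DOT_BELOW)
    cedillas = decomposed.count(COMBINING_CEDILLA)
    if cedillas > dots:
        return "cedilla-style"
    if dots > cedillas:
        return "dot-below-style"
    return "undetermined"


print(guess_romanization_style("ar\u1E11 ar ruj\u016Bm"))     # cedilla-style
print(guess_romanization_style("ar\u1E0Di al-rruj\u016Bmi"))  # dot-below-style
```

Names without any emphatic consonant (entry 2, for example) come out as "undetermined", so this can only flag a subset of rows. A row whose style disagrees with the tag on its file would be a candidate for manual review.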
