Open ronaldtse opened 3 years ago
GNDB dataset? I am using: GNDBdataset/ara_Arab2Latn_ALA_1997.csv and GNDBdataset/ara_Arab2Latn_BGN_1956.csv
Transliteration systems:
interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt
interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1
Reproduce output: We attach the Jupyter analysis (in .txt format, because .ipynb and .py attachments are not allowed), which has some of the commands we ran, partly under /python in rababa. We also added the analysis files:
analysis_ALA_1997.csv, analysis_BGN_1956.csv, AnalysisGNDB.txt
Hello,
Here’s what I’m thinking: I’ll take the file Jair created and comment on every entry that may contain an error, noting whether it’s a mapping issue or a pointing issue.
To start with:
1- I have a strong feeling that the map used is not the best match for this dataset: in the examples provided by Ronald, the sun-letter rules are applied to the results, while the map used doesn’t have sun letters.
2- I can see that the output contains the final-letter diacritic, which may or may not be omitted depending on the sentence, much like how final letters in French are sometimes dropped in pronunciation. This can be handled in the maps, as a rule to optionally omit the three main diacritics (fatha, damma, kasra) when they fall on the last letter of a word.
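The optional rule in point 2 can be sketched in Python as a preprocessing step on the diacritized Arabic text. This is only an illustration of the idea, not an Interscript map rule: the function name `strip_final_short_vowels` is hypothetical, and the sketch assumes words are separated by whitespace (a word followed directly by punctuation is left unchanged).

```python
import re

# The three short-vowel marks named above:
# fatha (U+064E), damma (U+064F), kasra (U+0650).
SHORT_VOWELS = "\u064E\u064F\u0650"
SHADDA = "\u0651"

# A short vowel at the end of a word: it is followed by whitespace or the end
# of the string, optionally with a shadda in between (Unicode-normalized text
# stores the vowel before the shadda).
_FINAL_VOWEL = re.compile("[" + SHORT_VOWELS + "](?=" + SHADDA + r"?(?:\s|$))")


def strip_final_short_vowels(text: str) -> str:
    """Drop a fatha, damma, or kasra carried by the last letter of each word.

    Hypothetical helper: tanwin, sukun, and shadda are deliberately kept.
    """
    return _FINAL_VOWEL.sub("", text)
```

Running the transliteration on both the original and the stripped text would show how many mismatches are explained by final diacritics alone.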
On 1 Aug 2021, at 5:20 PM, gilgameshjw wrote: Analysis and Answers: [...]
@AhMohsen46, @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could be bad.
Maybe we can investigate that?
@AhMohsen46 as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations. For example, most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC.
Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
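One simple way to flag a possibly mis-tagged entry is to transliterate the Arabic source with each candidate system and check which output is closest to the romanization stored in the dataset. The sketch below is an assumption about how that could be done and is not how Interscript's "detect" feature works; `closest_system` is a hypothetical helper, and the candidate outputs are expected to come from separate Interscript runs.

```python
from difflib import SequenceMatcher


def closest_system(reference: str, candidates: dict[str, str]) -> tuple[str, float]:
    """Return the system code whose output best matches the dataset romanization.

    `reference` is the romanization stored in the dataset; `candidates` maps a
    system code (e.g. "bgnpcgn-ara-Arab-Latn-1956") to the output produced for
    the same Arabic source. The score is difflib's similarity ratio in [0, 1].
    """
    scores = {
        system: SequenceMatcher(None, reference, output).ratio()
        for system, output in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

An entry whose closest system differs from its tag in the dataset is a candidate for manual review. A low best score suggests that neither system, or the diacritization itself, fits the entry.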
From @gilgameshjw's run using GNDB data:
There is clearly a difference in certain entries: if you look at 91 and 93, the transliteration system is different.
@gilgameshjw can you help confirm:
Method to easily reproduce this output? 😉 Thanks!
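A reproducible run could look roughly like the following sketch. It wraps the `interscript` command line quoted earlier in the thread (`--system=` and `--output=` are the only flags used) and then lists the entries where the produced transliteration differs from a reference file. The helper names, the one-entry-per-line file layout, and the 1-based entry numbering are all assumptions rather than something confirmed in the thread.

```python
import subprocess
from pathlib import Path


def interscript_command(source: str, system: str, output: str) -> list[str]:
    """Build the interscript invocation used in this thread."""
    return ["interscript", source, f"--system={system}", f"--output={output}"]


def mismatched_entries(expected: list[str], produced: list[str]) -> list[tuple[int, str, str]]:
    """Return (entry number, expected, produced) for every differing entry.

    Entries are numbered from 1 (an assumption); surrounding whitespace is ignored.
    """
    return [
        (number, want, got)
        for number, (want, got) in enumerate(zip(expected, produced), start=1)
        if want.strip() != got.strip()
    ]


def reproduce(source: str, system: str, output: str, reference: str) -> list[tuple[int, str, str]]:
    """Run interscript, then compare its output to the reference line by line.

    Assumes `reference` holds one expected romanization per line, aligned with
    the lines of `source`. Requires the interscript CLI to be installed.
    """
    subprocess.run(interscript_command(source, system, output), check=True)
    expected = Path(reference).read_text(encoding="utf-8").splitlines()
    produced = Path(output).read_text(encoding="utf-8").splitlines()
    return mismatched_entries(expected, produced)
```

Calling `reproduce` once per system code would give a list of differing entries for each system that can be compared against the attached analysis files.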