interscript / geonames-transliteration-data

GeoNames data parsed into transliteration pairs

Some translations are not correct in mon_Cyrl2Latn_BGN_1964.csv #7

Open javkhaanj7 opened 3 years ago

javkhaanj7 commented 3 years ago

Here are a couple of issues we need to fix:

  1. Mongolian uses only Cyrillic letters, but two names in the file are written in Chinese characters: 吉兰泰镇 and 吉兰泰. Solution:

mon_Cyrl2Latn_BGN_1964,mon,17600805,NS,"吉兰泰镇","吉兰泰镇",17600738,N,"Jarantai Zhen","Jarantai Zhen"

replace with:

mon_Cyrl2Latn_BGN_1964,mon,17600805,NS,"Жарантай Сум","Жарантай Сум",17600738,N,"Jarantai Sum","Jarantai Sum"

mon_Cyrl2Latn_BGN_1964,mon,17600803,NS,"吉兰泰","吉兰泰",17600737,N,Jarantai,Jarantai

replace with:

mon_Cyrl2Latn_BGN_1964,mon,17600803,NS,"Жарантай","Жарантай",17600737,N,"Jarantai","Jarantai"

  2. Wrong transliteration of the х character. Solution:

mon_Cyrl2Latn_BGN_1964,mon,18973607,VS,"Сүхбаатар","Сүхбаатар",-3255713,V,"Sükhbaatar","Sükhbaatar"

replace with:

mon_Cyrl2Latn_BGN_1964,mon,18973607,VS,"Сүхбаатар","Сүхбаатар",-3255713,V,"Sühbaatar","Sühbaatar"

  3. Full names are in the wrong order. Solution:

mon_Cyrl2Latn_BGN_1964,mon,11005031,V,Orhon,Orhon,11005352,VS,"Орхон","Орхон"

replace with:

mon_Cyrl2Latn_BGN_1964,mon,11005031,V,"Орхон","Орхон",11005352,VS,Orhon,Orhon
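
To find other rows with the same х problem, a rough check could flag every pair whose Cyrillic name contains х while the Latin name contains "kh". This is only a sketch: the column positions (Cyrillic name in column 4, Latin name in column 8, counting from 0) are assumed from the sample rows above, and `find_kh_rows` is a hypothetical helper, not something in the repository.

```python
import csv
import io

def find_kh_rows(rows, source_col=4, target_col=8):
    """Return rows whose Cyrillic name has х but whose Latin name has 'kh'."""
    flagged = []
    for row in rows:
        source = row[source_col].lower()
        target = row[target_col].lower()
        # "\u0445" is Cyrillic small letter ha (х), not Latin x.
        if "\u0445" in source and "kh" in target:
            flagged.append(row)
    return flagged

# Two sample rows copied from this issue (column layout is an assumption).
sample = (
    'mon_Cyrl2Latn_BGN_1964,mon,18973607,VS,"Сүхбаатар","Сүхбаатар",-3255713,V,"Sükhbaatar","Sükhbaatar"\n'
    'mon_Cyrl2Latn_BGN_1964,mon,17600803,NS,"Жарантай","Жарантай",17600737,N,"Jarantai","Jarantai"\n'
)
rows = list(csv.reader(io.StringIO(sample)))
kh_rows = find_kh_rows(rows)
for row in kh_rows:
    print(row[2], row[4], row[8])
```

The result is a list of candidates for manual review, not a list of confirmed errors: "kh" could also come from a к followed by х in the original name.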

ronaldtse commented 3 years ago

@javkhaanj7 this is good. Could you help go through the Mongolian data to make sure the transliterations are correct? You will need to download all the names for Mongolia from http://geonames.nga.mil/ to check against. Thanks!

ronaldtse commented 3 years ago

Issue 1 is described in #3. We need to detect where the GeoNames database names the wrong transliteration system. For example, it marks 吉兰泰镇 => Jarantai Zhen as mon_Cyrl2Latn_BGN_1964, when that pair actually belongs to the bgnpcgn-zho-Hans-Latn-1964 system in Interscript (so we need to make a mapping between OGC codes, see https://github.com/interscript/interscript/issues/527).
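
One possible way to detect such mislabelled rows is to check the script of the source name against the script the system name claims. The sketch below flags rows labelled mon_Cyrl2Latn_BGN_1964 whose source name contains Han characters. The column positions (system in column 0, source name in column 4) are assumed from the sample rows in this issue, and `has_han` and `mislabelled_rows` are hypothetical helper names.

```python
import csv
import io
import unicodedata

def has_han(text):
    """True if any character is a CJK unified ideograph."""
    return any(
        unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
        for ch in text
    )

def mislabelled_rows(rows, system="mon_Cyrl2Latn_BGN_1964", source_col=4):
    """Return rows tagged with a Cyrillic system but written in Han characters."""
    return [
        row for row in rows
        if row[0] == system and has_han(row[source_col])
    ]

# Sample rows copied from this issue (column layout is an assumption).
sample = (
    'mon_Cyrl2Latn_BGN_1964,mon,17600805,NS,"吉兰泰镇","吉兰泰镇",17600738,N,"Jarantai Zhen","Jarantai Zhen"\n'
    'mon_Cyrl2Latn_BGN_1964,mon,18973607,VS,"Сүхбаатар","Сүхбаатар",-3255713,V,"Sükhbaatar","Sükhbaatar"\n'
)
rows = list(csv.reader(io.StringIO(sample)))
han_rows = mislabelled_rows(rows)
for row in han_rows:
    print(row[2], row[4])
```

This only finds the suspicious rows. Deciding which system they really belong to would still need the code mapping mentioned above.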

Issue 3 is described in #4. The full names are probably in the wrong order because of a coding problem in how we generate the smaller datasets, since the "pairs" CSV files were processed from the original database.
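
Until the generation code is fixed, rows with swapped names could be located with a script check along these lines. This is a sketch under assumptions: the first name is taken to sit in column 4 and the second in column 8 (counting from 0), as in the sample rows in this issue, and `script_of` and `swapped_rows` are hypothetical helpers. It only reports rows and does not rewrite them, since the right fix depends on how the ID and type columns should move.

```python
import csv
import io
import unicodedata

def script_of(text):
    """Return 'Cyrl' or 'Latn' if all letters share that script, else 'other'."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrl")
        elif name.startswith("LATIN"):
            scripts.add("Latn")
        else:
            scripts.add("other")
    return scripts.pop() if len(scripts) == 1 else "other"

def swapped_rows(rows, first_col=4, second_col=8):
    """Return rows where the Latin name comes first and the Cyrillic name second."""
    return [
        row for row in rows
        if script_of(row[first_col]) == "Latn"
        and script_of(row[second_col]) == "Cyrl"
    ]

# Sample rows copied from this issue (column layout is an assumption).
sample = (
    'mon_Cyrl2Latn_BGN_1964,mon,11005031,V,Orhon,Orhon,11005352,VS,"Орхон","Орхон"\n'
    'mon_Cyrl2Latn_BGN_1964,mon,18973607,VS,"Сүхбаатар","Сүхбаатар",-3255713,V,"Sükhbaatar","Sükhbaatar"\n'
)
rows = list(csv.reader(io.StringIO(sample)))
wrong_order = swapped_rows(rows)
for row in wrong_order:
    print(row[2], row[4], row[8])
```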