Open ronaldtse opened 4 years ago
To solve this issue, I have implemented an extensible Ruby script using Interscript: https://github.com/interscript/geotest/ . Unfortunately, dealing with that file felt a lot like scraping, since there is a lot of incorrect data provided.
I have noticed a couple of things. Each row contains some useful fields:
So, in our case, I decided to work with clusters of NAME_LINK/UFI and assume that each cluster contains 1 source row (NS, VS, DS) and a number of transliterated rows. This assumption wasn't 100% correct, but it is a starting point. It can be refined in the future; to help with that, I have made the script output a lot of statistics.
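The clustering assumption could be sketched roughly as follows. This is not the code from the geotest repository: the Row struct, the choice to key on UFI only, and the sample UFI values are all hypothetical and only illustrate the idea.

```ruby
# Hypothetical row shape: only the columns needed for clustering.
Row = Struct.new(:ufi, :nt, :lc, :name, keyword_init: true)

# Name types treated as "source" (non-Roman script) rows, per the assumption above.
SOURCE_TYPES = %w[NS VS DS].freeze

# Group rows by UFI and split each cluster into source and transliterated rows.
def build_clusters(rows)
  rows.group_by(&:ufi).map do |ufi, members|
    sources, transliterated = members.partition { |r| SOURCE_TYPES.include?(r.nt) }
    { ufi: ufi, sources: sources, transliterated: transliterated }
  end
end

# A cluster fits the assumption when it has exactly one source row
# and at least one transliterated row.
def usable?(cluster)
  cluster[:sources].size == 1 && !cluster[:transliterated].empty?
end

# Sample rows; the UFI values 1 and 2 are made-up placeholders.
rows = [
  Row.new(ufi: 1, nt: "NS", lc: "ukr", name: "Кам’янка-Дніпровська"),
  Row.new(ufi: 1, nt: "N",  lc: "ukr", name: "Kam”yanka-Dniprovs’ka"),
  Row.new(ufi: 2, nt: "V",  lc: "rus", name: "Kamenka-Dneprovskaya"),
]

clusters = build_clusters(rows)
# Statistics: cluster size => number of clusters of that size.
size_histogram = clusters.map { |c| c[:sources].size + c[:transliterated].size }.tally
usable = clusters.select { |c| usable?(c) }
```

Clusters that fail usable? are the ones worth counting separately, since they show where the assumption breaks down.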
As a future improvement: some of those databases (most notably the Ukrainian one) contain no information about the transliteration system (8400 clusters out of 28968 contain no transliteration info). For those we could try to deduce the system used, using a feature of Interscript.
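One possible way to deduce the system is to run every candidate map over the source name and keep those that reproduce the recorded Roman name. The sketch below does not call Interscript at all: the candidates are toy lambdas with tiny, purely illustrative character tables, and in a real run each one would presumably wrap an Interscript map instead.

```ruby
# Toy stand-ins for transliteration maps (assumption: real candidates would
# wrap Interscript maps). The tables are deliberately tiny and illustrative.
TOY_UKR = { "и" => "y", "і" => "i", "к" => "k" }.freeze
TOY_RUS = { "и" => "i", "к" => "k" }.freeze

CANDIDATES = {
  "toy-ukr" => ->(s) { s.each_char.map { |c| TOY_UKR.fetch(c, c) }.join },
  "toy-rus" => ->(s) { s.each_char.map { |c| TOY_RUS.fetch(c, c) }.join },
}.freeze

# Return the names of all candidate systems whose output equals the recorded name.
def deduce_systems(source, recorded, candidates = CANDIDATES)
  candidates.select { |_name, fn| fn.call(source) == recorded }.keys
end

only_ukr = deduce_systems("кик", "kyk")
ambiguous = deduce_systems("к", "k")
```

A name can match several systems at once (as in the second call), so a deduction like this is more reliable when it is aggregated over many rows than when it is trusted for a single pair.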
The script can be run with the -v switch to display every failure, including both the original and the transliterated name entries.
(I will copy this post to an issue in GeoTest repository for tracking purposes and in the next post include a result from a run on the attached database).
interscript/geotest#1
# bundle exec ruby test.rb files/up/up.txt
.....
0 records have a non-unique UNI (should be 0)
Out of 58000 related clusters we get 28968 unique related clusters
Unique clusters have 58000 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {3=>42, 2=>28915, 4=>11}
Transliteration systems used:
- "" * 86568 (37356 with a pair)
- "ukr_Cyrl2Latn_BGN_1965" * 19179 (18552 with a pair) implemented in Interscript as bgnpcgn-ukr-Cyrl-Latn-1965
- "rus_Cyrl2Latn_BGN_1947" * 8164 (1979 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947
- "NOT_TRANSLITERATED" * 137 (0 with a pair)
- "ukr_Cyrl2Latn_ALA_1997" * 28 (26 with a pair) implemented in Interscript as alalc-ukr-Cyrl-Latn-1997
- "bel_Cyrl2Latn_BGN_1979" * 13 (13 with a pair) implemented in Interscript as bgnpcgn-bel-Cyrl-Latn-1979
- "rus_Cyrl2Latn_ALA_1997" * 10 (9 with a pair) implemented in Interscript as alalc-rus-Cyrl-Latn-1997
- "tuk_Cyrl2Latn_BGN_1979" * 5 (0 with a pair)
- "ukr_Cyrl2Latn_GUP_1996" * 5 (5 with a pair) implemented in Interscript as ua-ukr-Cyrl-Latn-1996
- "ukr_Cyrl2Latn_ODNI_2005" * 2 (2 with a pair) implemented in Interscript as odni-ukr-Cyrl-Latn-2015
- "amh_Ethi2Latn_BGN_1967" * 2 (2 with a pair) implemented in Interscript as bgnpcgn-amh-Ethi-Latn-1967
- "hye_Armn2Latn_BGN_1981" * 1 (0 with a pair)
- "kat_Geor2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-kat-Geor-Latn-1997
- "urd_Arab2Latn_BGN_2007" * 1 (1 with a pair) implemented in Interscript as bgnpcgn-urd-Arab-Latn-2007
- "amh_Ethi2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-amh-Ethi-Latn-1997
Among the unique clusters:
- 0 clusters are too short
- 1 clusters contain no non-ASCII entries
- 8400 clusters contain no transliteration info
- 15 clusters contain more than 1 non-ASCII entries
- 0 clusters are transliterated with a map not present in Interscript
Remaining 20552 clusters seem to be usable
: 0/50 (0.0%) (Errors: No support in Interscript * 50)
ukr_Cyrl2Latn_BGN_1965: 18046/18515 (97.47%) (Errors: Incorrect transliteration * 385, Incorrect spacing or punctuation * 14, Incorrect punctuation * 60, Incorrect casing and punctuation * 2, Incorrect casing * 7, Incorrect casing and (spacing or punctuation) * 1)
bel_Cyrl2Latn_BGN_1979: 11/11 (100.0%)
rus_Cyrl2Latn_BGN_1947: 1806/1966 (91.86%) (Errors: Incorrect transliteration * 130, Incorrect casing and punctuation * 8, Incorrect spacing or punctuation * 5, Incorrect punctuation * 14, Incorrect casing * 3)
ukr_Cyrl2Latn_ALA_1997: 11/26 (42.31%) (Errors: Incorrect transliteration * 13, Incorrect punctuation * 2)
ukr_Cyrl2Latn_GUP_1996: 3/5 (60.0%) (Errors: Incorrect transliteration * 2)
rus_Cyrl2Latn_ALA_1997: 3/6 (50.0%) (Errors: Incorrect transliteration * 3)
ukr_Cyrl2Latn_ODNI_2005: 1/2 (50.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_BGN_1967: 0/2 (0.0%) (Errors: Incorrect transliteration * 2)
kat_Geor2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
urd_Arab2Latn_BGN_2007: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
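The error categories in the report above could be produced by comparing the database name with the generated name under progressively looser normalisations. The following is a sketch of that idea, not the classifier actually used in test.rb; the method names and the order of the checks are assumptions.

```ruby
# Drop punctuation only.
def without_punct(s)
  s.gsub(/[[:punct:]]/, "")
end

# Drop punctuation and whitespace.
def without_space_punct(s)
  s.gsub(/[[:punct:][:space:]]/, "")
end

# Classify a mismatch between the expected (database) and actual (generated) name.
# Returns nil when the two strings match exactly.
def classify(expected, actual)
  return nil if expected == actual
  return "Incorrect casing" if expected.downcase == actual.downcase
  return "Incorrect punctuation" if without_punct(expected) == without_punct(actual)
  if without_space_punct(expected) == without_space_punct(actual)
    return "Incorrect spacing or punctuation"
  end
  if without_punct(expected).downcase == without_punct(actual).downcase
    return "Incorrect casing and punctuation"
  end
  if without_space_punct(expected).downcase == without_space_punct(actual).downcase
    return "Incorrect casing and (spacing or punctuation)"
  end

  "Incorrect transliteration"
end

# Percentage of passing entries, rounded the way the report prints it.
def accuracy(passed, total)
  (100.0 * passed / total).round(2)
end
```

For example, accuracy(18046, 18515) gives 97.47, which matches the ukr_Cyrl2Latn_BGN_1965 line of the report.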
I see there's only 92% accuracy for the rus_Cyrl2Latn_BGN_1947 map. Yet in the dataset of #52, this map has 98% accuracy.
The Ukraine GeoNames data can be found here (originally downloaded from http://geonames.nga.mil/gns/html/namefiles.html):
up.zip
These are TSV (tab-separated values) files encoded in UTF-8 format. I couldn't figure out how to read them in Excel but Pages opened them happily, so here's a screenshot from Pages:
The schema for these files is provided here: http://geonames.nga.mil/gns/html/gis_countryfiles.html
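If a spreadsheet application is not cooperating, the files can also be read with a few lines of Ruby. This is a minimal sketch that assumes the first line is a header row and that fields never contain tabs; parse_tsv is a hypothetical helper, and the sample rows use placeholder UFI and NT values.

```ruby
# Parse GNS-style TSV text: the first line is the header, the rest are rows.
# Splitting manually avoids CSV quote handling, since names may contain quote characters.
def parse_tsv(text)
  lines = text.each_line.map(&:chomp).reject(&:empty?)
  header = lines.shift.split("\t", -1)
  # The -1 limit keeps trailing empty fields so columns stay aligned with the header.
  lines.map { |line| header.zip(line.split("\t", -1)).to_h }
end

# For a real file (path taken from the test run in this thread):
#   rows = parse_tsv(File.read("files/up/up.txt", encoding: "UTF-8"))
sample = "UFI\tNT\tLC\tFULL_NAME_RO\n1\tN\tukr\tPashkivtsi\n1\tV\t\tPashkovtsy\n"
rows = parse_tsv(sample)
```

Each row becomes a Hash keyed by column name, so rows.first["FULL_NAME_RO"] reads a single field.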
Specifically, the columns relevant to us are:
MGRS. The unique column per location.
NT. Name type.
LC. Language code.
SHORT FORM; GENERIC. These aren't always filled in.
SORT_NAME_RO. This is generated from FULL_NAME_RO according to the definition provided below.
FULL_NAME_RO
FULL_NAME_ND_RO. This is generated from FULL_NAME_RO but stripped of diacritics.
SORT_NAME_RG. This is generated from FULL_NAME_RG according to the definition provided below.
FULL_NAME_RG
FULL_NAME_ND_RG. This is generated from FULL_NAME_RG but stripped of diacritics.
NAME_LINK. This is supposed to connect the "generated" (transliterated) Roman names to their non-Roman name equivalents. The schema says "vice-versa", so it probably only supports pairing of names, even though it is clear that there are multiple ways of transliterating a single non-Roman name.
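The "stripped of diacritics" columns above can be approximated with Unicode normalisation. This is a sketch of the general technique only; the actual GNS rules for the ND columns may treat some characters differently, and strip_diacritics is a hypothetical helper name.

```ruby
# Remove diacritics: decompose to NFD, drop combining marks (Unicode category Mn),
# then recompose whatever is left.
def strip_diacritics(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

plain = strip_diacritics("Kraków")
```

Letters that do not decompose into a base letter plus a combining mark (for example "ł") pass through unchanged, so a full implementation would likely need an extra lookup table for them.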
In summary:
Example 1
ukr for Ukrainian).
e.g. (FULL_NAME_RO) "Пашківці" => "Pashkovtsy", "Pashkivtsi"
Our goals:
Example 2
There are 3 "V" rows, 1 "N" row and 1 "NS" row. The 3 "V" rows and the 1 "N" row are generated. Each of these 4 rows is generated using a different transliteration system.
e.g. (FULL_NAME_RO) "Біласовиця" => "Bilasovitsy", "Bilasovytsya", "Belosovitsa", "Belasovitsa"
Our goals:
Example 3
This example contains TRANSL_CD.
With TRANSL_CD:
(Using FULL_NAME_RO) The "NS | ukr" value "Кам’янка-Дніпровська" is generated into "N | ukr" "Kam”yanka-Dniprovs’ka" using the "ukr_Cyrl2Latn_BGN_1965" system.
The "VS | rus" value "Каменка-Днепровская" is generated into "V | rus" "Kamenka-Dneprovskaya" using the "rus_Cyrl2Latn_BGN_1947" system.
Example 4
TRANSL_CD:
Similar to previous examples. Notice that the "N | ukr" row uses the ukr_Cyrl2Latn_GUP_1996 transliteration system.
Regarding transliteration system codes
In addition to the ukr_Cyrl2Latn_BGN_1965 and rus_Cyrl2Latn_BGN_1947 systems used throughout the file, there are 2 rows with a TRANSL_CD of NOT_TRANSLITERATED, 1 row of ukr_Cyrl2Latn_GUP_1996, 10 rows of ukr_Cyrl2Latn_ALA_1997, and 5 rows of tuk_Cyrl2Latn_BGN_1979.
I believe the system we have implemented for BGN/PCGN is ukr_Cyrl2Latn_BGN_1965.
We will have to implement the remaining systems to ensure the generated transliteration fulfills the requirements of this file.
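To track which systems still need implementing, the TRANSL_CD codes could be kept in a lookup table. The map names below are copied from the test-run report in this thread; the table itself, the nil convention for unsupported codes, and partition_codes are hypothetical and only illustrate the bookkeeping.

```ruby
# TRANSL_CD => Interscript map name, copied from the test-run report in this thread.
# nil marks codes for which the report lists no Interscript implementation.
TRANSL_CD_TO_MAP = {
  "ukr_Cyrl2Latn_BGN_1965" => "bgnpcgn-ukr-Cyrl-Latn-1965",
  "rus_Cyrl2Latn_BGN_1947" => "bgnpcgn-rus-Cyrl-Latn-1947",
  "ukr_Cyrl2Latn_ALA_1997" => "alalc-ukr-Cyrl-Latn-1997",
  "ukr_Cyrl2Latn_GUP_1996" => "ua-ukr-Cyrl-Latn-1996",
  "tuk_Cyrl2Latn_BGN_1979" => nil,
}.freeze

# Split the codes seen in a file into supported and unsupported ones.
# Codes absent from the table count as unsupported too.
def partition_codes(codes, table = TRANSL_CD_TO_MAP)
  codes.uniq.partition { |code| table[code] }
end

supported, missing = partition_codes(
  %w[ukr_Cyrl2Latn_BGN_1965 tuk_Cyrl2Latn_BGN_1979 ukr_Cyrl2Latn_GUP_1996]
)
```

Here missing ends up holding tuk_Cyrl2Latn_BGN_1979, the one code in this sample without an Interscript map according to the report.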
Direction
We will have to: