Test against GeoNames data (Ukrainian)

ronaldtse commented 4 years ago

The Ukraine GeoNames data can be found here (originally downloaded from http://geonames.nga.mil/gns/html/namefiles.html):

up.zip

These are TSV (tab-separated values) files encoded in UTF-8 format. I couldn't figure out how to read them in Excel but Pages opened them happily, so here's a screenshot from Pages:

The schema for these files is provided here: http://geonames.nga.mil/gns/html/gis_countryfiles.html

Specifically, the columns relevant to us are:

MGRS. The unique column per location.
NT. Name type.
LC. Language code.
SHORT_FORM, GENERIC. These aren't always filled in.
SORT_NAME_RO. This is generated from FULL_NAME_RO according to the definition provided below.
FULL_NAME_RO
FULL_NAME_ND_RO. This is generated from FULL_NAME_RO but stripped of diacritics.
SORT_NAME_RG. This is generated from FULL_NAME_RG according to the definition provided below.
FULL_NAME_RG
FULL_NAME_ND_RG. This is generated from FULL_NAME_RG but stripped of diacritics.
NAME_LINK. This is supposed to connect the "generated" (transliterated) Roman names to their non-Roman name equivalents. The schema says "vice-versa", so it probably only supports pairing of names, yet it is clear that there are multiple ways of transliterating a single non-Roman name.

TRANSL_CD. Transliteration Code. This is technically important, but it is often empty in the dataset (this is not shown in the example since it is empty and also difficult to screenshot in the same screen)
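The FULL_NAME_ND_* columns are described above as diacritic-free copies of the full names. A minimal sketch of one way to produce such a copy, assuming Unicode decomposition is close enough to whatever GNS actually does (the schema page is the authority here; `strip_diacritics` and the name "Orël" are illustrative only):

```ruby
# Hypothetical helper: approximate a FULL_NAME_ND_* value from a full name.
def strip_diacritics(name)
  # NFD splits a precomposed letter into base letter + combining mark(s),
  # \p{Mn} matches those nonspacing marks, and NFC recomposes what is left.
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").unicode_normalize(:nfc)
end

puts strip_diacritics("Orël")        # illustrative Roman-script name with a diaeresis
puts strip_diacritics("Pashkivtsi")  # no diacritics, so it comes back unchanged
```

Note that this naive approach is only reasonable for the Roman-script column: in Cyrillic, "й" and "ї" also decompose into a base letter plus a combining mark, so the same code would turn them into "и" and "і".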

In summary:

only the NS row is source data, and only the columns FULL_NAME_RO and FULL_NAME_RG are source data within that row.

Example 1

The NS row is the name in the original language/script (LC says ukr for Ukrainian).
The "V | rus" and "N | ukr" rows are technically generated.

e.g. (FULL_NAME_RO) "Пашківці" => "Pashkovtsy", "Pashkivtsi"

Our goals:

ensure the "N | ukr" row can be identically generated from the "NS | ukr" row.
attempt to generate the "V | rus" row from the "NS | ukr" row.

Example 2

There are 3 "V" rows, 1 "N" row and 1 "NS" row. The 3 "V" rows and the 1 "N" row are generated, each of these 4 rows using a different transliteration system.

e.g. (FULL_NAME_RO) "Біласовиця" => "Bilasovitsy", "Bilasovytsya", "Belosovitsa", "Belasovitsa"

Our goals:

ensure the "N" and "V" rows can be identically generated from the "NS | ukr" row.
attempt to generate the "V | rus" row from the "NS | ukr" row.

Example 3

This example contains TRANSL_CD.

With TRANSL_CD:

(Using FULL_NAME_RO) The "NS | ukr" value "Кам’янка-Дніпровська" is generated into "N | ukr" "Kam”yanka-Dniprovs’ka" using the "ukr_Cyrl2Latn_BGN_1965" system.

The "VS | rus" value "Каменка-Днепровская" is generated into "V | rus" "Kamenka-Dneprovskaya" using the "rus_Cyrl2Latn_BGN_1947" system.
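The check described in these examples (transliterate the source row, then compare with the generated row) can be sketched as below. The transliterator is passed in as a callable so the harness does not depend on any particular library. `TOY_RUS_TABLE` is a deliberately partial, hypothetical table covering only the letters of this one name; it is not the real rus_Cyrl2Latn_BGN_1947 system. With interscript-ruby one would presumably pass something like `->(s) { Interscript.transliterate("bgnpcgn-rus-Cyrl-Latn-1947", s) }` instead, assuming that entry point.

```ruby
# Toy, partial table: only the letters needed for the example below.
# NOT the full rus_Cyrl2Latn_BGN_1947 system (it ignores context rules for "е").
TOY_RUS_TABLE = {
  "к" => "k", "а" => "a", "м" => "m", "е" => "e", "н" => "n", "д" => "d",
  "п" => "p", "р" => "r", "о" => "o", "в" => "v", "с" => "s", "я" => "ya"
}.freeze

# Hypothetical stand-in for a real transliteration call.
def toy_transliterate(text)
  text.each_char.map do |ch|
    latin = TOY_RUS_TABLE[ch.downcase]
    next ch unless latin                      # pass hyphens and unknowns through
    ch == ch.downcase ? latin : latin.capitalize
  end.join
end

# Compare the generated (N/V) row with what we produce from the source (NS/VS) row.
def check_pair(source_name, expected_roman, transliterate)
  generated = transliterate.call(source_name)
  { expected: expected_roman, generated: generated, match: generated == expected_roman }
end

p check_pair("Каменка-Днепровская", "Kamenka-Dneprovskaya", method(:toy_transliterate))
```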

Example 4

TRANSL_CD:

Similar to previous examples. Notice that the "N | ukr" row uses the ukr_Cyrl2Latn_GUP_1996 transliteration system.

Regarding transliteration system codes

In addition to the ukr_Cyrl2Latn_BGN_1965 and rus_Cyrl2Latn_BGN_1947 systems used throughout the file, there are 2 rows with TRANSL_CD of NOT_TRANSLITERATED, 1 row of ukr_Cyrl2Latn_GUP_1996, 10 rows of ukr_Cyrl2Latn_ALA_1997, 5 rows of tuk_Cyrl2Latn_BGN_1979.

I believe the system we have implemented for BGN/PCGN is ukr_Cyrl2Latn_BGN_1965.

We will have to implement the remaining systems to ensure the generated transliteration fulfills the requirements of this file.

Direction

We will have to:

Write a script to extract the important columns from this TSV file
Then test the language/script pairs against our transliteration systems.
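A minimal sketch of the extraction step, assuming the first line of the file is a tab-separated header and that the header spellings match the column names listed earlier in this issue. The sample data is made up for illustration, and `extract_columns` is a hypothetical helper.

```ruby
require "stringio"

# Column names as listed in this issue; the exact header spellings are assumed.
WANTED_COLUMNS = %w[NT LC FULL_NAME_RO FULL_NAME_RG NAME_LINK TRANSL_CD].freeze

# Reads a header line plus data lines and keeps only the wanted columns.
def extract_columns(io, wanted = WANTED_COLUMNS)
  header_line = io.gets
  return [] if header_line.nil?

  header = header_line.chomp.split("\t", -1)   # -1 keeps trailing empty fields
  io.each_line.map do |line|
    fields = line.chomp.split("\t", -1)
    header.zip(fields).to_h.slice(*wanted)
  end
end

# Made-up two-row sample standing in for the real file.
def sample_rows
  data = "NT\tLC\tFULL_NAME_RO\tNOTE\tTRANSL_CD\n" \
         "NS\tukr\tПашківці\tignored\t\n" \
         "N\tukr\tPashkivtsi\tignored\t\n"
  extract_columns(StringIO.new(data))
end

sample_rows.each { |row| p row }
```

Splitting on tabs by hand, rather than using a CSV parser, avoids problems with names that contain quote characters such as the ” in "Kam”yanka-Dniprovs’ka". For the real file, something like `File.open("files/up/up.txt", "r:bom|utf-8") { |f| extract_columns(f) }` should work.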

webdev778 commented 5 months ago

To solve this issue, I have implemented an extensible Ruby script using Interscript: https://github.com/interscript/geotest/ . Unfortunately, dealing with that file felt a lot like scraping, since there is a lot of incorrect data provided.

I have noticed a couple of things. Each row contains some useful fields:

UNI - it is a unique name entry (globally unique within a file)
UFI - it is a unique place identifier

Instead of using UFI, since a lot of entries in that file come from multiple languages and the LC (language) field is not consistent, I have decided to use NAME_LINK. As you have noted, this field creates a bidirectional connection most of the time, but this is not a rule. In some cases it links to a non-existent entry; in other cases it creates a cluster of up to 7 entries (in this particular database, up to 4). Most of the clusters have just 2 entries.

So, in our case, I decided to work with clusters of NAME_LINK/UFI and assume that each cluster contains 1 source row (NS, VS, DS) and a number of transliterated rows. This assumption wasn't 100% correct, but it is a starting point. It can be refined in the future; for that purpose I have made the script output a lot of statistics.
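The clustering idea can be sketched roughly as follows. This is not the geotest implementation; it is a small union-find over made-up rows, assuming that NAME_LINK holds the UNI of another row and that dangling links are simply ignored. `build_clusters`, `split_cluster` and the row hashes are hypothetical.

```ruby
SOURCE_NAME_TYPES = %w[NS VS DS].freeze

# Groups rows into connected components over UNI <-> NAME_LINK edges.
def build_clusters(rows)
  parent = rows.to_h { |r| [r[:uni], r[:uni]] }
  find = lambda do |x|
    x = parent[x] until parent[x] == x   # walk up to the cluster representative
    x
  end

  rows.each do |r|
    link = r[:name_link]
    next unless parent.key?(link)        # empty link or link to a missing entry
    a = find.call(r[:uni])
    b = find.call(link)
    parent[a] = b unless a == b
  end

  rows.group_by { |r| find.call(r[:uni]) }.values
end

# Separates the assumed source rows from the transliterated rows.
def split_cluster(cluster)
  sources, generated = cluster.partition { |r| SOURCE_NAME_TYPES.include?(r[:nt]) }
  { sources: sources, generated: generated }
end

rows = [
  { uni: "1", nt: "NS", name_link: "2" },
  { uni: "2", nt: "N",  name_link: "1" },
  { uni: "3", nt: "V",  name_link: "1" },
  { uni: "4", nt: "N",  name_link: "99" }  # dangling link
]
clusters = build_clusters(rows)
p clusters.map(&:size).tally               # cluster length => number of clusters
p split_cluster(clusters.first)[:sources].size
```

Clusters with zero or several source rows are exactly the cases where the one-source-row assumption breaks, so counting them separately is a cheap way to see how often that happens.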

As a future improvement: some of those databases (most notably the Ukrainian one) contain no information about the transliteration system (8400 clusters out of 28968 contain no transliteration info). For those we could try to deduce the system used with a feature of Interscript.
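One simple way to approach such a deduction, independent of whatever Interscript offers for this, is brute force: run every candidate system over the source name and keep the ones whose output equals the Roman name in the cluster. The sketch below uses two tiny, hypothetical tables (`toy_system_a`, `toy_system_b`) that cover only the letters of one name from Example 1; they are not real transliteration systems.

```ruby
# Hypothetical toy tables, just enough letters for "Пашківці".
TOY_SYSTEMS = {
  "toy_system_a" => { "п" => "p", "а" => "a", "ш" => "sh", "к" => "k",
                      "і" => "i", "в" => "v", "ц" => "ts" },
  "toy_system_b" => { "п" => "p", "а" => "a", "ш" => "š", "к" => "k",
                      "і" => "i", "в" => "v", "ц" => "c" }
}.freeze

# Applies a character table, preserving the case of the first letter of each mapping.
def apply_table(table, text)
  text.each_char.map do |ch|
    latin = table[ch.downcase]
    next ch unless latin
    ch == ch.downcase ? latin : latin.capitalize
  end.join
end

# Returns the codes of all candidate systems that reproduce roman_name exactly.
def matching_systems(source_name, roman_name, systems = TOY_SYSTEMS)
  systems.select { |_code, table| apply_table(table, source_name) == roman_name }.keys
end

p matching_systems("Пашківці", "Pashkivtsi")
p matching_systems("Пашківці", "Pashkovtsy")
```

An empty result, as for "Pashkovtsy", suggests the Roman name was not derived from this source name by any known candidate, which is what one would expect for a "V | rus" row paired with an "NS | ukr" row.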

The script can be run with a -v switch to display each and every failure, including both the original and the transliterated name entries.

(I will copy this post to an issue in GeoTest repository for tracking purposes and in the next post include a result from a run on the attached database).

webdev778 commented 5 months ago

interscript/geotest#1

# bundle exec ruby test.rb files/up/up.txt 
.....
0 records have a non-unique UNI (should be 0)

Out of 58000 related clusters we get 28968 unique related clusters
Unique clusters have 58000 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {3=>42, 2=>28915, 4=>11}

Transliteration systems used:
- "" * 86568 (37356 with a pair)
- "ukr_Cyrl2Latn_BGN_1965" * 19179 (18552 with a pair) implemented in Interscript as bgnpcgn-ukr-Cyrl-Latn-1965
- "rus_Cyrl2Latn_BGN_1947" * 8164 (1979 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947
- "NOT_TRANSLITERATED" * 137 (0 with a pair)
- "ukr_Cyrl2Latn_ALA_1997" * 28 (26 with a pair) implemented in Interscript as alalc-ukr-Cyrl-Latn-1997
- "bel_Cyrl2Latn_BGN_1979" * 13 (13 with a pair) implemented in Interscript as bgnpcgn-bel-Cyrl-Latn-1979
- "rus_Cyrl2Latn_ALA_1997" * 10 (9 with a pair) implemented in Interscript as alalc-rus-Cyrl-Latn-1997
- "tuk_Cyrl2Latn_BGN_1979" * 5 (0 with a pair)
- "ukr_Cyrl2Latn_GUP_1996" * 5 (5 with a pair) implemented in Interscript as ua-ukr-Cyrl-Latn-1996
- "ukr_Cyrl2Latn_ODNI_2005" * 2 (2 with a pair) implemented in Interscript as odni-ukr-Cyrl-Latn-2015
- "amh_Ethi2Latn_BGN_1967" * 2 (2 with a pair) implemented in Interscript as bgnpcgn-amh-Ethi-Latn-1967
- "hye_Armn2Latn_BGN_1981" * 1 (0 with a pair)
- "kat_Geor2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-kat-Geor-Latn-1997
- "urd_Arab2Latn_BGN_2007" * 1 (1 with a pair) implemented in Interscript as bgnpcgn-urd-Arab-Latn-2007
- "amh_Ethi2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-amh-Ethi-Latn-1997

Among the unique clusters:
- 0 clusters are too short
- 1 clusters contain no non-ASCII entries
- 8400 clusters contain no transliteration info
- 15 clusters contain more than 1 non-ASCII entries
- 0 clusters are transliterated with a map not present in Interscript
Remaining 20552 clusters seem to be usable

: 0/50 (0.0%) (Errors: No support in Interscript * 50)
ukr_Cyrl2Latn_BGN_1965: 18046/18515 (97.47%) (Errors: Incorrect transliteration * 385, Incorrect spacing or punctuation * 14, Incorrect punctuation * 60, Incorrect casing and punctuation * 2, Incorrect casing * 7, Incorrect casing and (spacing or punctuation) * 1)
bel_Cyrl2Latn_BGN_1979: 11/11 (100.0%)
rus_Cyrl2Latn_BGN_1947: 1806/1966 (91.86%) (Errors: Incorrect transliteration * 130, Incorrect casing and punctuation * 8, Incorrect spacing or punctuation * 5, Incorrect punctuation * 14, Incorrect casing * 3)
ukr_Cyrl2Latn_ALA_1997: 11/26 (42.31%) (Errors: Incorrect transliteration * 13, Incorrect punctuation * 2)
ukr_Cyrl2Latn_GUP_1996: 3/5 (60.0%) (Errors: Incorrect transliteration * 2)
rus_Cyrl2Latn_ALA_1997: 3/6 (50.0%) (Errors: Incorrect transliteration * 3)
ukr_Cyrl2Latn_ODNI_2005: 1/2 (50.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_BGN_1967: 0/2 (0.0%) (Errors: Incorrect transliteration * 2)
kat_Geor2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
urd_Arab2Latn_BGN_2007: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
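The error labels in the output above suggest that mismatches are bucketed by how far apart the two strings are. The geotest script may do this differently; the following is only a guess at the idea, comparing the expected and generated names under progressively looser normalizations. `classify_mismatch` is a hypothetical helper.

```ruby
# Returns nil for an exact match, otherwise the first (strictest) label whose
# normalization makes the two strings equal.
def classify_mismatch(expected, generated)
  return nil if expected == generated

  strip_punct = ->(s) { s.gsub(/\p{P}/, "") }        # Unicode punctuation only
  strip_both  = ->(s) { s.gsub(/[\p{P}\s]/, "") }    # punctuation and whitespace

  checks = {
    "Incorrect casing" => ->(s) { s.downcase },
    "Incorrect punctuation" => strip_punct,
    "Incorrect spacing or punctuation" => strip_both,
    "Incorrect casing and punctuation" => ->(s) { strip_punct.call(s).downcase },
    "Incorrect casing and (spacing or punctuation)" => ->(s) { strip_both.call(s).downcase }
  }

  label, = checks.find { |_label, norm| norm.call(expected) == norm.call(generated) }
  label || "Incorrect transliteration"
end

puts classify_mismatch("Kam”yanka-Dniprovs’ka", "Kam'yanka-Dniprovska")
puts classify_mismatch("Pashkivtsi", "Pashkovtsy")
```

Separating punctuation-only and casing-only differences from genuine letter differences makes it easier to see how many of the failures point at the transliteration map itself.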

webdev778 commented 5 months ago

I see there's only 92% accuracy for the rus_Cyrl2Latn_BGN_1947 map. Yet in the dataset of #52, this map has 98% accuracy.
