interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Test against GeoNames data (Ukrainian) #38

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

The Ukraine GeoNames data can be found here (originally downloaded from http://geonames.nga.mil/gns/html/namefiles.html):

up.zip

These are TSV (tab-separated values) files encoded in UTF-8 format. I couldn't figure out how to read them in Excel but Pages opened them happily, so here's a screenshot from Pages:

Screen Shot 2020-01-05 at 6 25 04 AM

The schema for these files is provided here: http://geonames.nga.mil/gns/html/gis_countryfiles.html

Specifically, the columns relevant to us are:

Screen Shot 2020-01-12 at 4 02 41 PM Screen Shot 2020-01-05 at 6 39 04 AM

In summary:

Example 1

Screen Shot 2020-01-05 at 6 41 37 AM

e.g. (FULL_NAME_RO) "Пашківці" => "Pashkovtsy", "Pashkivtsi"

Our goals:

Example 2

Screen Shot 2020-01-05 at 6 43 17 AM

There are 3 "V" rows, 1 "N" row and 1 "NS" row. The 3 "V" rows and the 1 "N" row are generated, each of these 4 rows by a different transliteration system.

e.g. (FULL_NAME_RO) "Біласовиця" => "Bilasovitsy", "Bilasovytsya", "Belosovitsa", "Belasovitsa"

Our goals:

Example 3

This example contains TRANSL_CD.

Screen Shot 2020-01-05 at 6 47 26 AM

With TRANSL_CD:

Screen Shot 2020-01-05 at 6 47 45 AM

(Using FULL_NAME_RO) The "NS | ukr" value "Кам’янка-Дніпровська" is generated into "N | ukr" "Kam”yanka-Dniprovs’ka" using the "ukr_Cyrl2Latn_BGN_1965" system.

The "VS | rus" value "Каменка-Днепровская" is generated into "V | rus" "Kamenka-Dneprovskaya" using the "rus_Cyrl2Latn_BGN_1947" system.

Example 4

Screen Shot 2020-01-05 at 6 55 10 AM

TRANSL_CD:

Screen Shot 2020-01-05 at 6 55 18 AM

Similar to previous examples. Notice that the "N | ukr" row uses the ukr_Cyrl2Latn_GUP_1996 transliteration system.

Regarding transliteration system codes

In addition to the ukr_Cyrl2Latn_BGN_1965 and rus_Cyrl2Latn_BGN_1947 systems used throughout the file, there are 2 rows with a TRANSL_CD of NOT_TRANSLITERATED, 1 row of ukr_Cyrl2Latn_GUP_1996, 10 rows of ukr_Cyrl2Latn_ALA_1997, and 5 rows of tuk_Cyrl2Latn_BGN_1979.

I believe the system we have implemented for BGN/PCGN is ukr_Cyrl2Latn_BGN_1965.

We will have to implement the remaining systems to ensure the generated transliteration fulfills the requirements of this file.

Direction

We will have to:

  1. Write a script to extract out the important columns from this TSV file
  2. Then test the language/script pairs against our transliteration systems.
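
Step 1 above could be sketched roughly as follows. This is only a minimal sketch: UFI, UNI, NAME_LINK, FULL_NAME_RO and TRANSL_CD are column names mentioned in this thread, while `NT` and `LC` are assumed header names for the name type ("N", "NS", "V", ...) and language code ("ukr", "rus"). The method name `extract_columns` is hypothetical.

```ruby
# Columns to keep. UFI, UNI, NAME_LINK, FULL_NAME_RO and TRANSL_CD are named
# in this thread; NT and LC are ASSUMED header names for name type and language.
WANTED_COLUMNS = %w[UFI UNI NT LC FULL_NAME_RO TRANSL_CD NAME_LINK].freeze

# Parses tab-separated text whose first line is a header row and returns an
# array of hashes restricted to the wanted columns. Columns missing from the
# file come back as nil.
def extract_columns(tsv_text, wanted = WANTED_COLUMNS)
  lines = tsv_text.each_line.map(&:chomp).reject(&:empty?)
  return [] if lines.empty?

  header = lines.shift.split("\t", -1)
  lines.map do |line|
    # -1 keeps trailing empty fields, e.g. an empty TRANSL_CD at end of line
    row = header.zip(line.split("\t", -1)).to_h
    wanted.to_h { |name| [name, row[name]] }
  end
end

# Tiny inline sample instead of the real file (values are illustrative only).
sample = "UFI\tNT\tLC\tFULL_NAME_RO\tTRANSL_CD\n" \
         "1\tNS\tukr\tПашківці\t\n" \
         "1\tN\tukr\tPashkivtsi\tukr_Cyrl2Latn_BGN_1965\n"
extract_columns(sample).each { |row| p row }
```

For the real data one would pass `File.read("files/up/up.txt", encoding: "UTF-8")` instead of the inline sample. Plain `split("\t", -1)` is used rather than a CSV parser so that quote characters inside names are left untouched.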

webdev778 commented 5 months ago

To solve this issue, I have implemented an extensible Ruby script using Interscript: https://github.com/interscript/geotest/. Unfortunately, dealing with that file felt a lot like scraping, since much of the data it provides is incorrect.

I have noticed a couple of things. Each row contains some useful fields:

So, in our case, I decided to work with clusters of NAME_LINK/UFI and assume that each cluster contains 1 source row (NS, VS, DS) and a number of transliterated rows. This assumption wasn't 100% correct, but it is a starting point. It can be refined in the future; for that purpose, the script outputs a lot of statistics.
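
To illustrate the clustering assumption, here is a minimal sketch and not the actual geotest code. The helper names (`split_cluster`, `build_clusters`), the `NT` column name, and the default grouping key (UFI only) are all assumptions; the real script may well group differently.

```ruby
# Name types treated as "source" (non-romanized) rows, as listed above.
SOURCE_TYPES = %w[NS VS DS].freeze

# Splits one cluster into its single source row and its transliterated rows.
# Returns nil when the "exactly one source row" assumption does not hold.
def split_cluster(members)
  sources, targets = members.partition { |row| SOURCE_TYPES.include?(row["NT"]) }
  return nil unless sources.size == 1 && !targets.empty?

  { source: sources.first, targets: targets }
end

# Groups rows by a key (ASSUMED here to be UFI) and counts the clusters that
# break the assumption, so the statistics can guide later refinement.
def build_clusters(rows, key: ->(row) { row["UFI"] })
  usable = []
  rejected = 0
  rows.group_by(&key).each_value do |members|
    cluster = split_cluster(members)
    if cluster
      usable << cluster
    else
      rejected += 1
    end
  end
  { usable: usable, rejected: rejected }
end

rows = [
  { "UFI" => "1", "NT" => "NS", "FULL_NAME_RO" => "Пашківці" },
  { "UFI" => "1", "NT" => "N",  "FULL_NAME_RO" => "Pashkivtsi" },
  { "UFI" => "2", "NT" => "N",  "FULL_NAME_RO" => "Kyiv" } # no source row
]
result = build_clusters(rows)
puts "usable: #{result[:usable].size}, rejected: #{result[:rejected]}"
```

Passing the key as a lambda keeps the grouping rule easy to swap once the statistics show where the assumption fails.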

As a future improvement: some of those databases (most notably the Ukrainian one) contain no information about the transliteration system (8400 clusters out of 28968 contain no transliteration info). For those, we could try to deduce the system used with a feature of Interscript.
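
The idea of deducing the system could look something like the sketch below. It deliberately does not call Interscript: the candidate transliterators are passed in as plain callables, and the two lambdas in the usage part are toy stand-ins, not real maps. `deduce_systems` is a hypothetical name.

```ruby
# Returns the names of all candidate systems whose output for `source`
# exactly equals `target`. `candidates` maps a system name to any callable
# that takes a string and returns its transliteration.
def deduce_systems(source, target, candidates)
  candidates.select { |_name, transliterate| transliterate.call(source) == target }.keys
end

# Toy stand-ins only; in practice each callable would wrap a real map.
toy_candidates = {
  "toy_upcase"  => ->(s) { s.upcase },
  "toy_reverse" => ->(s) { s.reverse }
}
p deduce_systems("abc", "ABC", toy_candidates)
```

An empty result means no candidate reproduces the name exactly; more than one match means the pair alone cannot tell the systems apart.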

The script can be run with a -v switch to display each and every failure, including both the original and the transliterated name entries.

(I will copy this post to an issue in GeoTest repository for tracking purposes and in the next post include a result from a run on the attached database).

webdev778 commented 5 months ago

interscript/geotest#1

# bundle exec ruby test.rb files/up/up.txt 
.....
0 records have a non-unique UNI (should be 0)

Out of 58000 related clusters we get 28968 unique related clusters
Unique clusters have 58000 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {3=>42, 2=>28915, 4=>11}

Transliteration systems used:
- "" * 86568 (37356 with a pair)
- "ukr_Cyrl2Latn_BGN_1965" * 19179 (18552 with a pair) implemented in Interscript as bgnpcgn-ukr-Cyrl-Latn-1965
- "rus_Cyrl2Latn_BGN_1947" * 8164 (1979 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947
- "NOT_TRANSLITERATED" * 137 (0 with a pair)
- "ukr_Cyrl2Latn_ALA_1997" * 28 (26 with a pair) implemented in Interscript as alalc-ukr-Cyrl-Latn-1997
- "bel_Cyrl2Latn_BGN_1979" * 13 (13 with a pair) implemented in Interscript as bgnpcgn-bel-Cyrl-Latn-1979
- "rus_Cyrl2Latn_ALA_1997" * 10 (9 with a pair) implemented in Interscript as alalc-rus-Cyrl-Latn-1997
- "tuk_Cyrl2Latn_BGN_1979" * 5 (0 with a pair)
- "ukr_Cyrl2Latn_GUP_1996" * 5 (5 with a pair) implemented in Interscript as ua-ukr-Cyrl-Latn-1996
- "ukr_Cyrl2Latn_ODNI_2005" * 2 (2 with a pair) implemented in Interscript as odni-ukr-Cyrl-Latn-2015
- "amh_Ethi2Latn_BGN_1967" * 2 (2 with a pair) implemented in Interscript as bgnpcgn-amh-Ethi-Latn-1967
- "hye_Armn2Latn_BGN_1981" * 1 (0 with a pair)
- "kat_Geor2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-kat-Geor-Latn-1997
- "urd_Arab2Latn_BGN_2007" * 1 (1 with a pair) implemented in Interscript as bgnpcgn-urd-Arab-Latn-2007
- "amh_Ethi2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-amh-Ethi-Latn-1997

Among the unique clusters:
- 0 clusters are too short
- 1 clusters contain no non-ASCII entries
- 8400 clusters contain no transliteration info
- 15 clusters contain more than 1 non-ASCII entries
- 0 clusters are transliterated with a map not present in Interscript
Remaining 20552 clusters seem to be usable

: 0/50 (0.0%) (Errors: No support in Interscript * 50)
ukr_Cyrl2Latn_BGN_1965: 18046/18515 (97.47%) (Errors: Incorrect transliteration * 385, Incorrect spacing or punctuation * 14, Incorrect punctuation * 60, Incorrect casing and punctuation * 2, Incorrect casing * 7, Incorrect casing and (spacing or punctuation) * 1)
bel_Cyrl2Latn_BGN_1979: 11/11 (100.0%)
rus_Cyrl2Latn_BGN_1947: 1806/1966 (91.86%) (Errors: Incorrect transliteration * 130, Incorrect casing and punctuation * 8, Incorrect spacing or punctuation * 5, Incorrect punctuation * 14, Incorrect casing * 3)
ukr_Cyrl2Latn_ALA_1997: 11/26 (42.31%) (Errors: Incorrect transliteration * 13, Incorrect punctuation * 2)
ukr_Cyrl2Latn_GUP_1996: 3/5 (60.0%) (Errors: Incorrect transliteration * 2)
rus_Cyrl2Latn_ALA_1997: 3/6 (50.0%) (Errors: Incorrect transliteration * 3)
ukr_Cyrl2Latn_ODNI_2005: 1/2 (50.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_BGN_1967: 0/2 (0.0%) (Errors: Incorrect transliteration * 2)
kat_Geor2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
urd_Arab2Latn_BGN_2007: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
amh_Ethi2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
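
For readers wondering how the error labels in the output above could be derived, here is one possible classification, written as a guess and not taken from the geotest source. The category strings are copied from the output; the order of the checks and the method name `classify_mismatch` are assumptions.

```ruby
# Compares the expected (database) name with the generated one and returns
# nil on an exact match, otherwise the mildest category that explains the
# difference. The check order is an ASSUMPTION about how such labels arise.
def classify_mismatch(expected, actual)
  return nil if expected == actual

  fold           = ->(s) { s.downcase }
  no_punct       = ->(s) { s.gsub(/[[:punct:]]/, "") }
  no_punct_space = ->(s) { s.gsub(/[[:punct:][:space:]]/, "") }
  same           = ->(f) { f.call(expected) == f.call(actual) }

  return "Incorrect casing" if same.(fold)
  return "Incorrect punctuation" if same.(no_punct)
  return "Incorrect spacing or punctuation" if same.(no_punct_space)
  return "Incorrect casing and punctuation" if same.(->(s) { fold.(no_punct.(s)) })
  return "Incorrect casing and (spacing or punctuation)" if same.(->(s) { fold.(no_punct_space.(s)) })

  "Incorrect transliteration"
end

puts classify_mismatch("Pashkivtsi", "Pashkovtsy")
```

Checking the mildest difference first means a name that differs only in an apostrophe is not counted as a wrong transliteration.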

webdev778 commented 5 months ago

I see there's only 92% accuracy for the rus_Cyrl2Latn_BGN_1947 map. Yet in the dataset of #52, this map has 98% accuracy.