interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Test against GeoNames data (Russia) #52

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

Apart from the file attached below, the data has to be downloaded from the original site (http://geonames.nga.mil/gns/html/cntyfile/rs.zip):

rs_populatedplaces_p.txt.zip

These systems are used:

webdev778 commented 5 months ago

The linked file is no longer available.

webdev778 commented 5 months ago

https://github.com/interscript/geotest/issues/1

For the rs_populatedplaces_p.txt file, GeoTest outputs the following result:

# bundle exec ruby test.rb files/rs_populatedplaces_p.txt 
.....
0 records have a non-unique UNI (should be 0)

Out of 331214 related clusters we get 165598 unique related clusters
Unique clusters have 331214 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {2=>165578, 3=>14, 1=>3, 4=>2, 5=>1}
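
The histogram can be cross-checked against the two totals printed just before it. The following is a minimal sketch that uses only the figures from the output above; the variable names are illustrative and are not taken from GeoTest's `test.rb`.

```ruby
# Histogram copied from the GeoTest output: cluster length => number of clusters.
histogram = { 2 => 165_578, 3 => 14, 1 => 3, 4 => 2, 5 => 1 }

# The number of unique clusters is the sum of the per-length counts.
unique_clusters = histogram.values.sum

# The total membership weights each cluster length by how often it occurs.
total_members = histogram.sum { |length, count| length * count }

puts "unique clusters: #{unique_clusters}"
puts "total members:   #{total_members}"
```

Both values agree with the output: 165598 unique clusters holding 331214 members in total.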

Transliteration systems used:
- "" * 416664 (274164 with a pair)
- "rus_Cyrl2Latn_BGN_1947" * 22972 (20880 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947
- "NOT_TRANSLITERATED" * 558 (439 with a pair)
- "che_Cyrl2Latn_BGN_2007" * 532 (337 with a pair)
- "rus_Cyrl2Latn_GOST_1983" * 242 (23 with a pair) implemented in Interscript as gost-rus-Cyrl-Latn-16876-71-1983
- "ukr_Cyrl2Latn_BGN_1965" * 69 (3 with a pair) implemented in Interscript as bgnpcgn-ukr-Cyrl-Latn-1965
- "UNKNOWN" * 1 (0 with a pair)
- "not_transliterated" * 1 (1 with a pair)
- "bel_Cyrl2Latn_BGN_1979" * 1 (1 with a pair) implemented in Interscript as bgnpcgn-bel-Cyrl-Latn-1979
- "rus_Cyrl2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-rus-Cyrl-Latn-1997
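
The correspondence between GeoNames transliteration codes and Interscript system names listed above could be kept in a lookup table. This is only a sketch: the constant and the helper `interscript_system_for` are hypothetical names, and the table contains nothing beyond the five pairs given in the output.

```ruby
# GeoNames transliteration code => Interscript system name, as listed in the output.
GEONAMES_TO_INTERSCRIPT = {
  "rus_Cyrl2Latn_BGN_1947"  => "bgnpcgn-rus-Cyrl-Latn-1947",
  "rus_Cyrl2Latn_GOST_1983" => "gost-rus-Cyrl-Latn-16876-71-1983",
  "ukr_Cyrl2Latn_BGN_1965"  => "bgnpcgn-ukr-Cyrl-Latn-1965",
  "bel_Cyrl2Latn_BGN_1979"  => "bgnpcgn-bel-Cyrl-Latn-1979",
  "rus_Cyrl2Latn_ALA_1997"  => "alalc-rus-Cyrl-Latn-1997"
}.freeze

# Hypothetical helper: returns the Interscript system name, or nil when the
# code is empty, a "not transliterated" marker, or has no map in the list.
def interscript_system_for(code)
  GEONAMES_TO_INTERSCRIPT[code]
end

p interscript_system_for("ukr_Cyrl2Latn_BGN_1965")
p interscript_system_for("che_Cyrl2Latn_BGN_2007")
```

The first call returns the Ukrainian BGN/PCGN system name. The second returns `nil`, because the output lists no Interscript map for `che_Cyrl2Latn_BGN_2007`.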

Among the unique clusters:
- 3 clusters are too short
- 1 clusters contain no non-ASCII entries
- 144294 clusters contain no transliteration info
- 3 clusters contain more than 1 non-ASCII entries
- 421 clusters are transliterated with a map not present in Interscript
Remaining 20876 clusters seem to be usable
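
The "usable" figure follows from the exclusion counts if each excluded cluster is counted in exactly one category. That is an assumption, not something the output states, although the numbers are consistent with it. A small sketch with illustrative names:

```ruby
# Unique clusters reported earlier in the output.
unique_clusters = 165_598

# Exclusion counts copied from the list above.
# Assumption: the categories do not overlap.
excluded = {
  too_short:              3,
  no_non_ascii_entry:     1,
  no_transliteration:     144_294,
  multiple_non_ascii:     3,
  map_not_in_interscript: 421
}

usable = unique_clusters - excluded.values.sum
puts "usable clusters: #{usable}"
```

Under that assumption the subtraction gives 20876, matching the reported number of usable clusters.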

rus_Cyrl2Latn_BGN_1947: 20614/20845 (98.89%) (Errors: Incorrect punctuation * 119, Incorrect transliteration * 111, Incorrect spacing or punctuation * 1)
rus_Cyrl2Latn_GOST_1983: 11/23 (47.83%) (Errors: Incorrect transliteration * 9, Incorrect punctuation * 3)
: 0/9 (0.0%) (Errors: No support in Interscript * 9)
ukr_Cyrl2Latn_BGN_1965: 3/3 (100.0%)
bel_Cyrl2Latn_BGN_1979: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
rus_Cyrl2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
che_Cyrl2Latn_BGN_2007: 0/1 (0.0%) (Errors: No support in Interscript * 1)
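
The percentages in these result lines can be recomputed from the pass and total counts. The sketch below copies the counts from the output; the `percent` helper and the overall figure are illustrative additions that GeoTest does not print.

```ruby
# [system, passed, total] copied from the result lines above.
# The line with an empty system name is kept as "".
results = [
  ["rus_Cyrl2Latn_BGN_1947",  20_614, 20_845],
  ["rus_Cyrl2Latn_GOST_1983", 11,     23],
  ["",                        0,      9],
  ["ukr_Cyrl2Latn_BGN_1965",  3,      3],
  ["bel_Cyrl2Latn_BGN_1979",  0,      1],
  ["rus_Cyrl2Latn_ALA_1997",  0,      1],
  ["che_Cyrl2Latn_BGN_2007",  0,      1]
]

# Pass rate as a percentage, rounded to two decimal places.
def percent(passed, total)
  (100.0 * passed / total).round(2)
end

results.each do |system, passed, total|
  label = system.empty? ? "(no system)" : system
  puts format("%-24s %5d/%-5d %6.2f%%", label, passed, total, percent(passed, total))
end

# Derived overall figure across all result lines.
total_passed = results.sum { |_, passed, _| passed }
total_tested = results.sum { |_, _, total| total }
puts "overall: #{total_passed}/#{total_tested} (#{percent(total_passed, total_tested)}%)"
```

The failures for `rus_Cyrl2Latn_BGN_1947` (20845 − 20614 = 231) equal the sum of its listed error counts (119 + 111 + 1), and the same holds for `rus_Cyrl2Latn_GOST_1983` (12 = 9 + 3).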