interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Test against GeoNames data (Japan) #50

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

ja.zip

Uses these systems:

webdev778 commented 8 months ago

https://github.com/interscript/geotest/issues/1

For this database, GeoTest outputs the following result:

# bundle exec ruby test.rb files/ja/ja.txt
.....
0 records have a non-unique UNI (should be 0)

Out of 218205 related clusters we get 72883 unique related clusters
Unique clusters have 218205 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {4=>33174, 2=>33703, 3=>5946, 1=>15, 6=>17, 5=>24, 7=>4}

Transliteration systems used:
- "" * 195555 (147590 with a pair)
- "jpn_Hrkt2Latn_BGN_1930" * 93505 (69392 with a pair)
- "jpn_Hrkt2Latn_GJP_1954" * 1481 (65 with a pair)
- "rus_Cyrl2Latn_BGN_1947" * 1106 (858 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947
- "NOT_TRANSLITERATED" * 174 (4 with a pair)
- "jpn_Hrkt2Latn_ALA_1997" * 42 (22 with a pair)
- "jav_Java2Latn_ALA_1997" * 17 (6 with a pair)
- "fas_Arab2Latn_AMMI_1959" * 1 (1 with a pair)
- "amh_Ethi2Latn_BGN_1967" * 1 (0 with a pair) implemented in Interscript as bgnpcgn-amh-Ethi-Latn-1967
- "hin_Deva2Latn_ALA_1997" * 1 (1 with a pair) implemented in Interscript as alalc-hin-Deva-Latn-1997

Among the unique clusters:
- 15 clusters are too short
- 2 clusters contain no non-ASCII entries
- 2544 clusters contain no transliteration info
- 36463 clusters contain more than 1 non-ASCII entries
- 33005 clusters are transliterated with a map not present in Interscript
Remaining 854 clusters seem to be usable

rus_Cyrl2Latn_BGN_1947: 841/853 (98.59%) (Errors: Incorrect punctuation * 5, Incorrect transliteration * 7)
hin_Deva2Latn_ALA_1997: 0/1 (0.0%) (Errors: Incorrect transliteration * 1)
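
The figures in this report can be cross-checked against each other. The following is a minimal sketch in plain Ruby, not part of GeoTest itself: it copies the numbers from the output above and checks that the cluster-length histogram, the breakdown of unique clusters, and the per-system totals agree. All variable names and the `pass_rate` helper are made up for illustration.

```ruby
# Figures copied by hand from the GeoTest output above.
unique_clusters  = 72_883
related_clusters = 218_205

# Cluster length => number of unique clusters of that length.
length_histogram = { 4 => 33_174, 2 => 33_703, 3 => 5_946, 1 => 15, 6 => 17, 5 => 24, 7 => 4 }

# Breakdown of the unique clusters (labels are illustrative, not GeoTest identifiers).
breakdown = {
  too_short: 15,
  no_non_ascii: 2,
  no_transliteration_info: 2_544,
  multiple_non_ascii: 36_463,
  map_not_in_interscript: 33_005,
  usable: 854
}

# Per-system results as [passed, total].
results = {
  "rus_Cyrl2Latn_BGN_1947" => [841, 853],
  "hin_Deva2Latn_ALA_1997" => [0, 1]
}

# Hypothetical helper: pass rate as a percentage, rounded to two decimals.
def pass_rate(passed, total)
  (100.0 * passed / total).round(2)
end

histogram_clusters = length_histogram.values.sum
histogram_members  = length_histogram.sum { |length, count| length * count }
breakdown_total    = breakdown.values.sum
tested_total       = results.values.sum { |_passed, total| total }

puts "histogram covers all unique clusters: #{histogram_clusters == unique_clusters}"
puts "histogram members match related clusters: #{histogram_members == related_clusters}"
puts "breakdown adds up to unique clusters: #{breakdown_total == unique_clusters}"
puts "tested clusters match usable clusters: #{tested_total == breakdown[:usable]}"

results.each do |system, (passed, total)|
  puts "#{system}: #{passed}/#{total} (#{pass_rate(passed, total)}%)"
end
```

With the numbers above, all four checks print `true`. The breakdown adding up exactly to the number of unique clusters suggests each cluster is counted in only one category, though the output alone does not confirm that.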