interscript / interscript-ruby

Interoperable script conversion systems (ISCS) with the `interscript` gem

Test against GeoNames data (North Korea) #49

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

kn.zip

Only uses these systems:

(Same as #48)

webdev778 commented 8 months ago

https://github.com/interscript/geotest/issues/1

For this database, GeoTest outputs the following result:


```
# bundle exec ruby test.rb files/kn/kn.txt
.....
0 records have a non-unique UNI (should be 0)

Out of 34883 related clusters we get 17425 unique related clusters
Unique clusters have 34883 members in total (this should match a number of related clusters)
Hash of cluster length to a number of clusters of that kind: {2=>17406, 3=>7, 4=>11, 6=>1}

Transliteration systems used:
- "" * 74776 (17773 with a pair)
- "kor_Hang2Latn_MR_1939" * 18217 (16229 with a pair) implemented in Interscript as bgn-kor-Hang-Latn-1943
- "kor_Hang2Latn_MOCT_2000" * 7907 (84 with a pair) implemented in Interscript as moct-kor-Hang-Latn-2000
- "zho_Hani2Latn_GCH_1979" * 4 (3 with a pair) implemented in Interscript as sac-zho-Hans-Latn-1979
- "zho_Hani2Latn_WDG_1979" * 2 (0 with a pair) implemented in Interscript as var-zho-Hani-Latn-wd-1979
- "rus_Cyrl2Latn_BGN_1947" * 1 (1 with a pair) implemented in Interscript as bgnpcgn-rus-Cyrl-Latn-1947

Among the unique clusters:
- 0 clusters are too short
- 1 clusters contain no non-ASCII entries
- 1128 clusters contain no transliteration info
- 15 clusters contain more than 1 non-ASCII entries
- 0 clusters are transliterated with a map not present in Interscript
Remaining 16281 clusters seem to be usable

kor_Hang2Latn_MR_1939: 3859/16187 (23.84%) (Errors: Incorrect punctuation * 11269, Incorrect transliteration * 967, Incorrect casing and (spacing or punctuation) * 67, Incorrect casing and punctuation * 5, Incorrect spacing or punctuation * 20)
kor_Hang2Latn_MOCT_2000: 31/84 (36.9%) (Errors: Incorrect transliteration * 23, Incorrect punctuation * 29, Incorrect spacing or punctuation * 1)
zho_Hani2Latn_GCH_1979: 0/3 (0.0%) (Errors: Incorrect casing and (spacing or punctuation) * 3)
rus_Cyrl2Latn_BGN_1947: 1/1 (100.0%)
: 0/10 (0.0%) (Errors: No support in Interscript * 10)
```
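
For anyone cross-checking these numbers, here is a minimal Ruby sketch that re-derives the totals and pass rates from the figures reported above. It is not part of GeoTest. The variable names and the `pass_rate` helper are hypothetical, and the identities it checks (for example, that usable clusters equal unique clusters minus the excluded ones) are my reading of the output rather than documented GeoTest behaviour.

```ruby
# Figures copied by hand from the GeoTest output above.
related_clusters = 34_883
unique_clusters  = 17_425
clusters_by_size = { 2 => 17_406, 3 => 7, 4 => 11, 6 => 1 }

# Clusters excluded before testing, in the order they are reported.
excluded = {
  too_short: 0,
  no_non_ascii: 1,
  no_transliteration_info: 1_128,
  multiple_non_ascii: 15,
  map_not_in_interscript: 0
}
usable_clusters = 16_281

# Per-system results: passed pairs, tested pairs, and the reported error counts.
results = {
  "kor_Hang2Latn_MR_1939"   => { passed: 3_859, total: 16_187, errors: [11_269, 967, 67, 5, 20] },
  "kor_Hang2Latn_MOCT_2000" => { passed: 31,    total: 84,     errors: [23, 29, 1] },
  "zho_Hani2Latn_GCH_1979"  => { passed: 0,     total: 3,      errors: [3] }
}

# Hypothetical helper: pass rate as a percentage rounded to two decimals.
def pass_rate(passed, total)
  (100.0 * passed / total).round(2)
end

# Assumed identities; each one is inferred from the wording of the output.
checks = {
  cluster_count: clusters_by_size.values.sum == unique_clusters,
  member_count:  clusters_by_size.sum { |size, count| size * count } == related_clusters,
  usable_count:  unique_clusters - excluded.values.sum == usable_clusters,
  error_totals:  results.values.all? { |r| r[:passed] + r[:errors].sum == r[:total] }
}

checks.each { |name, ok| puts "#{name}: #{ok ? 'consistent' : 'MISMATCH'}" }
results.each { |system, r| puts "#{system}: #{pass_rate(r[:passed], r[:total])}%" }
```

All four checks hold for the numbers in this report, so the failures listed per system account for every tested pair that did not pass.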