I have implemented an extensible Ruby script using Interscript: https://github.com/interscript/geotest/ . Unfortunately, dealing with that file felt a lot like scraping, since a lot of the provided data is incorrect.
I have noticed a couple of things. Each row contains some useful fields:
- UNI: a unique name entry identifier (globally unique within a file)
- UFI: a unique place identifier
Instead of using UFI, I decided to use NAME_LINK, since many entries in that file come from multiple languages and the LC (language) field is not consistent. As you have noted, this field usually creates a bidirectional link, but that is not a rule: in some cases it points to a non-existent entry, and in other cases it forms a cluster of up to 7 entries (up to 4 in this particular database). Most clusters contain just 2 entries.
So, in our case, I decided to work with clusters of NAME_LINK/UFI and to assume that each cluster contains 1 source row (NS, VS, DS) and a number of transliterated rows. This assumption isn't 100% correct, but it is a starting point and can be refined in the future; to support that, I have made the script output a lot of statistics.
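The clustering step above can be sketched as follows. This is a minimal illustration, not the actual geotest implementation; the row structure and the `:uni`/`:name_link` field names are assumptions. Because NAME_LINK is not reliably bidirectional, clusters are computed as connected components of the undirected link graph, and dangling links are skipped.

```ruby
# Sketch: group rows into NAME_LINK clusters (connected components).
# Row shape { uni: ..., name_link: ... } is a simplifying assumption.
def build_clusters(rows)
  by_uni = rows.to_h { |r| [r[:uni], r] }

  # Treat every NAME_LINK edge as undirected, ignoring links that
  # point to rows which do not exist in the file.
  adj = Hash.new { |h, k| h[k] = [] }
  rows.each do |r|
    link = r[:name_link]
    next unless link && by_uni.key?(link)
    adj[r[:uni]] << link
    adj[link] << r[:uni]
  end

  seen = {}
  clusters = []
  rows.each do |r|
    next if seen[r[:uni]]
    # Breadth-first walk collects one connected component.
    queue = [r[:uni]]
    component = []
    until queue.empty?
      uni = queue.shift
      next if seen[uni]
      seen[uni] = true
      component << by_uni[uni]
      queue.concat(adj[uni])
    end
    clusters << component
  end
  clusters
end

rows = [
  { uni: 1, name_link: 2 },   # bidirectional pair with 2
  { uni: 2, name_link: 1 },
  { uni: 3, name_link: 99 },  # dangling link: stays a singleton
]
build_clusters(rows).map(&:size)
```

With this shape, a cluster of size 1 corresponds to an entry whose link could not be resolved, which is one of the statistics worth reporting.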
As a future improvement: some of those databases (most notably the Ukrainian one) contain no information about the transliteration system (8400 of 28968 clusters have none). For those we could try to deduce the system used via a feature of Interscript.
The script can be run with a `-v` switch to display each and every failure, including both the original and the transliterated name entries.
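One way the deduction could work: transliterate the source name with each candidate system and pick the system whose output is closest (by edit distance) to the romanized entry already in the cluster. In the sketch below, the `transliterate` callable stands in for a real Interscript call; the `TOY` maps and the system codes `"sys-a"`/`"sys-b"` are purely illustrative, not real Interscript map names.

```ruby
# Classic Levenshtein edit distance (dynamic programming).
def levenshtein(a, b)
  d = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (1..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1,
                 d[i][j - 1] + 1,
                 d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# Pick the candidate system whose transliteration of the source name
# best matches the romanized name found in the same cluster.
def deduce_system(source_name, romanized_name, candidates, transliterate)
  candidates.min_by do |system|
    levenshtein(transliterate.call(system, source_name), romanized_name)
  end
end

# Toy transliterator: two fake systems that differ in how they map "и".
TOY = lambda do |system, text|
  table = system == "sys-a" ? { "и" => "y" } : { "и" => "i" }
  text.chars.map { |c| table.fetch(c, c) }.join
end

deduce_system("ми", "my", %w[sys-a sys-b], TOY)
```

In the real script the callable would wrap Interscript itself, and ties or large distances would be reported as "undetermined" rather than forced to a single system.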
The task is to:
- provide "correct" transliteration system codes for all the names in these data
- make the reports "per-country", so that the responsible party (which is organized per-country) can review/correct the data

Concretely:
- for every row in each country's data, validate the transliteration system and, if it is incorrect or empty, detect the few closest transliteration systems used
- for each country's data, provide a CSV file containing the "corrections" that need to be made and the reason for each
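The per-country CSV output could look like the following sketch, using Ruby's standard `csv` library. The column names, the `:country`/`:suggested`/`:valid` fields, and the idea of returning one CSV string per country (rather than writing files) are all assumptions made for illustration.

```ruby
require "csv"

# Build one corrections CSV per country from rows that already carry
# a validation verdict (:valid) and a suggested system (:suggested).
def corrections_csv(rows)
  rows.group_by { |r| r[:country] }.transform_values do |country_rows|
    CSV.generate do |csv|
      csv << %w[UNI NT system_current system_suggested reason]
      country_rows.each do |r|
        next if r[:valid] # only rows needing correction are reported
        csv << [r[:uni], r[:nt], r[:system], r[:suggested], r[:reason]]
      end
    end
  end
end

reports = corrections_csv([
  { country: "UA", uni: 1, nt: "NS", system: "",
    suggested: "example-sys", reason: "empty system code", valid: false },
  { country: "UA", uni: 2, valid: true },
])
```

A real implementation would write each value to a `<country>.csv` file so the per-country reviewers each receive only their own slice.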
Originally posted by @webdev778 in https://github.com/interscript/interscript-ruby/issues/38#issuecomment-1923357020