I have implemented an extensible Ruby script using Interscript: https://github.com/interscript/geotest/ . Unfortunately, dealing with that file felt a lot like scraping, since a lot of the provided data is incorrect.
I have noticed a couple of things. Each row contains some useful fields:
- UNI: a unique name entry identifier (globally unique within a file)
- UFI: a unique place identifier
Instead of using UFI, I decided to use NAME_LINK, since many entries in that file come from multiple languages and the LC (language) field is not consistent. As you have noted, this field usually creates a bidirectional link, but that is not a rule: in some cases it points to a non-existent entry, and in other cases it forms a cluster of up to 7 entries (up to 4 in this particular database). Most clusters contain just 2 entries.
So, in our case, I decided to work with clusters of NAME_LINK/UFI and to assume that each cluster contains 1 source row (NS, VS, DS) and a number of transliterated rows. This assumption isn't 100% correct, but it is a starting point and can be refined in the future; to support that, I have made the script output a lot of statistics.
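The clustering step above can be sketched as follows. This is a minimal illustration, not the actual geotest implementation; the row structure and the `:uni`/`:name_link` field names are assumptions. Because NAME_LINK is not reliably bidirectional, clusters are computed as connected components of the undirected link graph, and dangling links are skipped.

```ruby
# Sketch: group rows into NAME_LINK clusters (connected components).
# Row shape { uni: ..., name_link: ... } is a simplifying assumption.
def build_clusters(rows)
  by_uni = rows.to_h { |r| [r[:uni], r] }

  # Treat every NAME_LINK edge as undirected, ignoring links that
  # point to rows which do not exist in the file.
  adj = Hash.new { |h, k| h[k] = [] }
  rows.each do |r|
    link = r[:name_link]
    next unless link && by_uni.key?(link)
    adj[r[:uni]] << link
    adj[link] << r[:uni]
  end

  seen = {}
  clusters = []
  rows.each do |r|
    next if seen[r[:uni]]
    # Breadth-first walk collects one connected component.
    queue = [r[:uni]]
    component = []
    until queue.empty?
      uni = queue.shift
      next if seen[uni]
      seen[uni] = true
      component << by_uni[uni]
      queue.concat(adj[uni])
    end
    clusters << component
  end
  clusters
end

rows = [
  { uni: 1, name_link: 2 },   # bidirectional pair with 2
  { uni: 2, name_link: 1 },
  { uni: 3, name_link: 99 },  # dangling link: stays a singleton
]
build_clusters(rows).map(&:size)
```

With this shape, a cluster of size 1 corresponds to an entry whose link could not be resolved, which is one of the statistics worth reporting.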
As a future improvement: some of those databases (most notably the Ukrainian one) contain no information about the transliteration system (8400 of 28968 clusters have none). For those we could try to deduce the system used via a feature of Interscript.
The script can be run with a `-v` switch to display each and every failure, including both the original and the transliterated name entries.
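One way the deduction could work: transliterate the source name with each candidate system and pick the system whose output is closest (by edit distance) to the romanized entry already in the cluster. In the sketch below, the `transliterate` callable stands in for a real Interscript call; the `TOY` maps and the system codes `"sys-a"`/`"sys-b"` are purely illustrative, not real Interscript map names.

```ruby
# Classic Levenshtein edit distance (dynamic programming).
def levenshtein(a, b)
  d = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (1..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1,
                 d[i][j - 1] + 1,
                 d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# Pick the candidate system whose transliteration of the source name
# best matches the romanized name found in the same cluster.
def deduce_system(source_name, romanized_name, candidates, transliterate)
  candidates.min_by do |system|
    levenshtein(transliterate.call(system, source_name), romanized_name)
  end
end

# Toy transliterator: two fake systems that differ in how they map "и".
TOY = lambda do |system, text|
  table = system == "sys-a" ? { "и" => "y" } : { "и" => "i" }
  text.chars.map { |c| table.fetch(c, c) }.join
end

deduce_system("ми", "my", %w[sys-a sys-b], TOY)
```

In the real script the callable would wrap Interscript itself, and ties or large distances would be reported as "undetermined" rather than forced to a single system.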
The task is to:
- provide "correct" transliteration system codes for all the names in these data
- make the reports "per-country", so that the responsible party (which is organized per-country) can review/correct the data

Concretely:
- for every row in each country's data, validate the transliteration system and, if it is incorrect or empty, detect the few closest transliteration systems used
- for each country's data, provide a CSV file containing the "corrections" that need to be made and the reason for each
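The per-country CSV output could look like the following sketch, using Ruby's standard `csv` library. The column names, the `:country`/`:suggested`/`:valid` fields, and the idea of returning one CSV string per country (rather than writing files) are all assumptions made for illustration.

```ruby
require "csv"

# Build one corrections CSV per country from rows that already carry
# a validation verdict (:valid) and a suggested system (:suggested).
def corrections_csv(rows)
  rows.group_by { |r| r[:country] }.transform_values do |country_rows|
    CSV.generate do |csv|
      csv << %w[UNI NT system_current system_suggested reason]
      country_rows.each do |r|
        next if r[:valid] # only rows needing correction are reported
        csv << [r[:uni], r[:nt], r[:system], r[:suggested], r[:reason]]
      end
    end
  end
end

reports = corrections_csv([
  { country: "UA", uni: 1, nt: "NS", system: "",
    suggested: "example-sys", reason: "empty system code", valid: false },
  { country: "UA", uni: 2, valid: true },
])
```

A real implementation would write each value to a `<country>.csv` file so the per-country reviewers each receive only their own slice.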
Originally posted by @webdev778 in https://github.com/interscript/interscript-ruby/issues/38#issuecomment-1923357020