interscript / geonames-transliteration-data

GeoNames data parsed into transliteration pairs
2 stars 0 forks source link

= GeoNames data for transliteration testing

image:https://github.com/interscript/geonames-transliteration-data/workflows/build/badge.svg["Build Status", link="https://github.com/interscript/geonames-transliteration-data/actions?workflow=build"]

== Purpose

Extract transliteration pairs (entries that are coded with transliteration systems) for testing of those transliteration systems.

LAST UPDATED: See https://github.com/interscript/geonames-transliteration-data/releases[releases] page.

== Usage

=== Basic

[source,bash]

make all

Achieves all the steps below.

=== Check latest release date of GNDB

Due to recent instabilities of pages that serve GNDB releases, the make checkdate target is created in order to fetch the latest release date (once or multiple times) to ensure that a consistent version is served across all GNDB load balancers.

[source,bash]

Check once

make checkdate

Check 10 times

CHECKDATE_TIMES=10 make checkdate

All dates displayed should be identical. If interleaving entries are shown, there is an issue with endpoint consistency.

=== Create the GeoNames database for filtering

Create a SQLite3 database using the "all countries" GeoNames data set.

[source,bash]

make geonames.db

Internally it will run this:

[source,bash]

$ curl -sL http://geonames.nga.mil/gns/html/cntyfile/geonames_20200106.zip -o geonames.zip $ unzip geonames.zip -d data $ sqlite sqlite> .mode tabs sqlite> .import geonames_20200106/Countries.txt countries sqlite> .backup geonames.db sqlite> exit

=== Extract all transliteration pairs first

[source,bash]

make geonames_pairs.db

You will get geonames_pairs.csv which contains all transliteration pairs.

[source,csv]

TRANSL_CD|SRC_UNI|SRC_NT|SRC_FULL_NAME_RO|SRC_FULL_NAME_RG|DEST_UNI|DEST_NT|DEST_FULL_NAME_RO|DEST_FULL_NAME_RG|LC zho_Hani2Latn_WDG_1979|475320|D|Qiantangjiang Zhan|Qiantangjiang Zhan|12277331|DS|钱塘江站|钱塘江站|zho zho_Hani2Latn_WDG_1979|14673196|V|Tung-men Chiao|Tung-men Chiao|14633249|VS|东门礁|东门礁|zho zho_Hani2Latn_WDG_1979|14673197|V|Hsi-men Chiao|Hsi-men Chiao|14633249|VS|东门礁|东门礁|zho ...

=== Extracting for every transliteration system

[source,bash]

make all

Your output files will be stored in pairs/${translit_system}.csv and look like this:

[source,csv]

TRANSL_CD,SRC_UNI,SRC_NT,SRC_FULL_NAME_RO,SRC_FULL_NAME_RG,DEST_UNI,DEST_NT,DEST_FULL_NAME_RO,DEST_FULL_NAME_RG,LC ara_Arab2Latn_BUC_2002,17550392,N,"Wādī al Khuḑrah","Wādī al Khuḑrah",17550393,NS,"وادي الخضرة","وادي الخضرة",ara ara_Arab2Latn_BUC_2002,-3489887,D,"Wādī ad Dafīn","Wādī ad Dafīn",14461283,DS,"وادي الدفين","وادي الدفين",ara ara_Arab2Latn_BUC_2002,18101655,N,"Shi‘b Ḩāzirin","Ḩāzirin, Shi‘b",18101656,NS,"شعب حازرن","شعب حازرن",ara ara_Arab2Latn_BUC_2002,14363956,N,"Būghāz Raqm Ithnān","Raqm Ithnān, Būghāz",14305791,NS,"بوغاز رقم ٢","بوغاز رقم ٢",ara ...

== Column description

=== TRANSL_CD

The transliteration system used to generate the DEST_* name.

=== LC

Language code.

=== {SRC, DEST}_UNI

ID of the source/destination name.

=== {SRC, DEST}_NT

Type of source/destination name.

=== SRC_FULLNAME{RO,RG}

Source name. RO => full name. RG => name with grouped feature, e.g. "Lake Nemo" in RO will be formatted as "Nemo, Lake" in RG.

=== NT column values

In the *_NT columns, rows with values DS, NS, VS are always the name source, the rest are generated.

The meaning of *_NT values are described here: http://geonames.nga.mil/gns/html/rest/lookuptables.html#Name%20Type%20Codes

|=== | NT_CD | DESCRIPTION | DEFINITION

| C | Conventional | A commonly used English-language name approved by the U.S. Board on Geographic Names (BGN) for use in addition to, or in lieu of, a BGN-approved local official name or names, e.g., Rome, Alps, Danube River. | D | Unverified | A name from a source whose official status can not be verified by the BGN. | DS | Unverified Non-Roman Script | The non-Roman script form of a name from a source whose official status can not be verified by the BGN. | N | Approved | The BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country. | NS | Non-Roman Script | The non-Roman script form of the BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country. | P | Provisional | A geographic name of an area for which the territorial status is not finally determined or not recognized by the United States. | V | Variant | A former name, name in local usage, or other spelling found on various sources. | VA | Anglicized Variant | An English-language name that is derived by modifying the local official name to render it more accessible or meaningful to an English-language user. | VS | Variant Non-Roman Script | The non-Roman script form of a former name, name in local usage, or other spelling found on various sources.

|===

== Credits

Copyright Ribose.