= GeoNames data for transliteration testing
image:https://github.com/interscript/geonames-transliteration-data/workflows/build/badge.svg["Build Status", link="https://github.com/interscript/geonames-transliteration-data/actions?workflow=build"]
== Purpose
This repository extracts transliteration pairs (entries that are coded with a transliteration system) from GeoNames data so that those transliteration systems can be tested.
LAST UPDATED: See https://github.com/interscript/geonames-transliteration-data/releases[releases] page.
== Usage
=== Basic
Runs all of the steps described below.
=== Check latest release date of GNDB
Because the pages that serve GNDB releases have recently been unstable, the
`make checkdate` target fetches the latest release date (once or multiple
times) to verify that a consistent version is served across all GNDB
load balancers.

[source,sh]
----
make checkdate
----

All dates displayed should be identical. If differing dates appear interleaved, the endpoints are serving inconsistent versions.
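
The consistency check described above can be sketched in Python. This is a minimal illustration, not the implementation behind `make checkdate`: the function name `summarize_release_dates` is hypothetical, and it assumes the release dates have already been fetched as plain strings.

```python
from collections import Counter


def summarize_release_dates(dates):
    """Return (is_consistent, counts) for a list of fetched date strings."""
    # Strip whitespace so trailing newlines do not cause false mismatches.
    cleaned = [d.strip() for d in dates if d.strip()]
    counts = Counter(cleaned)
    # Consistent only if at least one date was fetched and all of them agree.
    return len(counts) == 1, dict(counts)


if __name__ == "__main__":
    # Hypothetical sample values, not real GNDB release dates.
    ok, counts = summarize_release_dates(["2020-01-06", "2020-01-06", "2020-01-13"])
    print(ok, counts)
```

More than one distinct value in `counts` corresponds to the mixed output that signals an endpoint consistency problem.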
=== Create the GeoNames database for filtering
Create a SQLite3 database using the "all countries" GeoNames data set.
Internally it will run this:
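
As a rough illustration only (this is not the project's actual command), loading a tab-separated data set into SQLite might look like the following Python sketch. The function name `load_tsv`, the table name, and the assumption that the file has a header row and tab delimiters are all hypothetical.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, tsv_file):
    """Load a tab-separated file with a header row into a new SQLite table.

    Returns the number of rows stored. Every column is created as TEXT.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    # Skip rows whose field count does not match the header.
    rows = (row for row in reader if len(row) == len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


if __name__ == "__main__":
    # Hypothetical sample data; the real data set has many more columns.
    sample = io.StringIO("UNI\tNT\tLC\n1\tNS\txx\n2\tN\txx\n")
    conn = sqlite3.connect(":memory:")
    print(load_tsv(conn, "names", sample))
```

When reading a real file instead of the in-memory sample, open it with `newline=""` and an explicit `encoding="utf-8"` so the `csv` module handles line endings itself.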
=== Extract all transliteration pairs first
You will get `geonames_pairs.csv`, which contains all transliteration pairs.
=== Extracting for every transliteration system
Your output files will be stored in `pairs/${translit_system}.csv` and look like this:
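
The per-system split itself can be sketched in Python roughly as follows. This is an illustration, not the project's extraction script, and it does not reproduce the actual file contents. It assumes that `geonames_pairs.csv` is comma-separated with a header row containing a `TRANSL_CD` column; the function name `split_pairs_by_system` and the sample rows are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_pairs_by_system(pairs_csv, out_dir, system_column="TRANSL_CD"):
    """Write one CSV per transliteration system; return {system: row_count}."""
    groups = defaultdict(list)
    with open(pairs_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[system_column]].append(row)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for system, rows in groups.items():
        # Each output file keeps the original header and column order.
        with open(out / f"{system}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return {system: len(rows) for system, rows in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample rows; the system codes are illustrative only.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "geonames_pairs.csv"
        src.write_text(
            "TRANSL_CD,LC,SRC_UNI,DEST_UNI\nSYS_A,xx,1,2\nSYS_B,yy,3,4\nSYS_A,xx,5,6\n",
            encoding="utf-8",
        )
        print(split_pairs_by_system(src, Path(tmp) / "pairs"))
```

Grouping in memory keeps the sketch short; for a very large pairs file, streaming rows to already-open writers would avoid holding every row at once.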
== Column description
=== TRANSL_CD
The transliteration system used to generate the DEST_* name.
=== LC
Language code.
=== {SRC, DEST}_UNI
ID of the source/destination name.
=== {SRC, DEST}_NT
Type of source/destination name.
=== SRC_FULLNAME{RO,RG}
Source name. `RO` is the full name. `RG` is the name with the feature term grouped, e.g. "Lake Nemo" in `RO` is formatted as "Nemo, Lake" in `RG`.
=== NT column values
In the `*_NT` columns, rows with the values `DS`, `NS`, or `VS` are always the name source; the rest are generated.
The meanings of the `*_NT` values are described here:
http://geonames.nga.mil/gns/html/rest/lookuptables.html#Name%20Type%20Codes

|===
| NT_CD | DESCRIPTION | DEFINITION

| C
| Conventional
| A commonly used English-language name approved by the U.S. Board on Geographic Names (BGN) for use in addition to, or in lieu of, a BGN-approved local official name or names, e.g., Rome, Alps, Danube River.

| D
| Unverified
| A name from a source whose official status can not be verified by the BGN.

| DS
| Unverified Non-Roman Script
| The non-Roman script form of a name from a source whose official status can not be verified by the BGN.

| N
| Approved
| The BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country.

| NS
| Non-Roman Script
| The non-Roman script form of the BGN-approved local official name for a geographic feature. Except for countries with more than one official language; there is normally only one such name for a feature wholly within a country.

| P
| Provisional
| A geographic name of an area for which the territorial status is not finally determined or not recognized by the United States.

| V
| Variant
| A former name, name in local usage, or other spelling found on various sources.

| VA
| Anglicized Variant
| An English-language name that is derived by modifying the local official name to render it more accessible or meaningful to an English-language user.

| VS
| Variant Non-Roman Script
| The non-Roman script form of a former name, name in local usage, or other spelling found on various sources.
|===

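
The split between source and generated name types can be expressed as a small lookup. The following Python sketch copies the codes and descriptions from the table above; the helper names `is_source_name` and `describe_nt` are hypothetical and not part of this repository.

```python
# Name type codes and their descriptions, copied from the table above.
NT_DESCRIPTIONS = {
    "C": "Conventional",
    "D": "Unverified",
    "DS": "Unverified Non-Roman Script",
    "N": "Approved",
    "NS": "Non-Roman Script",
    "P": "Provisional",
    "V": "Variant",
    "VA": "Anglicized Variant",
    "VS": "Variant Non-Roman Script",
}

# Codes that always mark the name source; all other codes are generated names.
SOURCE_NT_CODES = frozenset({"DS", "NS", "VS"})


def is_source_name(nt_code):
    """Return True if the name type code marks a source name."""
    return nt_code.strip().upper() in SOURCE_NT_CODES


def describe_nt(nt_code):
    """Return a short label such as 'NS: Non-Roman Script (source)'."""
    code = nt_code.strip().upper()
    if code not in NT_DESCRIPTIONS:
        raise ValueError(f"unknown name type code: {nt_code!r}")
    role = "source" if code in SOURCE_NT_CODES else "generated"
    return f"{code}: {NT_DESCRIPTIONS[code]} ({role})"


if __name__ == "__main__":
    for code in ("NS", "N", "VS"):
        print(describe_nt(code))
```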
== Credits
Copyright Ribose.