interscript / geonames-transliteration-data

GeoNames data parsed into transliteration pairs

Detecting transliteration systems used in GeoNames data set #3

Open ronaldtse opened 4 years ago

ronaldtse commented 4 years ago

The GeoNames data set contains entries like these:

Row 2 here contains "西博寮海峽", which has a NAME_LINK entry pointing to Row 4, "West Lamma Channel". Both are names of the same geographic location, but neither is a transliteration of the other. The row that actually transliterates Row 2 is Row 1, "Sai Puk Liu Hoi Hap".

[Screenshot, 2020-01-12: the GeoNames rows (Rows 1 to 4) discussed here]

There are two problems here:

  1. Row 1 should have its NAME_LINK pointing to Row 2 (i.e. its NAME_LINK should be -1950489, because that is Row 2's UID and NAME_LINK is supposed to be bidirectional). Its TRANSL_CD should also be set to the code for the Cantonese transliteration system, because Row 1 is generated by transliterating Row 2.

  2. Row 3 is also generated from Row 2, so its TRANSL_CD should be set to the code for the Mandarin transliteration system. However, it is unclear what its NAME_LINK should be set to, because NAME_LINK seems to support only pairing two entries, not a one-to-many relationship.

The point of this task is to detect that Row 3 comes from Row 2, detect the transliteration system, and pair them in the output we produce.
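One way the output could sidestep the one-to-one limit of NAME_LINK is to emit one record per (source, destination) pair, so a single source row may appear in several records. The following is only a sketch: `TransliterationPair`, `group_by_source`, and the system labels `"cantonese"` and `"mandarin"` are hypothetical names chosen for illustration, and rows are identified by their row number in the example above rather than by real UIDs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TransliterationPair:
    """One detected pairing (hypothetical output record, not a GeoNames format)."""
    source_row: int  # row holding the original-script name
    dest_row: int    # row holding the transliterated name
    system: str      # label of the detected transliteration system


def group_by_source(pairs):
    """Collect all pairs that share a source row (one-to-many)."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.source_row].append(pair)
    return dict(grouped)


# Rows 1 and 3 are both transliterations of Row 2.
pairs = [
    TransliterationPair(source_row=2, dest_row=1, system="cantonese"),
    TransliterationPair(source_row=2, dest_row=3, system="mandarin"),
]
by_source = group_by_source(pairs)
```

Here `by_source[2]` holds both pairs, which is the relationship NAME_LINK alone cannot express.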

ronaldtse commented 4 years ago

Maybe the problem can be worded simply as this:

Given:

Detect which transliteration systems generated those destination strings.

I guess the easy way is to run all transliteration systems on all source strings and match the output against the destination strings.
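A minimal sketch of that brute-force approach follows. Everything in it is an assumption made for illustration: the two "systems" are toy per-character lookup tables rather than real transliteration systems, `detect_systems` and `normalize` are hypothetical helpers, and the destination "Xiboliao Haixia" is a made-up stand-in for Row 3, whose actual text is not quoted in this issue. Matching ignores case and whitespace, which is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

Transliterator = Callable[[str], str]


def make_table_transliterator(table: Dict[str, str]) -> Transliterator:
    """Build a toy per-character transliterator (stand-in for a real system)."""
    def transliterate(text: str) -> str:
        # Unmapped characters become "?", so they never match a destination.
        return " ".join(table.get(ch, "?") for ch in text)
    return transliterate


def normalize(text: str) -> str:
    """Compare strings ignoring case and whitespace (an assumption)."""
    return "".join(text.lower().split())


def detect_systems(
    sources: List[str],
    destinations: List[str],
    systems: Dict[str, Transliterator],
) -> List[Tuple[str, str, str]]:
    """Run every system on every source; report (source, destination, system) hits."""
    lookup: Dict[str, List[str]] = {}
    for dest in destinations:
        lookup.setdefault(normalize(dest), []).append(dest)

    matches = []
    for source in sources:
        for name, transliterate in systems.items():
            for dest in lookup.get(normalize(transliterate(source)), []):
                matches.append((source, dest, name))
    return matches


# Toy tables covering only the characters in the example name.
systems = {
    "toy_cantonese": make_table_transliterator(
        {"西": "Sai", "博": "Puk", "寮": "Liu", "海": "Hoi", "峽": "Hap"}
    ),
    "toy_mandarin": make_table_transliterator(
        {"西": "Xi", "博": "bo", "寮": "liao", "海": "Hai", "峽": "xia"}
    ),
}
sources = ["西博寮海峽"]
destinations = ["Sai Puk Liu Hoi Hap", "West Lamma Channel", "Xiboliao Haixia"]
matches = detect_systems(sources, destinations, systems)
```

In this sketch "West Lamma Channel" matches no system, so it is correctly left out of the pairs. The cost grows with sources × systems, which may matter on the full data set.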