Open ronaldtse opened 4 years ago
Maybe the problem can be worded simply as this:
Given:
Detect which transliteration systems generated those destination strings.
I guess the easy way is to run every transliteration system on every source string and match the outputs against the destination strings.
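The brute-force idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the system codes (`toy-cantonese`, `toy-mandarin`), the per-character lookup tables, and the function names are all hypothetical stand-ins for real transliteration systems, and the Mandarin destination string is an assumed per-character pinyin rendering, not a value taken from the data set.

```python
from typing import Callable, Dict, List, Tuple

# Toy per-character tables (assumptions for illustration only).
# The Cantonese readings follow the "Sai Puk Liu Hoi Hap" example;
# the Mandarin readings are plain toneless pinyin.
CANTONESE = {"西": "Sai", "博": "Puk", "寮": "Liu", "海": "Hoi", "峽": "Hap"}
MANDARIN = {"西": "Xi", "博": "Bo", "寮": "Liao", "海": "Hai", "峽": "Xia"}


def table_system(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a hypothetical transliteration system from a lookup table."""
    def transliterate(source: str) -> str:
        # Unknown characters are passed through unchanged.
        return " ".join(table.get(ch, ch) for ch in source)
    return transliterate


SYSTEMS: Dict[str, Callable[[str], str]] = {
    "toy-cantonese": table_system(CANTONESE),
    "toy-mandarin": table_system(MANDARIN),
}


def normalize(text: str) -> str:
    """Compare case-insensitively and ignore spacing differences."""
    return " ".join(text.casefold().split())


def detect_systems(
    sources: List[str],
    destinations: List[str],
    systems: Dict[str, Callable[[str], str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each destination to the (source, system code) pairs that produce it."""
    matches: Dict[str, List[Tuple[str, str]]] = {d: [] for d in destinations}
    for source in sources:
        for code, transliterate in systems.items():
            produced = normalize(transliterate(source))
            for dest in destinations:
                if normalize(dest) == produced:
                    matches[dest].append((source, code))
    return matches


result = detect_systems(
    ["西博寮海峽"],
    ["Sai Puk Liu Hoi Hap", "Xi Bo Liao Hai Xia", "West Lamma Channel"],
    SYSTEMS,
)
```

A destination with an empty match list (here "West Lamma Channel") is a name that no known system generates from any source, which is the case for names that are related by meaning rather than by transliteration. Exact matching is the simplest choice; real data would probably need fuzzier comparison.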
The GeoNames data set contains entries like these:
Row 2 here writes "西博寮海峽", which has a NAME_LINK entry to Row 4, "West Lamma Channel". However, while they are both names of the same geographic location, they are not related by transliteration. The actual transliteration of Row 2 is Row 1, "Sai Puk Liu Hoi Hap".

There are two problems here:
Row 1 should have NAME_LINK pointing to Row 2 (i.e. its NAME_LINK should be -1950489, because Row 2 has this UID and NAME_LINK is supposed to be bi-directional) and should have its TRANSL_CD code set to the Cantonese transliteration system, because it is generated by transliterating Row 2.

Row 3 is also generated from Row 2, and should have its TRANSL_CD code set to the Mandarin transliteration system, because it is generated by transliterating Row 2. However, it is unclear what NAME_LINK should be set to, because NAME_LINK seems to only support pairing two entities, not a one-to-many relationship.

The point of this task is to detect that Row 3 comes from Row 2, detect the transliteration system, and pair them in the output we produce.
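One way the paired output might look, given that a single NAME_LINK field cannot express one-to-many, is a flat list of (source, destination, system) records. The sketch below is an assumption about a possible output shape, not a description of any existing format: the `Name` and `Pairing` types, the system codes, and the whole-string lookup tables are hypothetical, rows are identified by the row numbers used in this issue rather than by real UIDs, and the Row 3 text is an assumed pinyin rendering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Name:
    row: int   # row number as used in this issue, not a real UID
    text: str


@dataclass(frozen=True)
class Pairing:
    source_row: int
    dest_row: int
    system: str  # hypothetical transliteration system code


def lookup_system(table: Dict[str, str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real transliteration system."""
    return lambda source: table.get(source, source)


# Toy systems (assumptions): each knows exactly one source string.
SYSTEMS: Dict[str, Callable[[str], str]] = {
    "toy-cantonese": lookup_system({"西博寮海峽": "Sai Puk Liu Hoi Hap"}),
    "toy-mandarin": lookup_system({"西博寮海峽": "Xi Bo Liao Hai Xia"}),
}


def pair_names(
    sources: List[Name],
    candidates: List[Name],
    systems: Dict[str, Callable[[str], str]],
) -> List[Pairing]:
    """Emit one Pairing per (source, candidate, system) match."""
    pairings: List[Pairing] = []
    for src in sources:
        for code, transliterate in systems.items():
            expected = transliterate(src.text).casefold()
            for cand in candidates:
                if cand.text.casefold() == expected:
                    pairings.append(Pairing(src.row, cand.row, code))
    return pairings


pairings = pair_names(
    [Name(2, "西博寮海峽")],
    [
        Name(1, "Sai Puk Liu Hoi Hap"),
        Name(3, "Xi Bo Liao Hai Xia"),  # assumed text for Row 3
        Name(4, "West Lamma Channel"),
    ],
    SYSTEMS,
)
```

Because each pairing is its own record, Row 2 can appear as the source of both Row 1 and Row 3 without needing more than one link field per row, while Row 4 simply produces no pairing.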