pelias / whosonfirst

Importer for Who's on First gazetteer
MIT License

fix(doc): Do not allow duplicate names to be created #511

Closed orangejulius closed 3 years ago

orangejulius commented 4 years ago

While diagnosing an issue related to scoring, I discovered that WOF records are sometimes created with duplicate name values. While the pelias/model code can detect some of them (and more will be fixed with https://github.com/pelias/model/pull/132), we could also fix this issue at the source.

Here's an example of what a document might look like today, before this PR:

{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}

This can be fixed by checking each potential alternate name against the "primary" name value.
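As a rough illustration of that check, here is a minimal sketch. The `dedupeNames` helper and its arguments are hypothetical and are not taken from the actual importer code. It also skips alternates that repeat each other, which goes slightly beyond the check described above.

```javascript
// Hypothetical helper, NOT the actual pelias/whosonfirst implementation.
// Returns the alternate names that differ from the primary name,
// also dropping alternates that repeat an earlier alternate.
function dedupeNames(primary, alternates) {
  const seen = new Set([primary]);
  const unique = [];

  for (const name of alternates) {
    // Ignore anything that is not a usable string.
    if (typeof name !== 'string' || name.length === 0) continue;
    // Skip names already used as the primary or as an earlier alternate.
    if (seen.has(name)) continue;

    seen.add(name);
    unique.push(name);
  }

  return unique;
}

const altNames = dedupeNames('Kansas City', ['Kansas City', 'KC', 'KC']);
console.log(altNames); // [ 'KC' ]
```

The comparison here is an exact string match. Whether names that differ only in case or whitespace should count as duplicates is a separate decision this sketch does not make.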

We might not want to merge this PR, since it only fixes the issue in this repo, but it might also be nice to test this change in a single repository first.

orangejulius commented 4 years ago

Just some commentary: this issue was super hard to track down!

I was looking at an issue where the much more populous and better-known Kansas City, MO was being ranked below Kansas City, KS.

The documents were otherwise identical, but the score showed that matches on the phrase field were being adjusted based on a field length of 4 for the Kansas City, MO record (from WOF), but only 2 for the Kansas City, KS record (from Geonames).

I had to dig into the documents generated by both importers to learn that the difference came from duplicate values in the phrase field. The change tracked in https://github.com/pelias/schema/issues/285, which would let us stop using a hidden phrase field, can't come soon enough!!
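To make the field-length effect concrete, here is a small sketch. It assumes a whitespace tokenizer as a stand-in for the real analyzer and BM25-style length normalization with the commonly used `k1 = 1.2` and `b = 0.75`. The function names, the average length of 3, and holding term frequency at 1 are all assumptions for illustration; in a real index, duplicate values may change term frequency as well.

```javascript
// Rough stand-in for an analyzer: lowercase and split on whitespace.
// The real Pelias analyzers do more than this.
function fieldLength(values) {
  return values
    .flatMap((value) => value.toLowerCase().split(/\s+/))
    .filter((token) => token.length > 0).length;
}

// BM25 term-frequency component with length normalization.
// k1 and b are assumed defaults, not values read from a Pelias config.
function bm25TermWeight(tf, docLength, avgLength, k1 = 1.2, b = 0.75) {
  return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (docLength / avgLength)));
}

// The duplicated phrase value doubles the token count of the field.
const wofLength = fieldLength(['Kansas City', 'Kansas City']);
const geonamesLength = fieldLength(['Kansas City']);
console.log(wofLength, geonamesLength); // 4 2

// Assumed average field length, chosen only for the comparison.
const avgLength = 3;
const wofWeight = bm25TermWeight(1, wofLength, avgLength);
const geonamesWeight = bm25TermWeight(1, geonamesLength, avgLength);
console.log(wofWeight < geonamesWeight); // true
```

With everything else equal, the longer field gets the smaller weight, which matches the direction of the ranking difference described above.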

Looking back, we've often been confused as to why Geonames records for a given admin area seem to be preferred, and this might be the reason! So hopefully results will be much better with this PR and/or https://github.com/pelias/model/pull/132