Closed orangejulius closed 4 years ago
Just some commentary: this issue was super hard to track down!
I was looking at an issue where the much more populated and well known Kansas City, MO was being ranked below Kansas City, KS.
The documents were identical, except the score showed that matches on the phrase field were being adjusted based on a field length of 4 for the Kansas City, MO record (from WOF), but only 2 for the Kansas City, KS record (from geonames).
I had to dig into the documents generated by both importers to learn that the difference came from duplicate values in the phrase field, which inflated the field length for the WOF record. https://github.com/pelias/schema/issues/285, which would allow us to stop using a hidden phrase field, can't come soon enough!
Looking back, we've often been confused as to why Geonames records for a given admin area seem to be preferred, and this might be the reason! So hopefully results will be much better with this PR and/or https://github.com/pelias/model/pull/132
While diagnosing an issue related to scoring, I discovered that WOF records are sometimes created with duplicate name values. While the pelias/model code can detect some of them (and more will be fixed with https://github.com/pelias/model/pull/132), we could also fix this issue at the source.
Here's an example of what a document might look like today, before this PR:
This can be fixed by checking each potential alternate name against the "primary" name value.
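As a rough illustration of that check, here is a minimal sketch. The function name `dedupeAlternateNames` and its signature are hypothetical and are not taken from this repository or from pelias/model. It assumes names are plain strings and that exact string equality is a good enough test for "duplicate".

```javascript
// Hypothetical helper for illustration only, not the importer's actual code.
// Returns the alternate names that differ from the primary name and from
// each other, preserving their original order.
function dedupeAlternateNames(primaryName, alternateNames) {
  // Seed the set with the primary name so any alternate equal to it is skipped.
  const seen = new Set([primaryName]);
  const unique = [];

  for (const name of alternateNames) {
    // Ignore anything that is not a non-empty string.
    if (typeof name !== 'string' || name.length === 0) {
      continue;
    }
    // Skip exact duplicates of the primary name or of an earlier alternate.
    if (seen.has(name)) {
      continue;
    }
    seen.add(name);
    unique.push(name);
  }

  return unique;
}

// Example: the repeated "Kansas City" values are dropped, "KC" is kept.
console.log(dedupeAlternateNames('Kansas City', ['Kansas City', 'KC', 'Kansas City']));
```

This sketch compares names exactly, so values that differ only in case or whitespace would still be kept as separate alternates.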
We might not want to merge this PR, since it only fixes the issue in this repo, but it might also be nice to test this change in a single repository first.