pelias / model

Pelias data models

fix(deduplication): Deduplicate values in phrase field #132

Closed orangejulius closed 4 years ago

orangejulius commented 4 years ago

https://github.com/pelias/model/pull/118 added support for removing duplicate values from the name field. However, the same logic was not applied to the phrase field.

Duplicate values do not affect whether a particular document matches a given query, but they do affect its score.

In some cases, the scoring boost for having tokens match twice from duplicates will over-rank a particular result. In other cases, the scoring penalty for having longer fields will under-rank a particular result.
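
To make both effects concrete, here is a rough sketch of the term-frequency part of a BM25-style score. It assumes a BM25 similarity with the common defaults `k1 = 1.2` and `b = 0.75`; the `tfNorm` function and the token counts are made up for illustration and are not taken from Pelias or its Elasticsearch configuration.

```javascript
// Illustrative only: the term-frequency component of a BM25-style score.
// k1 and b are the commonly used defaults; all inputs below are invented.
function tfNorm(freq, fieldLength, avgFieldLength, k1 = 1.2, b = 0.75) {
  const lengthNorm = 1 - b + b * (fieldLength / avgFieldLength);
  return (freq * (k1 + 1)) / (freq + k1 * lengthNorm);
}

const avgLen = 2; // assumed average number of tokens in the field

// Over-ranking: the query token appears twice because a value was duplicated.
const single = tfNorm(1, 1, avgLen);     // field tokens: "paris"
const duplicated = tfNorm(2, 2, avgLen); // field tokens: "paris paris"
console.log(duplicated > single); // true

// Under-ranking: a duplicated, non-matching token makes the field longer.
const clean = tfNorm(1, 2, avgLen);  // field tokens: "paris france"
const padded = tfNorm(1, 3, avgLen); // field tokens: "paris france france"
console.log(padded < clean); // true
```

In both cases the document matches the query either way; only its relative score changes.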

To make sure our scoring is as fair as possible (pending other issues such as https://github.com/pelias/openstreetmap/issues/507), we should apply our current deduplication on both the name and phrase fields.
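
As a minimal sketch of the idea (not the actual code in this repository), deduplication across both fields could look something like the following. The `dedupeValues` and `dedupeFields` helpers and the document shape are hypothetical, assumed here only for illustration.

```javascript
// Hypothetical helpers: a sketch of the idea, not the pelias/model implementation.

// Remove exact duplicate strings from an array, keeping first-seen order.
// Non-array values (e.g. a single string) are returned unchanged.
function dedupeValues(value) {
  if (!Array.isArray(value)) { return value; }
  return Array.from(new Set(value));
}

// Apply the same deduplication to every language key of each listed field.
function dedupeFields(doc, fields = ['name', 'phrase']) {
  for (const field of fields) {
    const byLang = doc[field];
    if (!byLang) { continue; }
    for (const lang of Object.keys(byLang)) {
      byLang[lang] = dedupeValues(byLang[lang]);
    }
  }
  return doc;
}

// Assumed document shape: field -> language -> string or array of strings.
const example = {
  name: { default: ['Paris', 'Paris', 'City of Paris'] },
  phrase: { default: ['Paris', 'Paris', 'City of Paris'], fr: 'Paris' }
};

dedupeFields(example);
console.log(example.phrase.default);
```

Using a `Set` only removes exact string matches; values that differ in case or whitespace would be kept as distinct.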

orangejulius commented 4 years ago

While we definitely want to merge this PR, be sure to read the discussion over at https://github.com/pelias/whosonfirst/pull/511 before doing so. I'd also like to do some testing before rolling it out everywhere.

orangejulius commented 4 years ago

I realized this could help slightly mitigate the problems from https://github.com/pelias/openstreetmap/issues/507 with OSM venues, so I'm just going to merge it and roll it out everywhere. Hopefully we see some improvements!