pelias / openstreetmap

Import pipeline for OSM into Pelias
MIT License

Detect altnames that are a substring of name.default #548

Open orangejulius opened 3 years ago

orangejulius commented 3 years ago

This change is an attempt to mitigate scoring penalties applied to documents with alternate names (https://github.com/pelias/openstreetmap/issues/507).

It handles the case where an alt name is merely a substring of the main name, for example on the Union Square subway stop in OSM:

[image: OSM tags for the Union Square subway stop]

Alt names like this don't add much value: they don't allow searching on any new terms, but do throw off the scoring. Even when we fix the scoring issue, duplicate alt names that add no value still take up space, so this change should be useful for a while.
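
As a rough illustration of the check described above (not the code in this PR), here is a minimal Python sketch. The function name, the case-insensitive comparison, and the sample station names are all assumptions made for the example.

```python
def drop_substring_altnames(default_name: str, alt_names: list[str]) -> list[str]:
    """Return the alt names that are NOT a substring of the default name.

    Hypothetical helper: the comparison is case-insensitive and ignores
    surrounding whitespace, which is an assumption rather than confirmed
    behaviour of the importer.
    """
    haystack = default_name.casefold()
    kept = []
    for alt in alt_names:
        needle = alt.strip().casefold()
        # An alt name fully contained in the default name adds no new
        # search terms, so it is skipped.
        if needle and needle in haystack:
            continue
        kept.append(alt)
    return kept


# Hypothetical names, loosely modelled on the subway stop example.
remaining = drop_substring_altnames(
    "14th Street - Union Square",
    ["Union Square", "USQ"],
)
```

Here `"Union Square"` is dropped because it is contained in the default name, while `"USQ"` is kept because it introduces a term the default name does not cover.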

The change comes in two parts, each in its own commit:

I'd be happy to extend this in the future with other near-identical alt names, such as handling Mc Donalds vs McDonalds or ignoring quotes or other special characters like in https://github.com/pelias/api/issues/1488.
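
One possible shape for that extension, sketched in Python purely for illustration: compare names after stripping whitespace, quotes, and other punctuation. The `normalize` and `is_near_duplicate` names and the exact normalization rules are assumptions, not something this PR implements.

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and remove everything except letters and digits.

    Assumed normalization rule: \\W matches any non-word character
    (spaces, quotes, punctuation); the underscore is removed explicitly
    because \\w treats it as a word character.
    """
    return re.sub(r"[\W_]+", "", name.casefold())


def is_near_duplicate(default_name: str, alt_name: str) -> bool:
    """True when both names are identical after normalization."""
    return normalize(default_name) == normalize(alt_name)


# "Mc Donalds" vs "McDonalds" differ only by a space.
spaced = is_near_duplicate("McDonalds", "Mc Donalds")
# Quoted variants differ only by special characters.
quoted = is_near_duplicate("McDonalds", '"McDonalds"')
```

A rule this aggressive would also merge names that differ only in punctuation but are genuinely distinct, so it would need testing against real data before being relied on.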

orangejulius commented 2 years ago

I came across this PR today and wanted to see if it still made a difference, so I've rebased it and kicked off a planet build to test things out.