osm-search / Nominatim

Open Source search based on OpenStreetMap data
https://nominatim.org
GNU General Public License v3.0

Making Nominatim's geocoder laxer on unknown words / invalid tokens and return a match score #145

Open tommedema opened 10 years ago

tommedema commented 10 years ago

I use Nominatim's geocoder for a specific purpose: geocoding long, detailed address strings. Some elements of these strings are not recognized by Nominatim, while others are.

For example, the string may contain a building name, floor number and unit number that are all unknown -- while it also contains a street address that is known.

In Nominatim's current state, this situation yields 0 results, even though the street address is known. As an example, try a search on "gebouw A verdieping 2 melkweg 24 groningen" (Dutch for Building A Level 2 Melkweg 24 Groningen).

For my purpose, the words "gebouw A verdieping 2" should not cause the query to return no results. Rather it should return the result that matches my original search query the most, in this case "melkweg 24 groningen".

I already noticed that I can achieve much of this by uncommenting the following in Geocoder.php:

else
{
    // Allow skipping a word - but at EXTREAM cost
    //$aSearch = $aCurrentSearch;
    //$aSearch['iSearchRank']+=100;
    //$aNewWordsetSearches[] = $aSearch;
}

and changing $aSearch['iSearchRank']+=100; to something more realistic like $aSearch['iSearchRank']+=5; (otherwise no results are returned anyway).
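To illustrate why +100 yields nothing while +5 does, here is a minimal sketch of the idea in Python. It is not Nominatim's code: the names `expand`, `SKIP_PENALTY` and `MAX_SEARCH_RANK`, and the cutoff value of 20, are assumptions made up for this example. The real Geocoder.php builds many alternative searches; this sketch only builds one.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration; not Nominatim's actual constants.
SKIP_PENALTY = 5
MAX_SEARCH_RANK = 20


@dataclass
class Search:
    matched: list = field(default_factory=list)  # tokens that were recognized
    skipped: list = field(default_factory=list)  # tokens dropped at a cost
    rank: int = 0                                # lower is better


def expand(tokens, known_words, skip_penalty=SKIP_PENALTY,
           max_rank=MAX_SEARCH_RANK):
    """Build one search from the tokens, skipping unknown words at a cost.

    Returns None as soon as the accumulated rank exceeds max_rank,
    which mirrors a search being discarded by a rank cutoff.
    """
    search = Search()
    for token in tokens:
        if token.lower() in known_words:
            search.matched.append(token)
        else:
            search.skipped.append(token)
            search.rank += skip_penalty
        if search.rank > max_rank:
            return None
    return search


tokens = "gebouw A verdieping 2 melkweg 24 groningen".split()
known = {"melkweg", "24", "groningen"}

print(expand(tokens, known))                    # four skips at 5 each: rank 20, kept
print(expand(tokens, known, skip_penalty=100))  # first skip already exceeds the cutoff: None
```

With a penalty of 100, a single unknown word pushes the search past any cutoff below 100, so the penalty and the cutoff have to be chosen together.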

However, there are two issues:

  1. In my search results, I am missing an indicator of how closely a result matches the search query. Results are ranked not just by how well they match the search string, but also by the "importance" of the node and potentially by a Wikipedia article. However, I only need to know how relevant the result is to the original search query.
  2. When an invalid token is found (the word is not recognized), an arbitrary iSearchRank increment of 100 is applied. There seems to be a limit above which results with a very high iSearchRank are not returned, so most queries with invalid tokens return nothing, even with the above code uncommented, unless one decreases the number 100. What range of values is sensible for the iSearchRank increment when an invalid token is found? Moreover, where can I raise the rank limit so that "unimportant" results are still returned?

Thanks,

Tom

lonvia commented 10 years ago

Regarding 1), you'll have to add a custom return field. Exact string matches are computed here. Simply count them, add a field to $aResult and return the field in one of the lib/template/search-*.php templates.
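As a rough sketch of what such a field could contain, the following Python counts how many query tokens appear verbatim in a result's display name and stores the ratio on the result. The function names and the `match_score` field are hypothetical, and this is not how Nominatim computes its exact matches; it only shows a score that depends on the query text alone, not on importance.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def match_score(query, display_name):
    """Fraction of query tokens that appear verbatim in the display name."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    result_tokens = set(tokenize(display_name))
    hits = sum(1 for token in query_tokens if token in result_tokens)
    return hits / len(query_tokens)


# Made-up result for illustration; "match_score" is a hypothetical field name.
result = {"display_name": "24, Melkweg, Groningen, Nederland"}
query = "gebouw A verdieping 2 melkweg 24 groningen"
result["match_score"] = match_score(query, result["display_name"])
print(result["match_score"])  # 3 of the 7 query tokens are found
```

A caller can then apply its own threshold to the score without having to untangle it from the importance-based ordering.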

Regarding 2), that happens around here.

General word of warning: this code hasn't really been tested. There is a good chance that you get a lot of false matches.

tommedema commented 10 years ago

Thanks. Do you have any interest in integrating such a "partial match" feature into the main branch? Perhaps with a boolean parameter that defaults to false? If so, I could create a pull request for my changes.

lonvia commented 10 years ago

Certainly. If you can get this to work in a way that the results are meaningful, that would be awesome. It would be good if it could be enabled/disabled on the server side (i.e. with CONST_something), just in case there are performance issues.
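One way to combine the two switches discussed here, sketched in Python: partial matching is used only when the server allows it and the request asks for it, and it defaults to off. The parameter name `partial_match` and the `server_allows` flag are assumptions for illustration; no such setting exists in Nominatim as far as this thread shows.

```python
def partial_match_enabled(server_allows, request_params):
    """Return True only if the server permits partial matching
    and the request explicitly opts in. Defaults to off."""
    raw = str(request_params.get("partial_match", "0")).lower()
    requested = raw in ("1", "true", "yes")
    return bool(server_allows) and requested


print(partial_match_enabled(True, {"partial_match": "1"}))   # True
print(partial_match_enabled(False, {"partial_match": "1"}))  # False: server switch wins
print(partial_match_enabled(True, {}))                       # False: off by default
```

Keeping the server-side switch authoritative means an operator can turn the feature off under load regardless of what clients request.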

danieldegroot2 commented 7 months ago

Similar to #3059: Nominatim finds "NASA Marshall Space Flight Center" but not "NASA's Marshall Space Flight Center" (with possessive 's) if the 's is not explicitly included. NASA is the operator/brand here, but it is also part of the common name. https://www.openstreetmap.org/way/1160794644

Searches: regular / with possessive 's.

Understandable if this is a known or hard issue, or not planned.

matkoniecz commented 5 months ago

One more example, where adding an apartment number causes the query to fail completely, while the query works fine without it:

https://www.openstreetmap.org/search?query=Centrum%20B%207%2F20%2C%20krak%C3%B3w#map=19/57.71158/11.96865

https://www.openstreetmap.org/search?query=Centrum%20B%207%2C%20krak%C3%B3w#map=19/50.07437/20.04077


And one more, where the presence of a postcode breaks the search:

https://www.openstreetmap.org/search?query=Stefana%20%C5%BBeromskiego%20114%2C%2090%2D543%20%C5%81%C3%B3d%C5%BA#map=16/51.7538/19.4514

https://www.openstreetmap.org/search?query=Stefana%20%C5%BBeromskiego%20114%2C%20%C5%81%C3%B3d%C5%BA#map=19/51.75453/19.44963
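Until the geocoder itself tolerates such tokens, a client-side workaround is to retry the query with tokens removed. The Python below is only a sketch of that idea: `geocode_with_fallback` is a made-up helper, and `geocode` stands for any callable that takes a query string and returns a list of results (the fake one here replaces a real request so the example runs offline).

```python
import itertools
import re


def geocode_with_fallback(query, geocode, max_dropped=2):
    """Try the full query first, then retry with up to max_dropped tokens removed.

    Returns (query_that_worked, results), or (None, []) if nothing matched.
    """
    tokens = [t for t in re.split(r"[,\s]+", query) if t]
    for dropped in range(max_dropped + 1):
        for combo in itertools.combinations(range(len(tokens)), dropped):
            candidate = " ".join(
                t for i, t in enumerate(tokens) if i not in combo
            )
            if not candidate:
                continue
            results = geocode(candidate)
            if results:
                return candidate, results
    return None, []


def fake_geocode(q):
    """Stand-in geocoder that only knows the address without the postcode."""
    return ["match"] if q == "Stefana Żeromskiego 114 Łódź" else []


print(geocode_with_fallback("Stefana Żeromskiego 114, 90-543 Łódź", fake_geocode))
```

Each retry is another request and the number of candidates grows quickly with `max_dropped`, so this is only practical with a small limit, and a result found this way may be a false match of the kind warned about earlier in the thread.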