rinigus / geocoder-nlp

Geocoder library based on libpostal normalization of libosmscout generated database
MIT License

Bad results of address queries #60

Closed Olf0 closed 2 years ago

Olf0 commented 5 years ago

A basic description of this issue is at TMO (first observed in the context of a "speed comparison" between navigation apps).

Installed maps: BE, LU, NL and many parts of DE
Languages used for address parsing: de, en, fr, lb, nl

OSM Scout Server's log always provides the same (seemingly correct) output while testing address searches (full session.log):

INFO: 15:51:29 Request: /v2/search?search=Vorm+Baum+6
INFO: 15:52:00 Parsed query [DE]: house_number: {6}; road: {vorm baum};
INFO: 15:52:00 Parsed query [DE]: h-0: {vorm baum 6};
INFO: 15:52:33 Parsed query [NL]: house: {vorm}; house_number: {6}; road: {baum};
INFO: 15:52:33 Parsed query [NL]: h-0: {vorm baum 6};
INFO: 15:53:11 Parsed query [LU]: house: {vorm baum 6};
INFO: 15:53:11 Parsed query [LU]: h-0: {vorm baum 6};
INFO: 15:53:14 Parsed query [BE]: house: {vorm baum 6};
INFO: 15:53:14 Parsed query [BE]: h-0: {vorm baum 6};

Running

curl 'http://localhost:8553/v2/search?limit=500&search=Vorm+Baum+6' | fgrep '"admin_region":' | cut -s -f 2 -d ':' | cut -s -f 2 -d '"' | tee osmss_search-l500-Vorm+Baum+6.txt | wc -l

results in 41 hits: 40 from the NL maps and one from a single DE state (the last hit).

Hence with the current limit for the number of hits of 25 (e.g. by using curl -o osmss_search-Vorm+Baum+6.txt 'http://localhost:8553/v2/search?search=Vorm+Baum+6'), only hits from a couple of addresses in NL (a few groups of extremely similar ones) are retrieved (the first 25 of the 41).
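The effect of the limit can also be checked in Python. The following is only a minimal sketch: like the fgrep/cut pipeline above, it assumes nothing about the response beyond each hit carrying an "admin_region" string field, and the sample text is made up for illustration (it is not real server output).

```python
import re
from collections import Counter

# Matches '"admin_region": "<value>"' anywhere in the raw response text,
# mirroring what the fgrep/cut pipeline extracts.
ADMIN_REGION = re.compile(r'"admin_region"\s*:\s*"([^"]*)"')


def admin_regions(response_text):
    """Return the admin_region value of every hit, in response order."""
    return ADMIN_REGION.findall(response_text)


def count_within_limit(regions, limit):
    """Count hits per admin_region among the first `limit` hits."""
    return Counter(regions[:limit])


# Made-up sample (NOT real server output), only to exercise the functions.
sample = """
{"admin_region": "Region A", "title": "x"},
{"admin_region": "Region A", "title": "y"},
{"admin_region": "Region B", "title": "z"}
"""

regions = admin_regions(sample)
print(len(regions))
print(count_within_limit(regions, 2))
```

With real output saved by curl, reading the file and calling count_within_limit(regions, 25) would show which regions survive the default limit.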

While increasing the limit of search hits may appear to be a "quick solution", I have made a couple of observations, which might lead to resolving this properly:

  1. Matching house_number seems to be too "fuzzy search"-style: when one looks for "6", matching "6A" as well is good, but matching e.g. "66" or "666" is not. A suitable RegEx may be ^$search-string_house_number[!0-9].* for filtering house_numbers from the database. I assume this would generally reduce the number of results to less than 25.
  2. Of the many DE state maps downloaded, only a single one seems to be queried. This is actually the "right" one (accidentally?), providing the last and intended hit (number 41), and the one selected in the first line of OSM Scout Server's main window. But this may be unrelated (I have not tried looking at the code) and querying the maps of the other DE states for an address search is just not reported (yet 😉).
  3. It looks as if it might be helpful to order the search results depending on how well they fit the original search string. This would have to be a "fuzzy search"-style match, and offhand I have no idea for a RegEx (or at least a proper metric) for that.
  4. @peterleinchen also reported hits from BE for "Vorm Baum 6", which I do not see. Can anyone confirm this? See.
  5. [related to 2.] LU also provides no hits for me (just as BE; both might be correct), but I expected hits from the maps of other DE states (at least addresses as "close" to the original search string as the hits in NL).
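
For observation 3, one possible metric (offered purely as an illustration, not something geocoder-nlp is known to use) is the similarity ratio from Python's standard difflib module. The candidate strings below are made up, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(query, candidate):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def rank_by_similarity(query, candidates):
    """Sort candidates so that the closest match to the query comes first."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


# Made-up candidates, only to show the ordering step.
candidates = ["Baumstraat 6", "Vorm Baum 6", "Baumweg 66"]
for name in rank_by_similarity("Vorm Baum 6", candidates):
    print(name, round(similarity("Vorm Baum 6", name), 2))
```

An exact match always scores 1.0 and therefore sorts first; how well the ratio separates near-misses on real address data would need to be tested.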
rinigus commented 5 years ago

Let's go through this example.

All parsing is done at the country level. So, for DE, parsing is done with the first of its datasets and then reused. Hence the parsing for DE is reported only once, I believe.

When you look at the parsing results, BE and LU resolved only one level, which has to be hit as a substring in the address. Hence no hits.

DE parsing has two levels, NL three, with the hierarchy as shown (from the smallest level to the more general one).

Currently, all search results are sorted by the number of levels that were caught in the database. If 2 levels were found, all results with 1 level found are discarded. In this case, NL had a chance since it essentially looked for "6, Baum" while DE was "6, Vorm Baum". As we have many streets named something like "Dr. John Brown" and people search for "brown", all streets with "Baum" at the beginning of one of the words were a hit.
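
The two rules described above can be sketched as follows. This is a simplified illustration in Python, not the geocoder's actual code: the function names and the "levels" field are hypothetical, and the real matching runs against the database rather than on plain strings.

```python
def matches_at_word_start(term, name):
    """True if `term` occurs in `name` starting at the beginning of a word."""
    words = name.lower().split()
    term = term.lower()
    return any(" ".join(words[i:]).startswith(term) for i in range(len(words)))


def keep_best_levels(results):
    """Keep only the results with the highest number of matched levels."""
    if not results:
        return []
    best = max(r["levels"] for r in results)
    return [r for r in results if r["levels"] == best]


# "brown" hits "Dr. John Brown"; likewise "baum" hits any street
# that has a word starting with "Baum".
print(matches_at_word_start("brown", "Dr. John Brown"))  # True
print(matches_at_word_start("baum", "Baumstraat"))       # True
print(matches_at_word_start("vorm baum", "Baumstraat"))  # False

# Hypothetical results: once a 2-level match exists, 1-level matches are dropped.
results = [{"name": "a", "levels": 2}, {"name": "b", "levels": 1}]
print(keep_best_levels(results))
```

This shows why a query reduced to "Baum" is much more permissive than one that keeps "Vorm Baum" as the road name.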

There are a few other ways results are sorted, to ensure that the city of Glasgow will come before the pub with the same name. Only after that is the closeness of the match considered. That is probably the reason for putting DE results below the NL ones (admin_levels was lower in NL than in DE).
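
A tiered ordering of this kind can be expressed as a composite sort key. The sketch below is only an illustration in Python: the field names (levels, type_rank, distance) and their values are hypothetical and do not reflect the geocoder's actual data model or its exact criteria.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    title: str
    levels: int      # number of matched hierarchy levels (more is better)
    type_rank: int   # hypothetical importance rank: 0 = city, larger = minor object
    distance: float  # hypothetical closeness measure: 0.0 = exact match


def sort_hits(hits):
    """Order by matched levels, then object importance, then closeness of match."""
    return sorted(hits, key=lambda h: (-h.levels, h.type_rank, h.distance))


# Made-up hits: the city sorts before the pub even though both match exactly.
hits = [
    Hit("Glasgow (pub)", levels=1, type_rank=5, distance=0.0),
    Hit("Glasgow (city)", levels=1, type_rank=0, distance=0.0),
    Hit("Glasgow Road", levels=1, type_rank=0, distance=0.4),
]
for hit in sort_hits(hits):
    print(hit.title)
```

Because closeness is the last component of the key, a very close match can still end up below a looser one that ranks higher on the earlier criteria, which is consistent with the ordering seen in this issue.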

As for 66 matching 6: we also have 6-2 and other combinations. So I am not sure we can have a very simple regex for it.

Right now, the geocoder doesn't check which of the hierarchy levels matched the parsed string. While it wouldn't save us this time, I would have to look for a better matching strategy. This would require a major rewrite of the import, generation of training sets for the libpostal NLP parser, and a rewrite of the search. Hopefully, I can do that in 2019.

In addition, DE should get country as a part of the hierarchy. Then you could at least specify Germany in the string and get your perfect hit. Right now, the sub-territories do miss that information, unfortunately.

peterleinchen commented 5 years ago

BE was only searched but did not yield a result. Sorry for the FUD. I corrected my answer on TMO.

Olf0 commented 5 years ago

[...] As for 66 matching 6: we also have 6-2 and other combinations. So, not sure we can have very simple regex for it.

The one I suggested as a starting point (^$search-string_house_number[!0-9].*) avoids over-filtering (but fails finding "6-6" with the search string "6-") and can be extended.

Reducing the numbers of logically extremely similar, but not intended results (e.g. the house_number search string "6" currently matching all 6.*) to bring a larger variety of really different results into the top 25 hits IMO appears to be the easiest fix with the current state of the infrastructure (as the intended hit is indeed among the results in this example when using a higher query limit, but not found due to being at position 41).

Thank you for planning to rework the infrastructure over a longer timeframe. It is a bit unfortunate that a "speed comparison" resulted in pointing at a structural flaw.

Olf0 commented 5 years ago

Currently, all search results are sorted by number of levels that were caught in the database. If [n] levels were found, all results [from] level[s n-1 and lower] are discarded.

That makes sense to me as a "per country" measure. So only a single level remains for each country, if I understand correctly. But it sounds as if the results from countries with a lower (or higher; I am not sure if I have understood the order correctly) final level should be placed first (currently they seem to have the opposite order).

Olf0 commented 5 years ago

The one I suggested as a starting point (^$search-string_house_number[!0-9].*) avoids over-filtering (but fails finding "6-6" with the search string "6-") and can be extended.

What about (in Bourne Shell syntax)

if echo "$search_string_house_number" | grep -q '[0-9]$' && echo "$returned_house_number" | grep -q "^${search_string_house_number}[0-9]"
then discard_returned_address  # placeholder for dropping this hit
fi

? Side note: This example is flawed when search_string_house_number contains characters that are interpreted by grep (e.g. ".", "*"), but I would expect that not to be an issue when using proper string matching in a more advanced programming language.
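
The same rule with plain string matching, which sidesteps the grep metacharacter problem, could look like this in Python. It is a minimal sketch of the proposal in this thread, not code from geocoder-nlp, and the function name is hypothetical.

```python
def house_number_matches(query, candidate):
    """Prefix match that refuses to extend a trailing digit with another digit.

    "6" matches "6", "6A" and "6-2", but not "66" or "666".
    """
    query = query.strip().lower()
    candidate = candidate.strip().lower()
    if not candidate.startswith(query):
        return False
    rest = candidate[len(query):]
    # If the query ends in a digit and the candidate continues with a digit,
    # the candidate is a different number (e.g. "66" for the query "6").
    if query and query[-1].isdigit() and rest[:1].isdigit():
        return False
    return True


for number in ["6", "6A", "6-2", "66", "666"]:
    print(number, house_number_matches("6", number))
```

Since no regular expression is involved, a query such as "6." is compared literally and does not match "66". A query ending in a non-digit, such as "6-", still finds "6-6".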

rinigus commented 5 years ago

I will look into it in due course.

It is a bit unfortunate that a "speed comparison" resulted into pointing at a structural flaw.

Maybe adding a string with the region can be used to compare that part of the performance again (ensuring that it works correctly everywhere).

rinigus commented 2 years ago

As the import has been reworked for the geocoder, and starting from OSM Scout Server 3.0 search ranking will be used internally, I am going to close this issue. Let's review the situation when 3.0 is out and file issues against that search implementation.