osm-search / Nominatim

Open Source search based on OpenStreetMap data
https://nominatim.org
GNU General Public License v3.0

Search for Japan city details using Japanese characters does not lead to correct results via API #3484

Open patelmm79 opened 1 month ago

patelmm79 commented 1 month ago

What did you search for?

https://nominatim.openstreetmap.org/search?format=jsonv2&addressdetails=1&limit=10&namedetails=2&polygon_geojson=0&extratags=1&city=%E4%BB%99%E5%8F%B0

What result did you get?

Xiantai, Pingdingshan, Henan, China. This was the only result.

Including Japan as country yields 0 results. https://nominatim.openstreetmap.org/search?format=jsonv2&addressdetails=1&limit=10&namedetails=2&polygon_geojson=0&extratags=1&city=%E4%BB%99%E5%8F%B0&country=%E6%97%A5%E6%9C%AC

What result did you expect?

Sendai, Japan. Searching for the city name in the Nominatim UI returns a relevant result for the railway location:

https://nominatim.openstreetmap.org/ui/details.html?osmtype=N&osmid=3570916502&class=railway

This is the place node for Sendai:

https://nominatim.openstreetmap.org/ui/details.html?osmtype=N&osmid=752184864&class=place
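
To make the report easier to reproduce, here is a minimal sketch that rebuilds the two structured search URLs from above. The parameter names (`format`, `addressdetails`, `limit`, `city`, `country`) are taken from the URLs in the report; the helper name `build_search_url` is made up for illustration, and the sketch only builds the URL, it does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(city, country=None):
    """Build a structured search URL like the ones in the report.

    `build_search_url` is a hypothetical helper, not part of Nominatim.
    """
    params = {
        "format": "jsonv2",
        "addressdetails": 1,
        "limit": 10,
        "city": city,
    }
    if country is not None:
        params["country"] = country
    # urlencode percent-encodes non-ASCII values as UTF-8 by default.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("仙台"))
print(build_search_url("仙台", country="日本"))
```

The first URL corresponds to the query that returned only the result in Henan, the second to the query that returned nothing.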

mtmail commented 1 month ago

The issue seems to be that the city Sendai has the name '仙台市' in the OSM data (https://www.openstreetmap.org/node/752184864) and cannot be found when searching for '仙台'. The '市' suffix stands for 'city'. That's quite common with regional names in Japan. We should check how common it is, and could add another database entry with the suffix removed.

Adding @miku0 who worked on such a list https://github.com/miku0/Nominatim/blob/soft_phrase2/nominatim/api/search/icu_tokenizer_japanese.py#L23 in the past. It's part of an older PR https://github.com/osm-search/Nominatim/pull/3158
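
The idea of adding another entry with the suffix removed could be sketched roughly as follows. This is only an illustration, not how Nominatim generates name variants: the function `name_variants` and the constant `ADMIN_SUFFIXES` are hypothetical, and the suffix list contains only '市' because that is the only suffix named in this thread.

```python
# Hypothetical suffix list; only '市' (city) is mentioned in the discussion.
ADMIN_SUFFIXES = ("市",)


def name_variants(name, suffixes=ADMIN_SUFFIXES):
    """Return the name plus a copy with a known trailing suffix removed."""
    variants = [name]
    for suffix in suffixes:
        # Skip names that consist of the suffix alone, so no empty variant is created.
        if name.endswith(suffix) and len(name) > len(suffix):
            variants.append(name[: -len(suffix)])
    return variants


print(name_variants("仙台市"))  # ['仙台市', '仙台']
print(name_variants("仙台"))    # ['仙台']
```

With both variants indexed, a search for '仙台' would have something to match against the node named '仙台市'.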

lonvia commented 1 month ago

We used to have a special rule in the old tokenizer for those suffixes. We can probably bring it back, either via a sanitizer or via variants. That depends a bit on how frequently the characters appear as suffixes in other contexts.

Miku's PR is rather for splitting addresses into words. Not quite the same but related.
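
One way to get a feeling for how often a character appears as a suffix in other contexts would be to count name endings per object type. The sketch below assumes the names are already available as `(name, kind)` pairs; the function `suffix_counts_by_kind` is hypothetical, and the sample records are a hand-written illustration, not real statistics from OSM.

```python
from collections import Counter


def suffix_counts_by_kind(records, suffixes):
    """Count how often each suffix ends a name, grouped by object kind."""
    counts = Counter()
    for name, kind in records:
        for suffix in suffixes:
            # Ignore names that are nothing but the suffix itself.
            if name.endswith(suffix) and len(name) > len(suffix):
                counts[(suffix, kind)] += 1
    return counts


# Illustrative sample only; a real check would run over the imported data.
sample = [
    ("仙台市", "city"),
    ("横浜市", "city"),
    ("朝市", "marketplace"),
    ("東京都", "state"),
]

for (suffix, kind), n in sorted(suffix_counts_by_kind(sample, ["市"]).items()):
    print(suffix, kind, n)
```

If a suffix turns up mostly on one kind of object, stripping it there is probably safe; if it is spread across many kinds, a blanket rule is more likely to create false matches.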