pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

Indexing and normalisation of Cyrillic characters #1636

Open taygun opened 1 year ago

taygun commented 1 year ago

Describe the bug

When searching for the address ("Олега Оникієнка вулиця 77а") of this OSM place, no results are returned. The issue seems to be caused by the fact that the address is indexed with the Cyrillic "a". If the search query contains the Cyrillic character "a", the above address is returned.

Steps to Reproduce

Steps to reproduce the behavior:

1. No results returned when searched with Latin Small Letter A: pelias.github.io
2. Result returned when searched with Cyrillic Small Letter A: pelias.github.io

Expected behavior

The address should be returned when searching with the Latin character.
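
For readers unfamiliar with the distinction, here is a minimal Python sketch (not part of the original report) showing that the two letters look alike but are different code points, and that standard Unicode normalisation alone does not map one onto the other. The variable names are illustrative only.

```python
import unicodedata

latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

# Print the code point and official Unicode name of each character.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The characters render almost identically but do not compare equal,
# so a term indexed as "77" + cyrillic_a will not match "77" + latin_a.
print(latin_a == cyrillic_a)

# Compatibility normalisation (NFKD) leaves the Cyrillic letter unchanged,
# so normalising the query is not enough to make the two forms match.
print(unicodedata.normalize("NFKD", cyrillic_a) == cyrillic_a)
```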

missinglink commented 1 year ago

Hmm yes I can confirm the issue you are seeing, it seems to be affecting queries to the /v1/autocomplete endpoint but not the /v1/search endpoint, which helps narrow down the scope.

We use the icu-folding filter in elasticsearch to 'fold' the Cyrillic form to the Latin form.

It seems as though we are using this filter correctly in all of the analyzers, with the exception of peliasHousenumber which has a numeric character filter, and so it doesn't apply.

I'm not really sure what's going on here, the expected behaviour is that we fold Cyrillic to ASCII for precisely this purpose.
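
One way to check what the filter actually emits is to run both spellings through the Elasticsearch `_analyze` API and compare the tokens. The following is only a rough sketch, under several assumptions: Elasticsearch is reachable at `http://localhost:9200`, the `analysis-icu` plugin is installed (it provides the token filter under the name `icu_folding`), and an ad-hoc `standard` tokenizer is used instead of the real Pelias analyzers. The helper names `build_analyze_body` and `analyze` are hypothetical.

```python
import json
import urllib.request


def build_analyze_body(text, filters=("icu_folding",)):
    """Build an ad-hoc _analyze request body (standard tokenizer + given filters)."""
    return {"tokenizer": "standard", "filter": list(filters), "text": text}


def analyze(text, base_url="http://localhost:9200", filters=("icu_folding",)):
    """POST to /_analyze and return the emitted tokens.

    base_url is an assumption; point it at your own cluster.
    """
    body = json.dumps(build_analyze_body(text, filters)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    return [t["token"] for t in payload["tokens"]]


# Example usage (requires a running cluster, so it is left commented out):
# print(analyze("77\u0430"))  # house number ending in Cyrillic а
# print(analyze("77a"))       # house number ending in Latin a
```

If the two calls return different tokens, the filter is not mapping the Cyrillic letter onto the Latin one for this input, which would be consistent with the behaviour reported above.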

orangejulius commented 1 year ago

Ah, very nice discovery @missinglink. I think we originally discovered this issue back in https://github.com/pelias/pelias/issues/833 but never narrowed down the cause.

It feels like adding the icu-folding filter is relatively safe, maybe we should try that out?