openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Fix search with Umlauts in German vs. English language #10283

Open goerlitz opened 4 months ago

goerlitz commented 4 months ago

What

When searching for German products with search terms containing umlauts (ä, ö, ü, ß), the number of search results differs depending on whether the selected language is English or German.

Steps to reproduce the behavior

Search term without Umlauts -> same results:

Search term containing Umlaut -> different results:

It seems that the transliteration ä->ae, ö->oe, ü->ue, ß->ss used for indexing and search is handled differently in German vs. English.
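To illustrate why the two handlings can disagree, here is a minimal Python sketch. It is not the server's code (the server is written in Perl), and the function names and mapping table are hypothetical, built only from the transliteration listed above. It compares German-style transliteration with plain diacritic stripping:

```python
import unicodedata

# Hypothetical table, copied from the transliteration named in this issue.
GERMAN_TRANSLITERATION = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate_german(text: str) -> str:
    """Lowercase the text and replace ä/ö/ü/ß with ae/oe/ue/ss."""
    return "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in text.lower())


def strip_diacritics(text: str) -> str:
    """Lowercase the text and drop combining marks (ü -> u, â -> a)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("Müsli", "Pâté"):
        print(word, transliterate_german(word), strip_diacritics(word))
```

If the index were built with one scheme and the query normalized with the other, "Müsli" would be stored under one key and looked up under a different one, which would explain a difference in result counts.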

Expected behavior

The search should return the same results regardless of the selected language and of whether umlauts are used (like "Pâté" <-> "pate" in French).

goerlitz commented 4 months ago

https://github.com/openfoodfacts/openfoodfacts-server/issues/455 might be related.

stephanegigandet commented 4 months ago

@goerlitz We currently index words differently depending on language. For languages like French and English, users commonly drop accents, and there are very few conflicts (where removing accents gives an existing but different word). So when we index and query in English or French, we match both accented and unaccented.

For German, umlauts are more meaningful and are very rarely removed, so we keep them, in both queries and index. But that means searching in English (in the de-en domain) for accented words will not work well with products in German. If you want to search for German products, with German words in the query, then you should indicate that the query is in German (using lc=de or using the de domain).
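The behavior described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the actual Perl implementation: the set of accent-folding languages, the in-memory index, and all function names are hypothetical.

```python
import unicodedata

# Assumption: only these languages fold accents; German keeps them.
FOLDING_LANGUAGES = {"en", "fr"}


def fold_accents(word: str) -> str:
    """Drop combining marks, e.g. 'pâté' -> 'pate'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(word: str, lc: str) -> str:
    """Normalize a word according to the language code it is handled in."""
    word = unicodedata.normalize("NFC", word.lower())
    return fold_accents(word) if lc in FOLDING_LANGUAGES else word


def build_index(products):
    """Index each product name under keys normalized with the product's language."""
    index = {}
    for name, lc in products:
        for word in name.split():
            index.setdefault(normalize(word, lc), set()).add(name)
    return index


def search(index, query: str, lc: str):
    """Return the products that match every query word, normalized with the query language."""
    matches = [index.get(normalize(word, lc), set()) for word in query.split()]
    return set.intersection(*matches) if matches else set()


if __name__ == "__main__":
    index = build_index([("Käse Brötchen", "de"), ("Pâté de campagne", "fr")])
    print(search(index, "Käse", "de"))  # query language matches the product language
    print(search(index, "Käse", "en"))  # query is folded to 'kase', index holds 'käse'
    print(search(index, "pate", "fr"))  # accents folded on both sides
```

In this sketch the German product is stored under `käse`, so a query normalized as English (`kase`) finds nothing, while the same query with `lc=de` matches.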

Note that we will have a new search backend soon, which might behave differently. Check out the #search channel on Slack.