openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Dutch/Estonian: dash is not always a separator #6122

Open aleene opened 2 years ago

aleene commented 2 years ago

Describe the bug

The dash should not always be interpreted as a separator. In Dutch, a trailing dash is a way to avoid repeating a shared word part.

Other examples:

To Reproduce

See: https://nl.openfoodfacts.org/product/8718907369589/bloemenhoning-albert-heijn

Expected behavior

For instance, "EU- en niet-EU-honing" should not be split into "EU" and "niet-EU-honing", but should be left untouched. In Dutch this is read as "EU-honing, niet-EU-honing".

Additional context

A parse rule could be based on the surroundings of the dash:
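
As a rough illustration of such a rule (a sketch only, not the server's actual Perl parser): treat a dash as a separator only when it has whitespace before it and is not followed by a coordinating conjunction. The function name `split_on_dash` and the per-language conjunction lists below are assumptions made for this example.

```python
import re

# Hypothetical per-language conjunction lists, for illustration only.
CONJUNCTIONS = {
    "nl": ("en", "of"),    # Dutch: "and", "or"
    "et": ("ja", "või"),   # Estonian: "and", "or"
}


def split_on_dash(text: str, lang: str) -> list[str]:
    """Split on ' - ' separators, leaving suspended hyphens intact.

    A dash glued to the preceding word ("EU- en ...") never matches,
    because the pattern requires whitespace before the dash. A spaced
    dash followed by a conjunction ("EU - en ...") is also kept.
    """
    conjunctions = CONJUNCTIONS.get(lang, ())
    if conjunctions:
        alternatives = "|".join(re.escape(c) for c in conjunctions)
        # Negative lookahead: no "<spaces><conjunction><space>" after the dash.
        pattern = rf"\s+-(?!\s*(?:{alternatives})\s)\s+"
    else:
        # Unknown language: fall back to plain spaced-dash splitting.
        pattern = r"\s+-\s+"
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


print(split_on_dash("suiker - EU- en niet-EU-honing - zout", "nl"))
# ['suiker', 'EU- en niet-EU-honing', 'zout']
print(split_on_dash("gemengde EU - en niet-EU-honing", "nl"))
# ['gemengde EU - en niet-EU-honing']
```

This only decides where to split. It does not expand "EU- en niet-EU-honing" into two ingredients, which matches the expected behaviour of leaving the text untouched.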

Number of products impacted

Happens quite often.

Part of

stephanegigandet commented 2 years ago

It's strange, if I remove the extra space in "gemengde EU - en niet-EU-honing" (before the first EU), it's added back.

aleene commented 2 years ago

Another example:

aleene commented 2 years ago

In Estonian the same language construction is used:

See also https://et.wiktionary.org/wiki/flavoring. Both mean flavouring of some sort.

aleene commented 2 years ago

Another interesting product: https://nl.openfoodfacts.org/product/8712800025665/brownie-mona. In this case an ingredient is not parsed at all.

alexgarel commented 2 years ago

Oops, sorry, I inadvertently removed the ingredients on the above product, but restored them afterwards!