openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

bug: traces are not correctly parsed #9616

Open benbenben2 opened 10 months ago

benbenben2 commented 10 months ago

What

.. antioxydant: extrait riche en tocophérols. Peut contenir des traces d'autres céréales contenant du gluten (blé, seigle, orge), d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja.

The parser:
1) puts blé as a sub-ingredient of antioxydant
2) does not recognize d'autres fruits à coque (autres fruits à coque is in the taxonomy and d' is in the stopwords)
3) parses de lait et de soja as soj-milk

Steps to reproduce the behavior

https://world.openfoodfacts.org/product/4056489601913/granola-premier-super-nutty-new-crownfield

  1. Click on 'Details of the analysis of the ingredients'

Expected behavior

1) blé is not a sub-ingredient of antioxydant
2) d'autres fruits à coque should be recognized
3) de lait et de soja should be parsed as milk and soy

Additional context

Notes for 2). I tried:
- stopword, escaping the single quote (d\'autre): did not work
- adding "d'autres fruits à coque" as a synonym of "fr:fruits à coque": did not work
- replacing "d'autres fruits à coque" with "fruits à coque": worked
- replacing "d'autres fruits à coque" with "autres fruits à coque": worked
- many variants such as "d\'autres fruits à coque" and "d-autres-fruits-a-coque": did not work
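One of the replacements noted above as working amounts to dropping the leading "d'" before the taxonomy lookup. A minimal Python sketch of that idea follows. It is not the server's Perl code: `KNOWN_ENTRIES`, `strip_leading_particle` and `lookup` are hypothetical names, and the particle list is an assumption made for illustration.

```python
import re

# Hypothetical mini-taxonomy for illustration; the real ingredients taxonomy is far larger.
KNOWN_ENTRIES = {"autres fruits à coque", "lait", "soja"}

# Assumed list of leading French particles to drop: "d'", "de ", "du ", "des ".
# Both the straight (') and the typographic (’) apostrophe are accepted.
LEADING_PARTICLE = re.compile(r"^(?:d['’]\s*|de\s+|du\s+|des\s+)", re.IGNORECASE)


def strip_leading_particle(text: str) -> str:
    """Remove one leading particle such as "d'" or "de " from an ingredient string."""
    return LEADING_PARTICLE.sub("", text.strip(), count=1)


def lookup(text: str):
    """Return the matching entry of the toy taxonomy, or None if the text is not recognized."""
    candidate = strip_leading_particle(text).lower()
    return candidate if candidate in KNOWN_ENTRIES else None
```

With this toy taxonomy, `lookup("d'autres fruits à coque")` finds the "autres fruits à coque" entry, while an unknown string such as "orge" returns `None`.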

alexgarel commented 10 months ago

@benbenben2 I'm not sure about your point 1

  1. blé is not a subingredient of antioxydant

Why not? I see "antioxydant (extrait riche en tocophérols, blé)"; parentheses are for sub-ingredients, aren't they?

benbenben2 commented 10 months ago

> @benbenben2 I'm not sure about your point 1
>
>   1. blé is not a subingredient of antioxydant
>
> Why not? I see "antioxydant (extrait riche en tocophérols, blé)"; parentheses are for sub-ingredients, aren't they?

My previous message was a bit unclear.

The ingredients list is:

antioxydant: extrait riche en tocophérols. Peut contenir des traces d'autres céréales contenant du gluten (blé, seigle, orge), d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja.

and the parsed list is:

antioxydant (extrait riche en tocophérols, blé), seigle, orge, d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja

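Blé should not become a sub-ingredient of antioxydant: it belongs to the "Peut contenir des traces" (may contain traces) sentence, not to the ingredients list.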
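To make the expected result concrete, here is a minimal Python sketch that separates the traces sentence from the ingredients before any sub-ingredient parsing. It is only an illustration, not how the server (written in Perl) does it. `TRACES_MARKER` and `split_ingredients_and_traces` are hypothetical names, and the marker covers only the exact French phrase used on this product.

```python
import re

# Assumed marker for the French "may contain traces" sentence; real labels vary widely.
TRACES_MARKER = re.compile(r"\.?\s*Peut contenir des traces\s+", re.IGNORECASE)


def split_ingredients_and_traces(text: str):
    """Return (ingredients_text, traces_text); traces_text is "" when no marker is found."""
    parts = TRACES_MARKER.split(text, maxsplit=1)
    ingredients = parts[0].strip()
    traces = parts[1].strip().rstrip(".") if len(parts) > 1 else ""
    return ingredients, traces


text = (
    "antioxydant: extrait riche en tocophérols. Peut contenir des traces "
    "d'autres céréales contenant du gluten (blé, seigle, orge), "
    "d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja."
)
ingredients, traces = split_ingredients_and_traces(text)
print(ingredients)  # the antioxydant part only, without blé
print(traces)       # everything after "Peut contenir des traces"
```

Once the two parts are separated, the parenthesis "(blé, seigle, orge)" can only attach to "d'autres céréales contenant du gluten", never to antioxydant.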

benbenben2 commented 10 months ago

For 3), this seems to be connected to the stopwords listed at the top of the ingredients taxonomy file.

"et" is listed as a stopword. When this runs: canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $before); it removes the " et ", so the two ingredients are merged into one.

I could not find exactly where in the function this happens, but I think "et" (and) should not be treated as a stopword when it is in the middle of the ingredient (between two ingredients), and should still be treated as a stopword when it is at the beginning or end of the ingredient.
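The rule proposed above could be sketched as follows, in Python for brevity (the server itself is Perl, and this is not its implementation). `STOPWORDS`, `trim_edge_stopwords` and `split_on_conjunction` are hypothetical names, and the stopword set is an assumed subset. A real implementation would also need to protect taxonomy entries that legitimately contain "et".

```python
import re

# Assumed subset of the French stopwords; the actual list lives in the ingredients taxonomy file.
STOPWORDS = {"et", "de", "du", "des"}


def trim_edge_stopwords(piece: str) -> str:
    """Drop stopwords at the beginning and end of a piece, keeping those in the middle."""
    words = piece.split()
    while words and words[0].lower() in STOPWORDS:
        words.pop(0)
    while words and words[-1].lower() in STOPWORDS:
        words.pop()
    return " ".join(words)


def split_on_conjunction(text: str):
    """Treat " et " between two ingredients as a separator, then trim stopwords at the edges."""
    pieces = re.split(r"\s+et\s+", text.strip())
    trimmed = [trim_edge_stopwords(p) for p in pieces]
    return [t for t in trimmed if t]
```

With this sketch, "de lait et de soja" yields two separate entries, lait and soja, instead of one merged string, while "noix de cajou" is left intact because its "de" is in the middle.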