Open benbenben2 opened 10 months ago
@benbenben2 I'm not sure about your point 1
- blé is not a subingredient of antioxydant
Why not? I see "antioxydant (extrait riche en tocophérols, blé)"; parentheses are for sub-ingredients, aren't they?
My previous message was a bit unclear.
The ingredients list is:
antioxydant: extrait riche en tocophérols. Peut contenir des traces d'autres céréales contenant du gluten (blé, seigle, orge), d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja.
and the parsed list is:
antioxydant (extrait riche en tocophérols, blé), seigle, orge, d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja
Blé appears only in the "Peut contenir des traces" (may contain traces) sentence, so it should not become a sub-ingredient of antioxydant.
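One possible way to avoid this is to separate the traces sentence from the ingredients before any parenthesis handling runs. The following is a minimal Python sketch of that idea only, not the actual Perl parser: the `TRACES_MARKER` pattern and the function name `split_ingredients_and_traces` are assumptions made for illustration.

```python
import re

# Hypothetical marker; the real parser may recognize other trace phrases.
TRACES_MARKER = re.compile(r"\.?\s*peut contenir des traces\s+", re.IGNORECASE)

def split_ingredients_and_traces(text):
    """Return (ingredients_part, traces_part); traces_part is '' without a marker."""
    parts = TRACES_MARKER.split(text, maxsplit=1)
    ingredients = parts[0].strip()
    traces = parts[1].strip() if len(parts) > 1 else ""
    return ingredients, traces

text = (
    "antioxydant: extrait riche en tocophérols. Peut contenir des traces "
    "d'autres céréales contenant du gluten (blé, seigle, orge), "
    "d'autres fruits à coque (noix de cajou, noix de pécan), "
    "de lait et de soja."
)
ingredients, traces = split_ingredients_and_traces(text)
```

With the two parts separated, blé stays in the traces part and can no longer be attached to antioxydant.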
For 3., this seems to be connected to the stopwords listed at the top of the ingredients taxonomy file.
"et" is listed as a stopword.
When this runs:
canonicalize_taxonomy_tag($ingredients_lc, "ingredients", $before);
it removes the " et ".
I could not find exactly where in the function this happens, but I think we should not treat "et" (and) as a stopword when it is in the middle of the text (between 2 ingredients), while still treating it as a stopword when it is at the beginning or end of an ingredient.
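The proposal above could look roughly like this. It is a Python sketch of the idea, not the logic inside `canonicalize_taxonomy_tag`; the `STOPWORDS` set is an assumed subset of the real list, and all function names are hypothetical.

```python
import re

# Assumed subset of the French stopwords from the ingredients taxonomy file.
STOPWORDS = {"et", "de", "des", "du"}

def split_on_conjunction(text):
    """Split on an 'et' that sits between two items, e.g. 'de lait et de soja'."""
    return [item.strip() for item in re.split(r"\s+et\s+", text) if item.strip()]

def strip_edge_stopwords(item):
    """Drop stopwords only at the beginning or end of a single item."""
    words = item.split()
    while words and words[0] in STOPWORDS:
        words.pop(0)
    while words and words[-1] in STOPWORDS:
        words.pop()
    return " ".join(words)

def parse(text):
    """Split first, then strip edge stopwords from each resulting item."""
    return [strip_edge_stopwords(item) for item in split_on_conjunction(text.lower())]

result = parse("de lait et de soja")
```

Here "de lait et de soja" yields two items, lait and soja, instead of one merged item. A caveat: splitting on every " et " would also break ingredient names that legitimately contain "et", so a real fix would probably need to check the taxonomy for the unsplit name first.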
What
For the ingredients list:
antioxydant: extrait riche en tocophérols. Peut contenir des traces d'autres céréales contenant du gluten (blé, seigle, orge), d'autres fruits à coque (noix de cajou, noix de pécan), de lait et de soja.
the parser:
1) puts blé as a sub-ingredient of antioxydant
2) does not recognize d'autres fruits à coque (autres fruits à coque is in the taxonomy and d' is in the stopwords)
3) parses de lait et de soja as soj-milk
Steps to reproduce the behavior
https://world.openfoodfacts.org/product/4056489601913/granola-premier-super-nutty-new-crownfield
Expected behavior
1) blé is not a subingredient of antioxydant
2) d'autres fruits à coque should be recognized
3) de lait et de soja should be parsed as milk and soy
Additional context
Notes for 2). I tried:
- adding a stopword with the single quote escaped (d\'autre): did not work
- adding "d'autres fruits à coque" as a synonym of "fr:fruits à coque": did not work
- replacing "d'autres fruits à coque" by "fruits à coque": worked
- replacing "d'autres fruits à coque" by "autres fruits à coque": worked
- many variants such as "d\'autres fruits à coque" and "d-autres-fruits-a-coque": did not work
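Since replacing the text by "autres fruits à coque" worked, one option is to strip the elided prefix before the taxonomy lookup. This is a rough Python sketch under that assumption: `KNOWN_NAMES` is a hypothetical stand-in for the taxonomy, and `ELIDED_PREFIX` is an illustrative pattern, not something taken from the real code.

```python
import re

# Illustrative pattern for an elided d' or l' at the start of a name.
ELIDED_PREFIX = re.compile(r"^(?:d|l)['’]\s*", re.IGNORECASE)

# Hypothetical stand-in for the names known to the ingredients taxonomy.
KNOWN_NAMES = {"autres fruits à coque", "fruits à coque"}

def normalize(name):
    """Lowercase the name and remove a leading elided article or preposition."""
    return ELIDED_PREFIX.sub("", name.strip().lower())

def is_recognized(name):
    """Check the normalized name against the stand-in taxonomy."""
    return normalize(name) in KNOWN_NAMES

recognized = is_recognized("d'autres fruits à coque")
```

Names that merely start with the letter d, such as dattes, are left unchanged because the pattern requires the apostrophe.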