openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Improve ingredients parsing and taxonomy, reduce number of unknown ingredients #2023

Open stephanegigandet opened 5 years ago

stephanegigandet commented 5 years ago

I'm going to try to significantly reduce the number of unknown ingredients for products that have correct ingredients lists (e.g. product data from manufacturers). This will allow better NOVA computation, but also new vegetarian / vegan recognition etc.

I'm going to use the Scamark / Leclerc import as a benchmark.

Today we have 5220 products with 5764 ingredients; 4035 of those ingredients are unknown (not in the taxonomy): https://fr.openfoodfacts.org/editeur/scamark/ingredients

There are 2 main ways to improve this:

  1. adding relevant ingredients to the taxonomy
  2. improving ingredient parsing (decompounding compound ingredients, extracting ingredient properties like origin, quality etc.)
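
As a rough illustration of the second point, here is a minimal Python sketch of decompounding and origin extraction. It is not the server's parser (which is written in Perl); the function names, the output structure and the simple "origine ..." pattern are all assumptions made for this example.

```python
import re

# Hypothetical patterns, for illustration only.
# "chocolat (sucre, beurre de cacao)" -> compound name + inner list
COMPOUND_RE = re.compile(r"^(?P<name>[^()]+?)\s*\((?P<inner>.+)\)$")
# "tomates origine France" -> name + origin property
ORIGIN_RE = re.compile(r"^(?P<name>.+?)\s+origine\s*:?\s*(?P<origin>.+)$", re.IGNORECASE)


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_ingredient(text):
    """Return a dict with the name and, when found, sub-ingredients or origin."""
    text = text.strip()
    compound = COMPOUND_RE.match(text)
    if compound:
        inner = split_top_level(compound.group("inner"))
        return {
            "name": compound.group("name").strip(),
            "ingredients": [parse_ingredient(part) for part in inner],
        }
    origin = ORIGIN_RE.match(text)
    if origin:
        return {
            "name": origin.group("name").strip(),
            "origin": origin.group("origin").strip(),
        }
    return {"name": text}


def parse_list(text):
    """Parse a full comma-separated ingredients list."""
    return [parse_ingredient(part) for part in split_top_level(text)]
```

With this sketch, `parse_list("chocolat (sucre, beurre de cacao), tomates origine France, sel")` yields a compound `chocolat` with two sub-ingredients, `tomates` with origin `France`, and plain `sel`. Only the extracted names would then need to exist in the taxonomy.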

aleene commented 5 years ago

Scanning through it, many are not ingredients, but things like traces. So try to parse those out first.

The first one is at position 71: crème pasteurisé. I can start with these. It is a nice list for testing the parser. The real issues appear at the bottom.

aleene commented 5 years ago

PS: can you accept a few pull requests? I am waiting for them to be accepted before continuing.

aleene commented 5 years ago

proteines de {allergene}lait{/allergene}
farine de {allergene}ble{/allergene}
amidon modifié de maïs et/ou de pomme de terre
proteines de {allergene}soja{/allergene} rehydratees
fibres de {allergene}ble{/allergene}
{allergene}oeufs{/allergene} entiers frais
https://fr.openfoodfacts.org/editeur/scamark/ingredient/alcohol-denat
https://fr.openfoodfacts.org/editeur/scamark/ingredient/parfum
https://fr.openfoodfacts.org/editeur/scamark/ingredient/lt
https://fr.openfoodfacts.org/editeur/scamark/ingredient/zu:eau

aleene commented 5 years ago

In fact it is not too bad. The important ingredients that occur often seem to be covered; just some synonyms need to be added.

aleene commented 5 years ago

I added the ingredients here: #2027 (still work in progress)

aleene commented 5 years ago

Why is this one not recognised? It is in the taxonomy. Accent removed too early?: purée de tomate
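
To make the accent question concrete, here is a small Python sketch of tag normalization. It is an assumption about the general technique, not the server's actual code: the helper `to_tag` is hypothetical. The point it illustrates is that a lookup only succeeds when the taxonomy entry and the parsed string go through the same normalization, in the same order.

```python
import re
import unicodedata


def to_tag(text):
    """Lowercase, strip accents, and join words with hyphens (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop combining marks, so "é" becomes "e"
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", unaccented).strip("-")


# Taxonomy keyed by normalized tag: both spellings resolve to the same entry.
taxonomy = {to_tag(name): name for name in ["purée de tomate"]}

print(taxonomy.get(to_tag("Purée de tomate")))
print(taxonomy.get(to_tag("puree de tomate")))
```

If one side were normalized and the other compared with accents intact, `purée de tomate` and `puree de tomate` would look like two different ingredients.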

stephanegigandet commented 5 years ago

Current stats: https://fr.openfoodfacts.org/editeur/scamark/ingredients?stats=1

Type | Unique tags | Occurrences
known | 1982 (37.00%) | 110227 (94.99%)
unknown | 3374 (62.98%) | 5808 (5.01%)
all | 5357 (100.00%) | 116035 (100.00%)

Almost 95%. :-) A lot of the remaining strings are actually not ingredients, but mentions related to ingredients that we could try to parse into labels.
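
The gap between the two columns (few known unique tags, but most occurrences known) comes from counting each tag once versus weighting it by how often it appears. A minimal Python sketch of that computation, with hypothetical function names and made-up sample data:

```python
from collections import Counter


def percent(part, whole):
    """Percentage rounded to two decimals; 0.0 when there is nothing to count."""
    return 0.0 if whole == 0 else round(100 * part / whole, 2)


def ingredient_stats(occurrences, known_tags):
    """Summarize known/unknown tags by unique count and by occurrence count."""
    counts = {"known": [0, 0], "unknown": [0, 0]}
    for tag, count in occurrences.items():
        row = counts["known" if tag in known_tags else "unknown"]
        row[0] += 1      # one more unique tag
        row[1] += count  # weighted by number of occurrences
    total_unique = sum(row[0] for row in counts.values())
    total_occurrences = sum(row[1] for row in counts.values())
    counts["all"] = [total_unique, total_occurrences]
    return {
        kind: {
            "unique": unique,
            "unique_pct": percent(unique, total_unique),
            "occurrences": occ,
            "occurrences_pct": percent(occ, total_occurrences),
        }
        for kind, (unique, occ) in counts.items()
    }


# Made-up sample: two frequent known tags, one rare unknown string.
sample = Counter({"sucre": 90, "sel": 5, "exprime-sur-la-sauce": 5})
stats = ingredient_stats(sample, known_tags={"sucre", "sel"})
```

In the sample, the unknown string is a third of the unique tags but only 5% of occurrences, which is the same shape as the Scamark figures.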

stephanegigandet commented 5 years ago

I'll filter out sentences like "percentages are expressed on the total product" (e.g. for Scamark):

les-pourcentages-sont-exprimes-sur-le-produit-total 135 *
pourcentages-exprimes-sur-le-produit-total 52 *
pourcentages-exprimes-sur-le-total-de-la-recette 19 *
pourcentages-exprimes-sur-la-recette-au-total 14 *
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-friture 11 *
pourcentage-exprime-sur-la-sauce 9 *
pourcentages-exprimes-sur-le-produit-total-avant-friture 7 *
exprime-sur-la-sauce 7 *
exprimes-sur-la-salade-composee 5 *
pourcentages-exprimes-sur-les-nems 5 *
les-pourcentages-sont-exprimes-sur-le-produit-fini 4 *
exprimes-sur-le-mini-quatre-quarts 4 *
exprimes-sur-le-produit-total 4 *
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-precuisson 4 *
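
One possible way to catch these variants is a single regular expression over the normalized tags. The Python sketch below is only an illustration: the pattern and the function name are assumptions, built from the examples listed above rather than from the server's code.

```python
import re

# Hypothetical pattern: optional "les-", optional "pourcentage(s)-",
# optional "sont-", then "exprime(s)-sur-" followed by anything.
PERCENT_MENTION_RE = re.compile(r"^(les-)?(pourcentages?-)?(sont-)?exprimes?-sur-")


def is_percent_mention(tag):
    """True when a normalized tag is a 'percentages expressed on ...' mention."""
    return PERCENT_MENTION_RE.match(tag) is not None


tags = [
    "les-pourcentages-sont-exprimes-sur-le-produit-total",
    "pourcentage-exprime-sur-la-sauce",
    "exprimes-sur-la-salade-composee",
    "sucre",
]
ingredients_only = [tag for tag in tags if not is_percent_mention(tag)]
```

Anchoring the pattern at the start of the tag keeps it from swallowing real ingredients that merely contain "sur" somewhere in the name.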

stephanegigandet commented 5 years ago

Changes above applied to production on scamark products:

Type | Unique tags | Occurrences
known | 2019 (45.62%) | 111302 (96.63%)
unknown | 2406 (54.36%) | 3876 (3.37%)
all | 4426 (100.00%) | 115178 (100.00%)

It's getting much better :)

stephanegigandet commented 5 years ago

Overall stats for fr, before updating all products:

https://fr.openfoodfacts.org/ingredients?stats=1

Type | Unique tags | Occurrences
known | 3561 (0.74%) | 3396667 (80.86%)
unknown | 478100 (99.26%) | 804162 (19.14%)
all | 481662 (100.00%) | 4200829 (100.00%)

Corresponding ingredient analysis:

Palm oil content unknown | 129786
Vegetarian status unknown | 127237
Non-vegan | 97613
Vegan status unknown | 84865
Palm oil free | 56846
Non-vegetarian | 38902
Vegetarian | 31865
Vegan | 23730
Palm oil | 13407
Maybe vegetarian | 13379
May contain palm oil | 11344
Maybe vegan | 5175