Open stephanegigandet opened 5 years ago
Scanning through the list, many entries are not ingredients but things like trace mentions, so we should try to parse those out first.
The first one is at 71 (crème pasteurisé). I can start with these. It is a nice list for testing the parser. The issues really appear at the bottom.
PS: can you accept a few pull requests? I am waiting for them to be accepted before continuing.
proteines de {allergene}lait{/allergene}
farine de {allergene}ble{/allergene}
amidon modifié de maïs et/ou de pomme de terre
proteines de {allergene}soja{/allergene} rehydratees
fibres de {allergene}ble{/allergene}
{allergene}oeufs{/allergene} entiers frais
https://fr.openfoodfacts.org/editeur/scamark/ingredient/alcohol-denat
https://fr.openfoodfacts.org/editeur/scamark/ingredient/parfum
https://fr.openfoodfacts.org/editeur/scamark/ingredient/lt
https://fr.openfoodfacts.org/editeur/scamark/ingredient/zu:eau
In fact it is not too bad. The important ingredients that occur often seem to be covered; only some synonyms need to be added.
I added the ingredients here: #2027 (still work in progress)
Why is this one not recognised even though it is in the taxonomy (accent removed too early?): purée de tomate
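To illustrate how an "accent removed too early" bug can cause this kind of miss, here is a minimal Python sketch. It is not the actual Open Food Facts parser; `to_tag`, `lookup_mismatched` and `lookup_consistent` are hypothetical names, and the one-entry taxonomy is made up for the example. The point is only that both sides of the comparison have to be normalised the same way.

```python
import re
import unicodedata


def to_tag(text: str) -> str:
    """Lowercase, strip accents, and hyphenate (assumed tag format)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", no_accents).strip("-")


# Hypothetical taxonomy storing the accented canonical name.
TAXONOMY = {"purée de tomate"}


def lookup_mismatched(ingredient: str) -> bool:
    # Accents are stripped from the input only, so the unaccented tag
    # is compared against accented taxonomy entries and never matches.
    return to_tag(ingredient) in TAXONOMY


def lookup_consistent(ingredient: str) -> bool:
    # Both the input and the taxonomy entries go through the same normalisation.
    normalised_taxonomy = {to_tag(entry) for entry in TAXONOMY}
    return to_tag(ingredient) in normalised_taxonomy


print(to_tag("purée de tomate"))             # puree-de-tomate
print(lookup_mismatched("purée de tomate"))  # False
print(lookup_consistent("purée de tomate"))  # True
```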
Current stats: https://fr.openfoodfacts.org/editeur/scamark/ingredients?stats=1
Type | Unique tags | Occurrences |
---|---|---|
known | 1982 (37.00%) | 110227 (94.99%) |
unknown | 3374 (62.98%) | 5808 (5.01%) |
all | 5357 (100.00%) | 116035 (100.00%) |
Almost 95%. :-) A lot of the remaining strings are actually not ingredients, but mentions related to ingredients that we could try to parse into labels.
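For reference, the two percentage columns in these stats tables (share of unique tags and share of occurrences) can be computed as in the sketch below. The function name `ingredient_stats` and the sample counts are made up for illustration; this is not the code behind the `?stats=1` page.

```python
from collections import Counter


def ingredient_stats(occurrences: Counter, taxonomy: set) -> dict:
    """Split tags into known/unknown and compute both percentage columns."""
    total_tags = len(occurrences)
    total_occ = sum(occurrences.values())
    stats = {}
    for label, is_known in (("known", True), ("unknown", False)):
        tags = [t for t in occurrences if (t in taxonomy) == is_known]
        occ = sum(occurrences[t] for t in tags)
        stats[label] = {
            "tags": len(tags),
            "tags_pct": round(100 * len(tags) / total_tags, 2),
            "occurrences": occ,
            "occurrences_pct": round(100 * occ / total_occ, 2),
        }
    return stats


# Made-up sample: two frequent known ingredients, two rare unknown strings.
sample = Counter({"eau": 90, "sel": 5, "xyz": 3, "abc": 2})
result = ingredient_stats(sample, taxonomy={"eau", "sel"})
print(result["known"])    # 2 of 4 tags, 95 of 100 occurrences
print(result["unknown"])  # 2 of 4 tags, 5 of 100 occurrences
```

The sample shows the same pattern as the real numbers: a minority of unique tags can still cover the vast majority of occurrences.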
I'll filter out sentences like "percentages are expressed on the total product", e.g. for Scamark:
Tag | Count | |
---|---|---|
les-pourcentages-sont-exprimes-sur-le-produit-total | 135 | * |
pourcentages-exprimes-sur-le-produit-total | 52 | * |
pourcentages-exprimes-sur-le-total-de-la-recette | 19 | * |
pourcentages-exprimes-sur-la-recette-au-total | 14 | * |
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-friture | 11 | * |
pourcentage-exprime-sur-la-sauce | 9 | * |
pourcentages-exprimes-sur-le-produit-total-avant-friture | 7 | * |
exprime-sur-la-sauce | 7 | * |
exprimes-sur-la-salade-composee | 5 | * |
pourcentages-exprimes-sur-les-nems | 5 | * |
les-pourcentages-sont-exprimes-sur-le-produit-fini | 4 | * |
exprimes-sur-le-mini-quatre-quarts | 4 | * |
exprimes-sur-le-produit-total | 4 | * |
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-precuisson | 4 | * |
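A rough sketch of such a filter, written in Python for illustration only (the pattern and the `split_mentions` helper are assumptions, not the parser's actual implementation). It works on already-normalised tags and treats anything of the form "(les) pourcentage(s) (sont) exprimé(s) sur ..." as a mention rather than an ingredient:

```python
import re

# Assumed pattern: optional "les-pourcentage(s)-(sont-)" prefix,
# then "exprime(s)-sur-" followed by whatever the percentage refers to.
PERCENT_MENTION = re.compile(
    r"^(?:(?:les-)?pourcentages?-(?:sont-)?)?exprimes?-sur-.+$"
)


def split_mentions(tags):
    """Return (ingredients, mentions) from a list of normalised tags."""
    ingredients, mentions = [], []
    for tag in tags:
        (mentions if PERCENT_MENTION.match(tag) else ingredients).append(tag)
    return ingredients, mentions


tags = [
    "les-pourcentages-sont-exprimes-sur-le-produit-total",
    "pourcentage-exprime-sur-la-sauce",
    "exprimes-sur-le-mini-quatre-quarts",
    "puree-de-tomate",
]
ingredients, mentions = split_mentions(tags)
print(ingredients)  # ['puree-de-tomate']
print(len(mentions))  # 3
```

The pattern is deliberately anchored at both ends so that a real ingredient that merely contains "sur" is left alone.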
The changes above have been applied in production to the Scamark products:
Type | Unique tags | Occurrences |
---|---|---|
known | 2019 (45.62%) | 111302 (96.63%) |
unknown | 2406 (54.36%) | 3876 (3.37%) |
all | 4426 (100.00%) | 115178 (100.00%) |
It's getting much better :)
Overall stats for fr, before updating all products:
https://fr.openfoodfacts.org/ingredients?stats=1
Type | Unique tags | Occurrences |
---|---|---|
known | 3561 (0.74%) | 3396667 (80.86%) |
unknown | 478100 (99.26%) | 804162 (19.14%) |
all | 481662 (100.00%) | 4200829 (100.00%) |
Corresponding ingredient analysis:
Ingredient analysis | Count |
---|---|
Palm oil presence unknown | 129786 |
Vegetarian status unknown | 127237 |
Non-vegan | 97613 |
Vegan status unknown | 84865 |
Palm oil free | 56846 |
Non-vegetarian | 38902 |
Vegetarian | 31865 |
Vegan | 23730 |
Palm oil | 13407 |
Maybe vegetarian | 13379 |
May contain palm oil | 11344 |
Maybe vegan | 5175 |
I'm going to try to significantly reduce the number of unknown ingredients for products that have correct ingredients lists (e.g. product data from manufacturers). This will allow better NOVA computation, but also new vegetarian / vegan recognition etc.
I'm going to use the Scamark / Leclerc import as a benchmark.
Today we have 5220 products with 5764 ingredients; 4035 of those ingredients are unknown (not in the taxonomy): https://fr.openfoodfacts.org/editeur/scamark/ingredients
There are 2 main ways to improve this: