openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Improve ingredients parsing and taxonomy for German, reduce number of unknown ingredients #2231

Open stephanegigandet opened 5 years ago

stephanegigandet commented 5 years ago

Meta bug to track the issues with parsing ingredients lists in German.

See also bug #2023 for general ingredient parsing improvements in all languages.

Current status:

https://de.openfoodfacts.org/ingredients?stats=1

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 1770 (5.41%) | 217812 (78.18%) |
| unknown | 30936 (94.59%) | 60787 (21.82%) |
| all | 32707 (100.00%) | 278599 (100.00%) |

27 Aug 2019

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 2084 (5.65%) | 335677 (85.07%) |
| unknown | 34796 (94.35%) | 58890 (14.93%) |
| all | 36881 (100.00%) | 394567 (100.00%) |

Feb 7th 2020

aleene commented 5 years ago

Issues:

chk1 commented 5 years ago

QS or QS-Ware is a label for certain products, mostly meat, produced under certain (good/favourable) conditions. I have tagged a few products with "QS" before: https://world.openfoodfacts.org/label/qs

You are right that it probably shouldn't be listed in the ingredients. I guess it ended up there because OCR doesn't remove this kind of text(?)

stephanegigandet commented 5 years ago

@chk1 @aleene: I had a look at the lists with QS-Ware. We can probably use the same parsing feature we have for organic and/or fair-trade ingredients (things like Sugar, salt. : organic)
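To illustrate the idea of treating "QS-Ware" as a label rather than an ingredient, here is a minimal Python sketch. It is not the server's Perl parser: the function name `split_labels`, the `LABEL_PHRASES` list and the sample ingredient text are all assumptions made up for this example.

```python
import re

# Hypothetical list of label phrases; not taken from the server code.
LABEL_PHRASES = ["QS-Ware"]


def split_labels(ingredients_text: str) -> tuple[str, list[str]]:
    """Remove known label phrases from the text and return them separately."""
    found = []
    cleaned = ingredients_text
    for phrase in LABEL_PHRASES:
        # Match the phrase with optional surrounding parentheses and leading spaces.
        pattern = re.compile(r"\s*\(?\b" + re.escape(phrase) + r"\b\)?", re.IGNORECASE)
        if pattern.search(cleaned):
            found.append(phrase)
            cleaned = pattern.sub("", cleaned)
    # Trim separators that may be left over at either end.
    return cleaned.strip(" ,."), found


# Made-up sample text, only to show the call.
cleaned, labels = split_labels("Schweinefleisch (QS-Ware), Speisesalz")
print(cleaned, labels)
```

The point of the sketch is only that the label is kept (so it could be stored as a label tag) while the ingredient list no longer contains it.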

stephanegigandet commented 5 years ago

I applied the changes from @aleene to the German ingredients to all products. We are now at exactly 80% of recognized ingredients for German:

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 1791 (5.56%) | 225071 (80.00%) |
| unknown | 30398 (94.43%) | 56284 (20.00%) |
| all | 32190 (100.00%) | 281355 (100.00%) |

https://de.openfoodfacts.org/ingredients?stats=1
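As a quick sanity check, the percentages in the "Occurrences" column can be reproduced from the raw counts. This is only a sketch: the `share` helper is a hypothetical name, and the counts are copied from the figures in this comment.

```python
def share(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


# Occurrence counts copied from the stats in this comment (German ingredients).
known_occurrences = 225071
unknown_occurrences = 56284
all_occurrences = 281355

known_share = share(known_occurrences, all_occurrences)
unknown_share = share(unknown_occurrences, all_occurrences)

print(f"known: {known_share:.2f}%, unknown: {unknown_share:.2f}%")
# known: 80.00%, unknown: 20.00%
```

Note that the unrounded known share is slightly below 80% (about 79.995%), so "exactly 80%" holds at two decimal places.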

mahlzahn commented 5 years ago

I went through all ingredients from https://de.openfoodfacts.org/ingredients?status=unknown with unknown status and more than 50 occurrences.

Quite a few of them are already in ingredients.txt. Why do they still show up in the list with status unknown?

| Sign | Meaning |
| --- | --- |
| ! | Already in ingredients.txt |
| v | Added to ingredients.txt in #2323 |
| x | needs further follow-up |
| ? | may safely be ignored |

| Sign | Ingredient | Occurrences | Comment |
| --- | --- | --- | --- |
| ! | sahnepulver | 244 | Already in ingredients.txt |
| v | käsereikulturen | 171 | Added to ingredients.txt |
| v | entrahmte-milch | 163 | Added to ingredients.txt |
| ? |  | 157 | Empty string probably like: ,, |
| v | joghurt-mild | 155 | Added to ingredients.txt |
| ? | tr | 155 | Belongs to “Fett i. Tr.”, can be safely ignored. |
| ! | sonnenblume | 151 | Already in ingredients.txt |
| ! | pflanzliches-öl | 135 | Already in ingredients.txt |
| x | reifekulturen | 114 | Used in many sausage products, probably “Glucon Delta Lecton E 575” according to https://web.archive.org/web/20180719074852/https://www.merkur.de/wirtschaft/sind-tricks-lebensmittelindustrie-zr-7303827.html |
| x | würze | 111 | https://de.wikipedia.org/wiki/W%C3%BCrze_(Lebensmittel)<br>https://www.lebensmittelklarheit.de/informationen/wuerze-hat-mit-gewuerzen-nicht-viel-zu-tun<br>Leitsätze für Gewürze und andere würzende Zutaten. In: Deutsches Lebensmittelbuch. Deutsche Lebensmittelbuch-Kommission.<br>may contain soya / wheat<br>Added to ingredients.txt for Proteinhydrolysat<br>Edit: Added to ingredients.txt as new entry. |
| ! | natürliches-vanille-aroma | 108 | Already in ingredients.txt |
| ! | pflanzliches-fett | 104 | Already in ingredients.txt |
| ? | trennmittel | 100 | May be added as “release agent”? Can be safely ignored, because always specified which release agent is used. |
| ! | weißer-pfeffer | 99 | Already in ingredients.txt |
| v | kokosnuss | 96 | Added to ingredients.txt |
| v | shea | 95 | Always in combination with vegetable fat.<br>Added to ingredients.txt |
| ! | orangensaft-aus-orangensaftkonzentrat | 92 | Already in ingredients.txt |
| v | rosinen | 89 | Added to ingredients.txt |
| v | koffein | 89 | Added to ingredients.txt |
| ! | palmkern | 85 | Already in ingredients.txt |
| v | emulgator-sojalecithine | 83 | Added to ingredients.txt |
| v | saflor | 82 | Added to ingredients.txt |
| ! | säuerungskulturen | 79 | Already in ingredients.txt |
| x | sauerungsmittel | 77 | Should be changed for products to Säuerungsmittel → bot? |
| v | hibiskus | 76 | Added to ingredients.txt |
| ! | weizeneiweiß | 76 | Already in ingredients.txt |
| ! | aroma-koffein | 75 | Already in ingredients.txt<br>but only with German entry |
| v | kichererbsen | 75 | Added to ingredients.txt |
| x | homogenisiert | 75 | Needs parsing/OCR improvements, often in combinations like “Frische Vollmilch, 3,5% Fett, pasteurisiert, homogenisiert.” |
| ! | kokosöl | 74 | Added to ingredients.txt |
| ? | fett-i | 73 | Belongs to “Fett i. Tr.”, can be safely ignored. |
| ! | natursauerteig | 73 | Already in ingredients.txt<br>but only with German entry |
| v | fettarme-milch | 71 | Added to ingredients.txt<br>Milk categories need further cleanup, e.g. Magermilch (0.3%) in semi-skimmed milk section and fettarme Milch (>1.5%) in skimmed milk section |
| v | sonnenblumenlecithin | 67 | Added to ingredients.txt |
| x | pasteurisiert | 66 | Needs parsing/OCR improvements, often in combinations like “Frische Vollmilch, 3,5% Fett, pasteurisiert, homogenisiert.” |
| ! | reifungskulturen | 64 | Already in ingredients.txt |
| ! | glucosesirup | 62 | Already in ingredients.txt |
| v | fruktose | 60 | Added to ingredients.txt |
| ! | muskat | 60 | Already in ingredients.txt |
| v | möhren | 58 | Added to ingredients.txt |
| v | weißweinessig | 58 | Added to ingredients.txt |
| ! | tapiokastärke | 55 | Already in ingredients.txt |
| v | füllung | 54 | Added to ingredients.txt |
| x | monound-diglyceride-von-speisefettsäuren | 54 | Parsing/OCR error. Should be changed for products from “Mono - und …” to “Mono- und …” → bot? |
| ! | pastinaken | 54 | Already in ingredients.txt |
| ! | kokosraspeln | 54 | Already in ingredients.txt |
| ? | aus-kontrolliert-biologischem-anbau | 53 | Can be safely ignored |
| ! | pfirsiche | 53 | Already in ingredients.txt |
| v | starterkulturen | 52 | Usually in cheeses (also in few meat products)<br>Added to ingredients.txt |
| ? | unter-schutzatmosphäre-verpackt | 51 | Can be safely ignored |
| x | fr:speisesalz | 50 | Should be changed for products to Speisesalz → bot? |
| v | calciumcitrate | 50 | Added to ingredients.txt |
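Several of the "x" and "?" entries above come down to text cleanup before the list is split into ingredients. The Python sketch below shows one way such a pre-processing step could look. It is not the server's Perl parser: the names `IGNORABLE`, `FIXES`, `normalize` and `split_ingredients`, the separator rule and the sample text are all assumptions for illustration.

```python
import re

# Phrases marked above as safe to ignore (they are not ingredients).
IGNORABLE = [
    "aus kontrolliert biologischem Anbau",
    "unter Schutzatmosphäre verpackt",
]

# Spelling fixes suggested above as candidates for a bot.
FIXES = [
    (re.compile(r"\bMono\s+-\s*und\b"), "Mono- und"),
    (re.compile(r"\bSauerungsmittel\b"), "Säuerungsmittel"),
]


def normalize(text: str) -> str:
    """Apply the spelling fixes, then drop the ignorable phrases."""
    for pattern, replacement in FIXES:
        text = pattern.sub(replacement, text)
    for phrase in IGNORABLE:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    return text


def split_ingredients(text: str) -> list[str]:
    """Split on ',', '.' or ';' when followed by whitespace or end of text."""
    # Protect "i. Tr." (in der Trockenmasse) so its dots are not separators;
    # splitting on them would yield fragments such as "Fett i" and "Tr".
    protected = re.sub(r"\bi\.\s*Tr\.", "i_Tr_", normalize(text))
    parts = re.split(r"[,.;]\s+|[,.;]$", protected)
    return [p.strip().replace("i_Tr_", "i. Tr.") for p in parts if p.strip()]


# Made-up sample text combining several of the cases listed above.
sample = (
    "Käse (45% Fett i. Tr.), Mono - und Diglyceride von Speisefettsäuren, "
    "Sauerungsmittel. Unter Schutzatmosphäre verpackt."
)
print(split_ingredients(sample))
```

Because a separator must be followed by whitespace or the end of the text, the decimal comma in "3,5% Fett" is left alone. Processing words such as "pasteurisiert" and "homogenisiert" still come out as separate items, so they would need their own handling.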

aleene commented 5 years ago

I added many of those. Maybe it takes some time before everything is parsed again?

teolemon commented 4 years ago

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 2084 (5.65%) | 335677 (85.07%) |
| unknown | 34796 (94.35%) | 58890 (14.93%) |
| all | 36881 (100.00%) | 394567 (100.00%) |

Feb 7th 2020