openfoodfacts / openfoodfacts-server

Open Food Facts database, API server and web interface - 🐪🦋 Perl, CSS and JS coders welcome 😊 For helping in Python, see Robotoff or taxonomy-editor
http://openfoodfacts.github.io/openfoodfacts-server/
GNU Affero General Public License v3.0

Handle hyphenated words broken with a dash in ingredients lists (analysis and OCR extraction) #3007

Open stephanegigandet opened 4 years ago

stephanegigandet commented 4 years ago

We could try to "unbreak" words separated with a dash.

e.g. https://world.openfoodfacts.org/product/5411188118121/alpro

"eau, fèves de soja décort- quées (10,8%), citrate tricalcique, stabilisant (pectines), correcteurs d'acidité (citrates de sodium, acide citrique), arôme naturel, sel marin, antioxygènes (extrait riche en tocophé- rols, esters d'acides gras de l'acide ascor- bique), vitamines (B12, D2), ferments de yaourt (S. thermophilus, L. bulgaricus)."

Related to #2239

hangy commented 4 years ago

This could be pretty complex, as a hyphen can also be a valid part of a word (at least in German). Basically, a dictionary would need to be integrated to avoid false positives.
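
To make the false-positive risk concrete, here is a minimal Python sketch (not code from the server, which is written in Perl). The function name `naive_unbreak` and the German phrase are assumptions made up for illustration; the French fragment comes from the example above. It joins every hyphen that is followed by whitespace, with no dictionary check.

```python
import re

# Naive rule: any "-" followed by whitespace is assumed to be a line-break hyphen.
BROKEN_WORD = re.compile(r"(\w)-\s+(\w)")

def naive_unbreak(text: str) -> str:
    """Join the two fragments around a trailing hyphen, with no dictionary check."""
    return BROKEN_WORD.sub(r"\1\2", text)

# Works for a word that was broken at a line end:
print(naive_unbreak("extrait riche en tocophé- rols"))
# -> extrait riche en tocophérols

# Breaks a legitimate German suspended hyphen (illustrative phrase, not from the issue):
print(naive_unbreak("Obst- und Gemüsesaft"))
# -> Obstund Gemüsesaft
```

The second call shows why some lookup against known words is likely needed before joining.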

stephanegigandet commented 4 years ago

@hangy: right. The way I do it for similar parsing features (like breaking "A and B") is to first check whether the source "A and B" exists in the taxonomy; in that case I do nothing. If it doesn't exist, I check whether A exists and whether B exists. If both exist, I assume that they are 2 ingredients. We can try something similar for recombining.
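
A rough Python sketch of what "something similar for recombining" might look like: join the two fragments only when the joined word is known, and otherwise leave the text untouched. This is an assumption about one possible approach, not the server's implementation. The `known_words` set is a hypothetical stand-in for a word list derived from the ingredients taxonomy, and `unbreak_known_words` is an illustrative name.

```python
import re

# Candidate: "A- B" or "A - B" (hyphen followed by whitespace between two fragments).
CANDIDATE = re.compile(r"(\w+)\s*-\s+(\w+)")

def unbreak_known_words(text: str, known_words: set[str]) -> str:
    """Join hyphen-broken fragments only if the joined word is in known_words."""
    def replace(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined          # joined form is known: recombine
        return match.group(0)      # unknown: keep the original text as-is
    return CANDIDATE.sub(replace, text)

# Hypothetical vocabulary; a real one would come from the taxonomy.
known_words = {"maltodextrina", "carbonato", "tocophérols", "ascorbique"}

print(unbreak_known_words("malto - dextrina (fibra), car - bonato célcico", known_words))
# -> maltodextrina (fibra), carbonato célcico

print(unbreak_known_words("Obst- und Gemüsesaft", known_words))
# -> Obst- und Gemüsesaft
```

Leaving unknown combinations untouched keeps the change conservative: a missing vocabulary entry results in no recombination rather than a wrong one.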

stephanegigandet commented 4 years ago

" jus de pample - mousse rose à base de concentré 14%" - https://fr.openfoodfacts.org/produit/3092719701443/sirop-de-pamplemousse-rose-zero-sucres-teisseire

stephanegigandet commented 4 years ago

From #3616 reported by @AcuarioCat in Spanish:

agua, habas de soja descascarilladas (4%), malto - dextrina (fibra), fructosa, azücar, car - bonato célcico, corrector de acidez (fosfato monopotésico), sal marina, aroma, estabilizante (goma gellan), vitaminas (riboflavina (B2), B12, D), aroma natural. Naturalmente sin lactosa y sin gluten.

which should be:

agua, habas de soja descascarilladas (4%), maltodextrina (fibra), fructosa, azücar, carbonato célcico, corrector de acidez (fosfato monopotésico), sal marina, aroma, estabilizante (goma gellan), vitaminas (riboflavina (B2), B12, D), aroma natural

github-actions[bot] commented 8 months ago

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, editing the original issue to have a more comprehensive description… Thank you very much for your contribution to 🍊 Open Food Facts