q-m / food-ingredient-parser-ruby

Extract the structure of ingredient lists on food products
MIT License

Handle **ingredient** #11

Open wvengen opened 6 years ago

wvengen commented 6 years ago

Sometimes ingredients are surrounded by double asterisks, which probably marks an allergen (see also #4). The strict parser currently either doesn't handle this or recognizes it as the start of notes, and the loose parser recognizes the first ** as a mark and includes the second ** in the resulting name.

This happens in 0.15% of the ingredient lists.

wvengen commented 6 years ago

For examples, run `grep '\*\*\w[^,;.:()]*\w\*\*' data/ingredient-samples-nl`.
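The same search can be sketched in Ruby, which is handy for checking what exactly the pattern captures. This is only an illustration: the constant `MARKED` and the method `marked_spans` are hypothetical names, not part of the gem. Note that Ruby's `\w` matches ASCII word characters only, so words that start or end with an accented letter may be treated differently than by grep.

```ruby
# Hypothetical helper, not part of food-ingredient-parser.
# The regexp is copied from the grep command: two asterisks, a word
# character, anything except list punctuation, a word character, and
# two asterisks again.
MARKED = /\*\*\w[^,;.:()]*\w\*\*/

# Returns every asterisk-marked span found in one ingredient list.
def marked_spans(line)
  line.scan(MARKED)
end

marked_spans("suiker, **melk**poeder, zout")
# => ["**melk**"]
```

Because the middle part is greedy, two marked words within the same comma-separated item are returned as one span.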

wvengen commented 4 years ago

It would be a good idea to add allergen detection together with this (#4).

wvengen commented 4 months ago

I've only observed it around a word, or mid-word, e.g.

wvengen commented 4 months ago

One idea would be to strip such occurrences and add the text within the double asterisks as allergens of the ingredient. However, **schapen**- en **geitenmelk** is actually ambiguous: does it contain sheep and goat-milk, or does it contain sheep-milk and goat-milk? (I guess the latter, but this can only be known by taking domain knowledge and context into account, so it is not something for a parser.)
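A minimal sketch of that idea as a post-processing step on an ingredient name, assuming the name has already been split off by the parser. The names `ALLERGEN_MARK` and `extract_marked_allergens` are hypothetical and do not exist in the gem. The sketch records the marked text literally and does not try to resolve the ambiguity described above.

```ruby
# Hypothetical helper, not part of food-ingredient-parser.
# Matches text enclosed in double asterisks, as short as possible, so
# that two marked words in one name are captured separately.
ALLERGEN_MARK = /\*\*(.+?)\*\*/

# Strips the asterisks from a name and collects the marked text.
def extract_marked_allergens(name)
  allergens = name.scan(ALLERGEN_MARK).flatten
  stripped = name.gsub(ALLERGEN_MARK, '\1')
  { name: stripped, allergens: allergens }
end

extract_marked_allergens("**schapen**- en **geitenmelk**")
# => { name: "schapen- en geitenmelk", allergens: ["schapen", "geitenmelk"] }
```

A mid-word mark such as `**melk**poeder` would give the name `melkpoeder` with `melk` as its allergen.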