Open wvengen opened 6 years ago
For examples, run: grep '\*\*\w[^,;.:()]*\w\*\*' data/ingredient-samples-nl
It would be a good idea to add allergen detection along with this (#4).
I've only observed the markers around a whole word or around the first part of a compound word, e.g.:
**melk**
**melkeiwitpoeder**
**mosterd**-dille
**soja**saus
**schapen**- en **geitenmelk**
One idea would be to strip such occurrences, and add the text within the double asterisks as allergens of the ingredient.
Though **schapen**- en **geitenmelk** is actually ambiguous: does it contain sheep and goat-milk, or does it contain sheep-milk and goat-milk? (I guess the latter, but this can only be known by taking domain knowledge and context into account, so it is not something for a parser.)
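The stripping idea can be illustrated with a minimal Python sketch. It is not the parser's implementation; the function name `strip_allergen_marks` and the pattern are assumptions made for illustration only. It removes the double asterisks and returns the marked text as a list of allergen candidates, without trying to resolve cases like the one above.

```python
import re

# Hypothetical pattern: text between a pair of double asterisks,
# not containing an asterisk itself.
MARKED = re.compile(r"\*\*([^*]+)\*\*")

def strip_allergen_marks(name):
    """Return (name without ** markers, list of marked fragments)."""
    allergens = MARKED.findall(name)
    cleaned = MARKED.sub(r"\1", name)
    return cleaned, allergens

if __name__ == "__main__":
    for sample in ["**melk**", "**soja**saus", "**schapen**- en **geitenmelk**"]:
        print(strip_allergen_marks(sample))
```

Note that for **schapen**- en **geitenmelk** this yields the fragments "schapen" and "geitenmelk" as written; expanding "schapen" to sheep milk would need the domain knowledge mentioned above.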
Sometimes ingredients are surrounded by double asterisks; this probably marks an allergen (see also #4). The strict parser doesn't currently handle this (or recognizes it as the start of notes), and the loose parser recognizes the first ** as a mark and includes the second ** in the resulting name. This happens in 0.15% of the ingredient lists.
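A share like this could be measured with a short Python sketch such as the one below. It reuses the pattern from the grep command in this issue; the function name `marked_fraction` and the assumption that each input line holds one ingredient list are mine, not something the issue states.

```python
import re

# Same pattern as the grep command in this issue.
MARK = re.compile(r"\*\*\w[^,;.:()]*\w\*\*")

def marked_fraction(lines):
    """Fraction of lines containing at least one **...** marker.

    Assumes one ingredient list per line (an assumption for illustration).
    """
    lines = list(lines)
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if MARK.search(line))
    return hits / len(lines)

if __name__ == "__main__":
    sample = ["water, **melk**, suiker", "water, zout", "**soja**saus", "tarwe"]
    print(f"{marked_fraction(sample):.2%}")
```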