q-m / food-ingredient-parser-ruby

Extract the structure of ingredient lists on food products
MIT License

Improve handling of 'and' #3

Closed: wvengen closed this issue 6 years ago

wvengen commented 6 years ago

The strict parser has some rudimentary support for 'and', but it has issues: some valid ingredient lists containing 'and' are not parsed.

wvengen commented 6 years ago

One approach could be to move the splitting of ingredients on 'and' into a transformer that works on the parsed tree. The only case it wouldn't catch is when the first of the two ingredients is a nested ingredient or has an amount, e.g. 'oil (canola) and fats' or 'tomato 30% and paprika 20%'. These cases need to be handled in the parser itself.

My current idea would be to: (1) strip 'and' handling from the current parser, which cleans up the code; (2) handle 'and' in the parser for the cases above, so that they can be parsed; and (3) optionally split ingredients on 'and' after parsing. This last step may also be omitted, because it remains difficult to know whether a name should be split or not (as in 'red and black beans').
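As a rough illustration of step (3), a post-parse transformer could look something like the sketch below. The tree shape (hashes with :name, :amount and :contains keys) and the split_and helper are assumptions made up for this example, not necessarily what the parser actually returns. It only splits leaf ingredients without an amount, since for the other cases it is unclear which part the amount or the nested ingredients belong to.

```ruby
# Hypothetical tree shape, assumed for illustration only:
#   { contains: [ { name: "...", amount: "...", contains: [...] }, ... ] }

AND_SEPARATOR = /\s+and\s+/i

# Returns a new list of ingredient nodes in which plain names
# containing 'and' are split into separate ingredients.
def split_and(nodes)
  nodes.flat_map do |node|
    # Recurse into nested ingredients first.
    node = node.merge(contains: split_and(node[:contains])) if node[:contains]

    parts = node[:name].to_s.split(AND_SEPARATOR)
    # Only split leaf nodes without an amount; anything else is left
    # untouched. Note that any extra keys on a split node are dropped.
    if parts.size > 1 && !node[:amount] && !node[:contains]
      parts.map { |part| { name: part } }
    else
      [node]
    end
  end
end

tree = {
  contains: [
    { name: "salt and pepper" },
    { name: "oil", contains: [{ name: "canola" }] },
    { name: "red and black beans" }
  ]
}

result = tree.merge(contains: split_and(tree[:contains]))
```

With this input, 'salt and pepper' becomes two ingredients, but 'red and black beans' is also split, into 'red' and 'black beans'. That is exactly the ambiguity that makes this step hard to get right, and a reason to keep it optional.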

wvengen commented 6 years ago

(1) and (2) are done in 8d3b7f26b6fd00b5d05894389472a62905ce63df. The percentage of parsed ingredients (from the excerpt file) increases to 77.5%.