q-m / food-ingredient-parser-ruby

Extract the structure of ingredient lists on food products
MIT License

Handle semicolon after colon #13

Open wvengen opened 6 years ago

wvengen commented 6 years ago

An ingredients list like "Schokolade (Süßungsmittel: Maltit; Kakaobutter, Kakaomasse)" contains mixed separators (; and ,). Here the semicolon is used to indicate the end of the second-level nesting for Maltit.

wvengen commented 6 years ago

This is not really an issue for the loose parser, which treats all separators as equal. But if the need arises, it could be implemented there as well.

wvengen commented 5 years ago

The semicolon often carries meaning, e.g. in

glucosesiroop, suiker, water, gemodificeerd zetmeel, gelatine (rund), vitamine A, vitamine C, vitamine D3, vitamine E, vitamine B6, foliumzuur, vitamine B12, biotine, pantotheenzuur, kaliumjodide, zinkcitraat, magnesiumoxide, zuurteregelaar: citroenzuur; kleurstoffen: curcumine, anthocyanen (vlierbes); natuurlijke aroma’s: sinaasappel, kers, citroen; glansmiddel: carnaubawas; plantaardige olie: kokosnootolie (Cocos nucifera L.); emulgatoren: mono- en diglyceriden van vetzuren, citroenzuuresters van mono- en diglyceriden van vetzuren; maltodextrine.

Here the semicolon ends a list after a colon.
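
To make the rule concrete, here is a minimal plain-Ruby sketch of it. This is not the Treetop grammar used by food-ingredient-parser-ruby, and `split_coloned` is a hypothetical helper name. It assumes that a colon opens a nested list that runs up to the next semicolon (or the end of the text), and that no separators occur inside parentheses, which real labels do not guarantee.

```ruby
# Hypothetical helper, not part of food-ingredient-parser-ruby.
# Assumption: a colon opens a nested list that is closed by the next ';'.
# Assumption: no ',', ':' or ';' appears inside parentheses.
def split_coloned(text)
  text.strip.chomp('.').split(';').flat_map do |segment|
    # Everything before the first colon is plain; everything after it is nested.
    head, colon, tail = segment.partition(':')
    plain = head.split(',').map(&:strip).reject(&:empty?)
    next plain if colon.empty? # no colon: only plain ingredients

    name = plain.pop # the last plain item names the nested list
    plain + [{ name => tail.split(',').map(&:strip).reject(&:empty?) }]
  end
end

result = split_coloned(
  'water, zuurteregelaar: citroenzuur; kleurstoffen: curcumine, anthocyanen; maltodextrine.'
)
# result is an array equal to:
# ["water",
#  { "zuurteregelaar" => ["citroenzuur"] },
#  { "kleurstoffen" => ["curcumine", "anthocyanen"] },
#  "maltodextrine"]
```

Because the sketch only looks at the first colon in each semicolon-delimited segment, it cannot handle a coloned list that is closed by a comma instead of a semicolon.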

wvengen commented 4 years ago

Another example, where the semicolon also ends the list after a colon.

Water; plantaardige oliën (zonnebloem 15,2%, raapzaad 6%, lijnzaad 4,8%, palm, palmpit, geheel geharde palmpit, geheel geharde palm); mineraal: calciumzouten van orthofosforzuur; gemodificeerd maïszetmeel; palmstearine; emulgatoren: E471 (niet dierlijk) en zonnebloemlecithine; zout 0,2%; conserveermiddel: E202; voedingszuur: citroenzuur; antioxidant: E385; aroma; vitaminen: A, thiamine (B1), riboflavine (B2), B6, foliumzuur (B11), B12 en D2; kleurstof: carotenen

wvengen commented 5 months ago

Ok, I have something that seems to work ...

rule list
  # ...
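  # one or more coloned ingredients, each optionally preceded by plain ones, then trailing plain ingredients
  contains:( ( (ingredient ws* ',' ws*)* ingredient_coloned )+ ( ws* ingredient (ws* ',' ws* ingredient)* ) ) <ListNode>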
  # ...
end

rule ingredient_coloned_inner_list
  # ...
  # comma-separated items after the colon, terminated by a semicolon
  contains:( ingredient_coloned_simple_with_amount_and_nest ( ws* ',' ws* ingredient_coloned_simple_with_amount_and_nest )* ';' ) <ListNode>
end

wvengen commented 5 months ago

This seems to tackle it! An ingredient listing like

Ingrediënten: mineraalwater, suiker, citroensap uit concentraat, aardbeiensap uit concentraat, smaakversterker: erythritol, natuurlijk aroma, zoetstof: steviolglycosiden; vitaminen: Vitamine B6, Vitamine B12.

used to put everything after the ; in the notes, but it is properly parsed with this change!

Update: actually, this is a somewhat malformed line: some coloned ingredients end with a comma, others with a semicolon. In this instance, one can understand that smaakversterker: erythritol is one nested ingredient, and natuurlijk aroma the next.

wvengen commented 5 months ago

I am still having trouble parsing an ingredient list that contains a nesting IngredientColoned and ends with a non-nested ingredient.

wvengen commented 5 months ago

Commit a4ca35cc9bf28ebd72162358f25046736512d3f4 handles most cases. Pending: