Open mcioffi opened 1 month ago
Hi @mcioffi
Thanks for the feature request. This certainly seems possible and I'm willing to give it a go.
I can think of two ways we might achieve this:
That's great to hear. I believe folks might side with (2), though that's just me speaking into thin air. The training side would ensure robustness.
The parser engine has helped with a trivia site that we are expanding under theblankdish.com to include a whole repo of recipes instead of the single experimental one it has now.
I would imagine others could benefit from the feature to analyze raw ingredients 👍🏼
It's always good to hear that people are finding this useful.
After a bit of investigation, option 1 that I suggested above simply won't work because the part-of-speech tags don't separate the core ingredient from any other words.
I've started having a look at relabelling the training sentences. I'll probably do a small-ish subset of them to see how it's working, but that will still take a while. Some examples of how I think this will work:
Effectively, we would identify the token(s) that reduce the ingredient to its most fundamental identity.
@strangetom the tokenization above looks great. I wonder if one corner case is when the traditional token noun becomes the adjective.
To illustrate:
In those types of ingredients where the texture changes but the core ingredient still applies, I wonder what would be best.
My thinking on how to do this has evolved a bit since my last comment. I've decided to use the FDC's Foundation Foods as the starting point for deciding what the fundamental ingredient names should be.
The results will be a little more detailed than I suggested above (e.g. red onion and yellow onion would be separate, instead of both being onion). For the cases you mentioned, it would mean that pepper and pepper flakes are considered different fundamental names, as would garlic and garlic powder.
To give a quick peek at how it currently works:
>>> from ingredient_parser import parse_ingredient
>>> parse_ingredient("1 organic mini cucumber").name
IngredientText(text='organic mini cucumber', confidence=0.938989)
>>> parse_ingredient("1 organic mini cucumber", core_names=True).name
IngredientText(text='cucumber', confidence=0.900238)
>>> parse_ingredient("1 organic mini cucumber", core_names=True).comment
IngredientText(text='organic mini', confidence=0.958365)
This is using a model trained from the core_names branch.
Thanks @strangetom. Given that FDC's Foundation Foods is being folded in, do you see any benefit in piping the foodCategory field in their datasets through into the ParsedIngredient class?
It's auxiliary to the goal of this feature, but I took note of it while looking through the FDC's datasets, since it does provide a basic categorization. A rough idea:
FDC Foundation Food item
{
    "description": "Beans, Dry, Pinto (0% moisture)",
    ...
    "foodCategory": {
        "description": "Legumes and Legume Products"
    },
    ...
    "fdcId": 747445,
    "dataType": "Foundation"
}
ParsedIngredient class
ParsedIngredient(
    name=IngredientText(text='pinto beans', confidence=0.999193),
    size=None,
    amount=[IngredientAmount(quantity='6',
                             unit=<Unit('cup')>,
                             text='6 cups',
                             confidence=0.999906,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=IngredientText(text='cooked', confidence=0.999193),
    comment=None,
    purpose=None,
    category="Legumes and Legume Products",
    sentence='6 cups of cooked pinto beans'
)
Additionally, in parallel to your repo, I am scraping some other sources (e.g. bonappetit, saveur, food52) to add to your recipe training data, since many (if not most) recipe sites and blogs adhere to the Google Structured Data for Recipes as part of SEO. Because of this, fetching entire recipe portfolios with ingredients is much easier these days than it used to be.
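For anyone curious, here is a rough sketch of the idea, assuming the site embeds its recipe markup as JSON-LD (a script tag of type application/ld+json carrying a schema.org Recipe object with a recipeIngredient list). The helper names and the sample page are made up for illustration, and real sites may use microdata or RDFa instead, which this does not handle.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _walk(node):
    """Yield every dict nested anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def _is_recipe(node):
    """True if the node's @type is (or includes) "Recipe"."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Recipe" in types


def extract_ingredients(html):
    """Return all recipeIngredient strings found in the page's JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    ingredients = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole page
        for node in _walk(data):
            if _is_recipe(node):
                value = node.get("recipeIngredient", [])
                if isinstance(value, str):
                    value = [value]
                ingredients.extend(value)
    return ingredients


# Made-up sample page, only to show the shape of the markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Sample recipe",
 "recipeIngredient": ["6 cups of cooked pinto beans", "1 organic mini cucumber"]}
</script>
</head><body></body></html>
"""

if __name__ == "__main__":
    for sentence in extract_ingredients(SAMPLE_HTML):
        print(sentence)
```

Each returned string is a raw ingredient sentence, so it could be passed straight to parse_ingredient or set aside for labelling.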
At the moment I've just used the FDC's list as a starting point for identifying the foundation foods. There isn't actually a mapping to anything specific in the FDC's list. Adding the category might be possible, but I'm not sure how difficult it would be. It may be as simple as just having a big dict mapping the foundation foods to the category.
More data is always better, but it's best if any new data adds something to the training data that is currently lacking. For example, the BBC dataset largely uses metric units, where the others use US customary units; the allrecipes dataset has lots of branded ingredient names, where the others don't; the cookstr dataset includes lots of long and complex sentences, where other datasets include mostly shorter and simpler sentences.
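To make the "big dict" idea concrete, something like the sketch below is what I have in mind. This is not in the library: FOOD_CATEGORIES and lookup_category are hypothetical names, the pinto beans entry is taken from the FDC example above, and the cucumber entry is an assumption that has not been checked against the FDC data.

```python
from typing import Optional

# Hypothetical lookup table: core ingredient name -> FDC foodCategory description.
# Keys are lowercase so lookups can be case-insensitive.
FOOD_CATEGORIES = {
    "pinto beans": "Legumes and Legume Products",   # from the FDC example in this thread
    "cucumber": "Vegetables and Vegetable Products",  # assumed, not verified against FDC
}


def lookup_category(core_name: str) -> Optional[str]:
    """Return the category for a core ingredient name, or None if it is unmapped."""
    return FOOD_CATEGORIES.get(core_name.strip().lower())


if __name__ == "__main__":
    print(lookup_category("pinto beans"))
    print(lookup_category("dragon fruit"))  # not in the table, so None
```

Returning None for unmapped names keeps the category optional, in the same way that fields such as comment and purpose can be None in the parsed output.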
@strangetom firstly and importantly, thank you for open-sourcing this to the community.
On your labeling data principles you mention
One feature request that could be useful for developers and applications that are not looking for strict recipe adherence is to add a parameter to the parser method parse_ingredient that loosens the strictness of the adjective labeling, e.g. preserve_adjectives: true | false
For example, organic mini cucumber is 100% the correct extraction, and it is necessary for the recipe to adhere to its original intent, but folks might want to loosen it to just cucumber as part of data analysis on recipes.
Thanks!