strangetom / ingredient-parser

A tool to parse recipe ingredients into structured data
https://ingredient-parser.readthedocs.io/en/latest/
MIT License

[Feature request] Optional parameter to loosen adjective strictness on labeling #21

Open mcioffi opened 1 month ago

mcioffi commented 1 month ago

@strangetom firstly and importantly, thank you for open-sourcing this to the community.

On your labeling data principles you mention

Adjectives that are a fundamental part of the ingredient identity should be part of the name ... It is recognised that this can be subjective. Universal correctness is not the main goal of this, only consistency.

One feature request that could be useful for developers and applications that are not looking for strict recipe adherence is to add a parameter to the parse_ingredient method that loosens the strictness of the adjective labeling, e.g. preserve_adjectives=True | False

For example, organic mini cucumber is 100% the correct extraction, and it is necessary for the recipe to adhere to its original intent, but folks might want to loosen it to just cucumber as part of data analysis on recipes.

Thanks!

strangetom commented 1 month ago

Hi @mcioffi

Thanks for the feature request. This certainly seems possible and I'm willing to give it a go.

I can think of two ways we might achieve this:

  1. As part of the post-processing, use the part of speech tags for any tokens labelled as NAME to distinguish between the adjectives and nouns. I think this will work, but it might not be very robust (a rough sketch of this follows the list).
  2. Re-label the underlying model's training data to include a new label to identify the adjectives in the name, and train the model to distinguish between the adjectives and nouns in the name. I think this will be the more robust option, but will obviously require updating the training data which will take a long time.
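
For context, a rough sketch of what option 1 could look like. This is purely illustrative and not part of the library; it assumes NLTK's part of speech tagger and a plain list of the tokens labelled NAME:

import nltk

# nltk.download("averaged_perceptron_tagger") may be needed the first time.

def strip_name_adjectives(name_tokens):
    """Keep only the noun-like tokens from the tokens labelled NAME.

    Illustrative only: the POS tagger knows nothing about ingredients,
    which is exactly the robustness concern mentioned above.
    """
    tagged = nltk.pos_tag(name_tokens)
    nouns = [tok for tok, tag in tagged if tag.startswith("NN")]
    return " ".join(nouns) if nouns else " ".join(name_tokens)

# e.g. strip_name_adjectives(["organic", "mini", "cucumber"]) -> "cucumber"
# (when the tagger tags "organic" and "mini" as adjectives, which isn't guaranteed)
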
mcioffi commented 1 month ago

That's great to hear. I believe folks might side with (2), though that's just me speaking into thin air. The training side would ensure robustness.

The parser engine has helped with a trivia site that we are expanding at theblankdish.com to include a whole repository of recipes instead of the single experimental one it has now.

I would imagine others could benefit from the feature to analyze raw ingredients 👍🏼

strangetom commented 1 month ago

It's always good to hear that people are finding this useful.

After a bit of investigation, option 1 that I suggested above simply won't work, because the part of speech tags don't separate the core ingredient from any other words.

I've started having a look at relabelling the training sentences. I'll probably do a small-ish subset of them to see how it's working, but that will still take a while. Some examples of how I think this will work:

Effectively, we would identify the token(s) that reduces the ingredient to its most fundamental identity.
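
To make that concrete, here is a purely illustrative example of what a relabelled sentence might look like. The extra label name (NAME_MOD) is just a placeholder for whatever the training data ends up using:

# Hypothetical relabelling of "1 organic mini cucumber".
# Words that qualify the name without being part of its fundamental identity
# get the new label; only "cucumber" keeps NAME.
tokens = [
    ("1", "QTY"),
    ("organic", "NAME_MOD"),
    ("mini", "NAME_MOD"),
    ("cucumber", "NAME"),
]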

mcioffi commented 2 weeks ago

@strangetom the tokenization above looks great. I wonder if one corner case is when the token that is traditionally the noun becomes the adjective.

To illustrate: should pepper flakes reduce to just pepper, or remain pepper flakes? Likewise garlic powder versus garlic.

In those types of ingredients, where the texture or form changes but the core ingredient still applies, I wonder what would be best.

strangetom commented 2 weeks ago

My thinking on how to do this has evolved a bit since my last comment. I've decided to use the FDC's Foundation Foods as the starting point for deciding what the fundamental ingredient names should be.

The results will be a little more detailed than I suggested above (e.g. red onion and yellow onion would be separate, instead of both being onion). For the cases you mentioned, it would mean that pepper and pepper flakes are considered different fundamental names, as would garlic and garlic powder.

To give a quick peek at how it currently works:

>>> from ingredient_parser import parse_ingredient
>>> parse_ingredient("1 organic mini cucumber").name
IngredientText(text='organic mini cucumber', confidence=0.938989)
>>> parse_ingredient("1 organic mini cucumber", core_names=True).name
IngredientText(text='cucumber', confidence=0.900238)
>>> parse_ingredient("1 organic mini cucumber", core_names=True).comment
IngredientText(text='organic mini', confidence=0.958365)

This is using a model trained from the core_names branch.

mcioffi commented 3 days ago

Thanks @strangetom. Given that FDC's Foundation Foods is being folded in, do you see any benefit in piping the foodCategory field from their datasets through into the ParsedIngredient class?

It's auxiliary to the goal of this feature, but I took note of it while looking through the FDC's datasets, since it does provide a basic categorization. A rough idea —

FDC Foundation Food item

{
    "description": "Beans, Dry, Pinto (0% moisture)",
    ...
    "foodCategory": {
        "description": "Legumes and Legume Products"
    },
    ...
    "fdcId": 747445,
    "dataType": "Foundation"
}

ParsedIngredient class

ParsedIngredient(
    name=IngredientText(text='pinto beans', confidence=0.999193),
    size=None,
    amount=[IngredientAmount(quantity='6',
                             unit=<Unit('cup')>,
                             text='6 cups',
                             confidence=0.999906,
                             APPROXIMATE=False,
                             SINGULAR=False)],
    preparation=IngredientText(text='cooked', confidence=0.999193),
    comment=None,
    purpose=None,
    category="Legumes and Legume Products",
    sentence='6 cups of cooked pinto beans'
)

Additionally, in parallel to your repo, I am scraping some other sources (e.g. bonappetit, saveur, food52) to add to your recipe training data, since many (if not most) recipe sites and blogs adhere to Google's Structured Data for Recipes markup as part of SEO. Because of this, fetching entire recipe portfolios with ingredients is much more trivial these days than it used to be.
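
As a rough sketch of what that scraping looks like (simplified; this assumes requests and BeautifulSoup, and real pages often nest the Recipe object inside an @graph array):

import json
import requests
from bs4 import BeautifulSoup

def recipe_ingredients(url):
    """Pull the recipeIngredient strings from a page's schema.org Recipe JSON-LD."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # The Recipe object may be top-level, in a list, or inside an @graph array.
        candidates = data if isinstance(data, list) else data.get("@graph", [data])
        for obj in candidates:
            if isinstance(obj, dict) and "Recipe" in str(obj.get("@type", "")):
                return obj.get("recipeIngredient", [])
    return []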

strangetom commented 3 hours ago

At the moment I've just used the FDC's list as a starting point for identifying the foundation foods; there isn't actually a mapping to anything specific in the FDC's list. Adding the category might be possible, but I'm not sure how difficult it would be. It may be as simple as just having a big dict mapping the foundation foods to the category.
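
If it does turn out to be that simple, the sketch would be roughly a lookup like the one below (the entries are only illustrative; an actual mapping would be generated from the FDC Foundation Foods dataset):

# Hypothetical mapping from a fundamental ingredient name to its FDC food category.
FOUNDATION_FOOD_CATEGORIES = {
    "pinto beans": "Legumes and Legume Products",
    "cucumber": "Vegetables and Vegetable Products",
    "onion": "Vegetables and Vegetable Products",
}

def food_category(core_name):
    return FOUNDATION_FOOD_CATEGORIES.get(core_name.lower())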

More data is always better, but it's best if any new data adds something to the training data that is currently lacking. For example, the BBC dataset largely uses metric units, where the others use US customary units; the allrecipes dataset has lots of branded ingredient names, where the others don't; the cookstr dataset includes lots of long and complex sentences, where the other datasets include mostly shorter and simpler ones.