hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.73k stars 531 forks

Should informational lines be included in ingredients? #711

Closed jayhale closed 1 year ago

jayhale commented 1 year ago

Consolidating feedback regarding informational lines here, and closing per-scraper feedback (#712).

Issue

Currently, scraper.ingredients() can include informational lines that do not represent ingredients (see the examples below). These lines carry information about preparation, most often by grouping ingredients.

Possible resolutions

Group ingredients: Expose a new scraper.grouped_ingredients() that retains the grouping information available from some pages, or defaults to a single group. Each group could include information such as a title (e.g., For the Dough).

Ignore informational lines: Specify that scrapers implement scraper.ingredients() in a way that avoids informational lines wherever possible.
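To make the first option concrete, here is a minimal sketch of what grouped output could look like. The names IngredientGroup, as_single_group, and flatten are hypothetical, chosen only for illustration; none of them is confirmed to be part of the recipe-scrapers API. The sample lines are taken from the Serious Eats example in this issue.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IngredientGroup:
    # Hypothetical shape for illustration; not the library's confirmed API.
    title: Optional[str]  # e.g. "For the Sauce", or None when the page has no groups
    ingredients: List[str] = field(default_factory=list)


def as_single_group(ingredients: List[str]) -> List[IngredientGroup]:
    """Fallback for pages without grouping information: one untitled group."""
    return [IngredientGroup(title=None, ingredients=list(ingredients))]


def flatten(groups: List[IngredientGroup]) -> List[str]:
    """Recover a flat, ingredient-only list from the groups."""
    return [item for group in groups for item in group.ingredients]


groups = [
    IngredientGroup("For the Sauce", ["1 teaspoon dried oregano", "3 ounces tomato paste"]),
    IngredientGroup("To Assemble", ["2 ounces Parmigiano-Reggiano"]),
]
ingredient_count = len(flatten(groups))
```

With this shape, a flat ingredient-only list is always derivable from the groups, so a question like "how many ingredients does this recipe have?" does not need to know about the titles.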

Impacted scrapers

Examples have been identified for these scrapers:

Examples

For https://www.seriouseats.com/new-england-greek-style-pizza:

# Contents of `scrape_me(url).ingredients()`:

For the Dough:                      # Undesired informational line
400 grams (14 ounces, about 2 1/2
4 grams (about 1 teaspoon) inst...
8 grams (about 1 tablespoon) ko...
2 tablespoons extra virgin oliv...
260 grams (about 1 cup plus 2 t...
For the Sauce:                      # Undesired informational line
2 tablespoons extra virgin oliv...
2 medium cloves garlic, grated...
1 teaspoon dried oregano
1/4 teaspoon red pepper flakes
3 ounces tomato paste
1 (28-ounce) can crushed tomatoes
Kosher salt
To Assemble:                        # Undesired informational line
2 tablespoons vegetable shorte...
2 tablespoons extra-virgin oliv...
8 ounces freshly grated whole m...
8 ounces freshly grated white c...
2 ounces Parmigiano-Reggiano

For https://www.finedininglovers.com/recipes/bbq-watermelon-sashimi-secrets-fine-dining:

# Contents of `scrape_me(url).ingredients()`:
For the Watermelon Sorbet           # Undesired informational line
Watermelon
Lime
Syrup
For the Miso Glazed Watermelon      # Undesired informational line
Watermelon
Dark Miso
Sake
Mirin
Soy Sauce
For the Cured & BBQed Watermelon    # Undesired informational line
Watermelon
Kosher Salt
Smoked paprika
Onion Powder
Ginger Powder
Chili Flakes
# ...

ggilley commented 1 year ago

I would argue that this is important information to the recipe. It relates the ingredient list to the instructions.

jayhale commented 1 year ago

@ggilley agreed. Perhaps this would be better represented as groups of ingredients, since that is the semantic intent. However, I expect the method .ingredients() to return only ingredients (e.g., so that it can readily answer questions like "How many ingredients does this recipe have?").

ggilley commented 1 year ago

Agreed that if the site has formatting information that separates the informational lines, it should be captured here. However, in general, deciding whether a line is an ingredient or informational is a hard problem, and I don't think it belongs in the scraper.

ggilley commented 1 year ago

A simple way of dealing with them could be to prefix each informational line with a marker like '# '. That way the ordering is preserved and you have an indicator of the special nature of the line.
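A rough sketch of how a consumer might handle such marked lines, assuming scrapers emitted the '# ' marker suggested here (they do not today). MARKER and split_marked are hypothetical names used only for illustration; the sample lines come from the Fine Dining Lovers example in this issue.

```python
from typing import List, Optional, Tuple

MARKER = "# "  # assumed convention; not something recipe-scrapers emits


def split_marked(lines: List[str]) -> List[Tuple[Optional[str], List[str]]]:
    """Split a flat list into (title, ingredients) pairs, preserving order."""
    groups: List[Tuple[Optional[str], List[str]]] = []
    title: Optional[str] = None
    items: List[str] = []
    for line in lines:
        if line.startswith(MARKER):
            # A new informational line closes the group collected so far.
            if title is not None or items:
                groups.append((title, items))
            title = line[len(MARKER):]
            items = []
        else:
            items.append(line)
    if title is not None or items:
        groups.append((title, items))
    return groups


lines = [
    "# For the Watermelon Sorbet",
    "Watermelon",
    "Lime",
    "Syrup",
    "# For the Miso Glazed Watermelon",
    "Watermelon",
    "Dark Miso",
]

# Ingredient-only view: drop every marked line.
only_ingredients = [line for line in lines if not line.startswith(MARKER)]
grouped = split_marked(lines)
```

One caveat with this convention is that a genuine ingredient line beginning with '# ' would be misread as informational, so the marker would need to be something that cannot occur in real ingredient text.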

jayaddison commented 1 year ago

This sounds similar to #301 - the indicated ingredient lines are groupings.

The NIH scraper includes experimental support for an IngredientGroup dataclass - could that be relevant here too?

hhursev commented 1 year ago

"However, in general deciding that an ingredient line is an ingredient or informational is a hard problem and I don't think it belongs in the scraper." 💯

In this version of the package, as well as in the future, .ingredients() will behave as it does now.

On a side note, a pip install recipe-scrapers[extras] version of the package is being evaluated. It would incorporate more serious tools that should fit your needs better.

Closing, as this won't be addressed in the next 3-6 months. Apologies if it's a really wanted feature.