ssnepenthe / recipe-scraper

A library for scraping recipes from popular recipe sites.
GNU General Public License v2.0

Breaking Ingredients section #18

vjaykoogu closed this issue 5 years ago

vjaykoogu commented 6 years ago

How about breaking each entry in the ingredients section into its parts, as below? Example:

array(4) {
  ["quantity"]=> string(1) "1"
  ["unit"]=> string(3) "lb."
  ["info"]=> string(19) "peeled and deveined"
  ["name"]=> string(6) "shrimp"
}

ssnepenthe commented 6 years ago

@vjaykoogu thanks for the suggestion!

I am open to being convinced otherwise, but I think this goes beyond the scope of a simple scraper.

The problem is that few, if any, of these sites have any sort of structured markup related to the various parts of an ingredient line.

If we can't reason about the structure of an ingredient line from the markup provided, we end up having to write a complete ingredient parser.

Unfortunately, there is no real standard for ingredient lines. This SO answer does a pretty good job covering some of the complexity you might find.
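To make the difficulty concrete, here is a minimal sketch in Python of the kind of naive, regex-based parser one might try first. It is not part of recipe-scraper: the function name `parse_ingredient`, the `UNITS` list, and the assumed "quantity unit name, info" line shape are all hypothetical choices for illustration. It handles the shrimp example from this issue but gives up on many ordinary lines.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny unit list; real recipes use many more.
UNITS = r"(?:lb\.?|lbs\.?|oz\.?|cups?|tbsp\.?|tsp\.?)"

# Assumed line shape: "<quantity> <unit> <name>[, <info>]".
LINE = re.compile(
    rf"^\s*(?P<quantity>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s+"
    rf"(?P<unit>{UNITS})\s+"
    r"(?P<name>[^,]+?)\s*(?:,\s*(?P<info>.+?))?\s*$",
    re.IGNORECASE,
)


def parse_ingredient(line: str) -> Optional[dict]:
    """Return the parts of an ingredient line, or None if it does not fit."""
    match = LINE.match(line)
    if match is None:
        return None
    # "info" is None when the line has no trailing comma-separated note.
    return match.groupdict()


if __name__ == "__main__":
    for text in (
        "1 lb. shrimp, peeled and deveined",  # fits the assumed shape
        "1 1/2 cups flour",                   # fits, no info part
        "2 large eggs",                       # no unit: not parsed
        "salt and pepper to taste",           # no quantity: not parsed
    ):
        print(text, "->", parse_ingredient(text))
```

Every line that comes back as `None` would need another rule (unitless counts, parenthesized package sizes, ranges, "to taste", and so on), which is how a quick helper grows into a complete ingredient parser.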

With that in mind, I think this would be better suited for an entirely separate library.

In fact, the New York Times has put out a pretty solid-looking Python tool for just this purpose:

https://github.com/NYTimes/ingredient-phrase-tagger

And a corresponding blog post with a little background information:

https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/