hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Support other mealkits than HelloFresh (Purple Carrot, GreenChef, MarleySpoon, BlueApron, EveryPlate, HomeChef, Dinnerly, Sunbasket, #193

Closed bredowmax closed 1 year ago

bredowmax commented 4 years ago

I love that you support HelloFresh! Quite a few other meal kits also expose their recipes, and I would love it if you could include them!

Example recipes from meal kits include:
https://www.purplecarrot.com/recipe/roasted-cauliflower-lentil-bowl-with-avocado-curried-balsamic-vinaigrette
https://cdn2.greenchef.com/uploaded/5f08a68ef6ec4700147ba8a1.pdf
https://marleyspoon.com/menu/52491-lemon-herb-chicken-with-garlicky-yogurt-green-beans
https://www.blueapron.com/recipes/bbq-chickpeas-farro-with-corn-cucumbers-hard-boiled-eggs-3
https://www.everyplate.com/recipes/garlic-rosemary-chicken-5efde75bfff7c66c36680eca?week=2020-W31
https://www.homechef.com/meals/steak-and-bacon-blue-cheese-butter
https://dinnerly.com/menu/50620-skillet-ravioli-lasagna-with-mozzarella-parmesan
https://sunbasket.com/protein/boneless-skinless-chicken-breast-strips

--- EDIT INTO CHECKLIST ---

ptindall commented 4 years ago

I just submitted a pull request for Sunbasket.

https://github.com/hhursev/recipe-scrapers/pull/235

webbastelbude commented 3 years ago

Did somebody look at dinnerly.com by any chance? I have the impression that they are loading the recipe data via JavaScript. When I look at the source code in the browser or try the

python3 generate.py Dinnerly

command, the source code doesn't contain any recipe data. Any idea on how to get around that?

micahcochran commented 3 years ago

@webbastelbude Your suspicion is correct: Dinnerly loads its recipe data using JavaScript.

I used dryscrape to render the page and was able to get the content. The problem is that it requires Qt and QtWebKit. Apparently QtWebKit is End of Life, which means dryscrape is not maintained. (The section at the bottom of this tutorial told me about dryscrape.) The resulting HTML isn't in schema.org/Recipe format, but it could still be useful for a first draft of the driver.

I tried a few more similar libraries, but did not get results. Any Python package that renders JavaScript will most likely have some pretty heavy requirements.
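
Before reaching for a renderer, it may be worth checking whether a page's HTML carries schema.org/Recipe data at all. Here is a minimal sketch using only the standard library. The names `find_recipes` and `_JsonLdCollector` are made up for illustration and are not part of recipe-scrapers, and it only looks at JSON-LD script tags, not microdata or other markup.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def find_recipes(html):
    """Return every JSON-LD object in the HTML whose @type includes "Recipe"."""
    collector = _JsonLdCollector()
    collector.feed(html)
    recipes = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        # JSON-LD may be a single object, a list, or wrapped in "@graph"
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            for node in item.get("@graph", [item]):
                if not isinstance(node, dict):
                    continue
                types = node.get("@type", [])
                if isinstance(types, str):
                    types = [types]
                if "Recipe" in types:
                    recipes.append(node)
    return recipes
```

Running `find_recipes` on the HTML from a plain request and again on HTML produced by a rendering tool shows whether the recipe data only appears after JavaScript runs. An empty result only means no JSON-LD recipe is embedded; the data may still be present in some other form.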

micahcochran commented 3 years ago

thespruceeats.com already has a parser, but once its JavaScript runs, the page populates its schema data as LD+JSON.

Here is the code for dryscrape in case you want to test this out with any URL that you suspect uses JavaScript data hiding. It writes the rendered page to a file, out.html.

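import dryscrape

# Any URL that you suspect hides its recipe data behind JavaScript
url = "https://www.thespruceeats.com/wonton-soup-5074586"

# Render the page in a dryscrape session so that its scripts run
sess = dryscrape.Session()
sess.visit(url)
source = sess.body()

# Save the rendered HTML for inspection
with open("out.html", "w", encoding="utf-8") as fp:
    fp.write(source)

stroodle96 commented 2 years ago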

Homechef.com was added in #512. Marleyspoon.com was added in #534.

I submitted #535 for everyplate.com.

mattjmeier commented 2 years ago

I submitted #578 for Chef's Plate - didn't realize there was already an issue specifically for mealkits. Maybe it can be added to this list.