PatrickPierce commented 4 years ago

These are the following issues I found with current scrapers. I will update the list as I check the others.

Issues

~~https://www.allrecipes.com~~ https://www.bonappetit.com ~~https://cookpad.com/~~ ~~https://www.cookstr.com/~~ ~~https://copykat.com~~ ~~https://geniuskitchen.com~~ ~~https://giallozafferano.it/~~ https://gonnawantseconds.com/ https://healthyeating.nhlbi.nih.gov/ https://heinzbrasil.com.br/ https://hellofresh.com/ https://justbento.com/ https://www.matprat.no/ https://www.seriouseats.com https://www.southernliving.com/ ~~https://steamykitchen.com/~~ https://www.thespruceeats.com/ https://thehappyfoodie.co.uk/ https://www.twopeasandtheirpod.com/ https://whatsgabycooking.com/ https://www.yummly.com/

[x] Site: https://www.allrecipes.com
Test: https://www.allrecipes.com/recipe/221093/good-frickin-paprika-chicken/
Issue: Missing total time
Test 2: https://www.allrecipes.com/recipe/35753/scott-hibbs-amazing-whisky-grilled-baby-back-ribs
Issue: Missing servings
[ ] Site: https://www.bonappetit.com/
Test: https://www.bonappetit.com/story/gourmet-reeses-peanut-butter-cups
Issue: Some recipes are presented as a blog story, but actual recipes work. Add support for recipes and "story" layout.
Working example: https://www.bonappetit.com/recipe/sticky-sweet-grilled-pork-shoulder
[x] Site: https://cookpad.com/
Test: https://cookpad.com/us/recipes/12634056-tik-tok-viral-pancake-cereal
Issue: Missing data (title, servings, instructions, ingredients)
[x] Site: https://www.cookstr.com/
Test 1: https://www.cookstr.com/Chili-Recipes/Quinoa-Chili
Test 2: https://www.cookstr.com/recipes/chopped-liver-the-way-my-mother-makes-it
Issue: Missing servings. Both test have servings on the page, but they do not follow a standard. Test 1 is "Serves: 10 cups (2.36 L)" while Test 2 is "Makes: 6 to 8 appetizer servings".
[x] Site: https://copykat.com
Test: https://copykat.com/boston-market-macaroni-and-cheese-forget-the-stuff-in-the-blue-box-take-a-few-more-minutes-and-serve-up-a-tasty-home-made-macaroni-and-cheese/
Issue: Missing servings. Instructions are in a python list.
[x] Site: https://geniuskitchen.com
Issue: No longer needed. Genius Kitchen is now own by food.com
[x] Site: https://giallozafferano.it/
Test 1: https://ricette.giallozafferano.it/Muffin-con-gocce-di-cioccolato.html
Test 2: https://ricette.giallozafferano.it/Chili-con-carne.html
Issue: Servings is working for Test 2, but missing from Test 1.
[ ] Site: https://gonnawantseconds.com/
Test: https://www.gonnawantseconds.com/balsamic-chicken/
Issue: Instructions are in a python list. Missing servings. (INFO:root:exception_handling silencing exception: 'NoneType' object has no attribute 'get_text' '')
[ ] Site: https://healthyeating.nhlbi.nih.gov/
Issue: Issue with secure website certificate. Will need to test later.
[ ] Site: https://justbento.com/
Test 1: https://justbento.com/handbook/recipe-collection-mains/soft-polenta-dinner-fried-polenta-cakes-bento
Test 2: https://justbento.com/handbook/recipe-collection-mains/oven-baked-spicy-asian-chicken-wings
Test 3: https://justbento.com/handbook/recipes-sides-and-fillers/bento-filler-raw-asparagus-radish-and-parmesan
Issue: Each posting has a different layout and causes some information to scrape incorrectly. Test 1 does not scrape the instructions. Yield defaults to 1. Total time seems to default 0 or 10.
Note: If yield support is added, some recipes will display both yield and serving size (Test 2).
Note: Total time will grab the tag at bottom of page. The tag does not always match the time listed in directions. If the content in tag is not a number, it will default to 0.
- Test 1 total_time output: 10
- Test 1 Tag: Time required: 5-10 minutes.
- Test 1 Description: Prep time: 5 min :: Cook time: 30-40 min (for the base polenta) :: Total time: 35-40 minutes min (plus 5-10 min to fry the polenta cakes)
- Test 2 total_time output: 0
- Test 2 Tag: Time required: more than an hour
- Test 2 Description: Prep time: 15 min :: Cook time: 20 min :: Total time: 35 min - does not include marinating time
[ ] Site: https://www.matprat.no/
Test: https://www.matprat.no/oppskrifter/gjester/fiskegryte-med-limetouch/
Issue: Not able to return ratings (INFO:root:exception_handling silencing exception: 'NoneType' object has no attribute 'get_text' -1)
[ ] Site: https://www.seriouseats.com
Test 1: https://www.seriouseats.com/recipes/2017/06/perfect-michelada-mexican-beer-cocktail-recipe.html
Test 2: https://www.seriouseats.com/recipes/2016/10/unusual-negroni-cocktail-aperol-lillet-hendricks-recipe.html
Issue: Rating default to -1.
[ ] Site: https://www.southernliving.com/
Test: https://www.southernliving.com/recipes/caramel-pineapple-bread-pudding
Test 2: https://www.southernliving.com/recipes/banana-fritters
Issue: Instructions return as python list. Ingredient will return whitespace on the left if there is no measurement. Test 1: Chopped toasted macadamia nuts. Test 2: Canola oil and Powered sugar
[x] Site: https://steamykitchen.com/
Issue: Scraper only returns url.
[ ] Site: https://www.thespruceeats.com/
Test: https://www.thespruceeats.com/sunny-side-up-eggs-4797347
Issue: Some recipes will return empty/null for servings if there is nothing list. Not sure if a default of 0 or -1 should be added.
[ ] Site: https://thehappyfoodie.co.uk/
Test: https://thehappyfoodie.co.uk/recipes/quinoa-cakes-with-chimichurri-yoghurt
Issue: Recipes that have a listed time returns 0.
[ ] Site: https://www.twopeasandtheirpod.com/
Test 1: https://www.twopeasandtheirpod.com/sprinkle-chocolate-chip-cookies/
Test 2: https://www.twopeasandtheirpod.com/creamy-goat-cheese-arugula-farro-salad/
Issue: Rating returns as -1
[ ] Site: https://whatsgabycooking.com/
Test 1: https://whatsgabycooking.com/caramelized-onion-flatbread/
Test 2: https://whatsgabycooking.com/spicy-salmon-sushi-bowls/
Issue: Servings is not being scraped if listed.
[ ] Site: https://www.yummly.com/
Test 1: https://www.yummly.com/recipe/Easy-5-Ingredient-Chicken-Enchiladas-9073068
Test 2: https://www.yummly.com/recipe/Skillet-Chicken-Tacos-9063800
Issue: Missing total time and instructions.

PatrickPierce commented 4 years ago

Think I tested them all. I did not test any scrapers added in the last 30 days on the assumption that they are still on-line and have not changed or updated their layout.

I only tested the following properties.

ingredients
instructions
ratings
title
total_time
url
yields

hhursev commented 4 years ago

Thanks for the input! I'll try to square away the majority of the issues this weekend 🤞🤞

hhursev commented 4 years ago

Sorry for the massive delay with this. Promise I'll take a look and thank you for the time you spent working on this

bfcarpio commented 3 years ago

I just tested cookpad.com and found it to be working. We likely fixed the issues with schema improvements. Checking the box.

weightwatchers-carlanderson commented 3 years ago

@hhursev is this a good thread to bring up the issue that tests based on cached HTML can provide a false sense of security? There is a test for realsimple in this testsuite--it passes--but for that same URL, I'm seeing

scraper = scrape_me('https://www.realsimple.com/food-recipes/browse-all-recipes/vanilla-cheesecake')
print(scraper.title())
print(scraper.total_time())
print(scraper.yields())
print(scraper.ingredients())
print(scraper.instructions().split("\n"))
print(scraper.image())
print(scraper.host())

is only producing

540
9 item(s)
[]
['']
https://imagesvc.meredithcorp.io/v3/mm/image?url=https%3A%2F%2Fstatic.onecms.io%2Fwp-content%2Fuploads%2Fsites%2F23%2F2013%2F02%2F25%2Fginger-graham-crust.jpg
realsimple.com

i.e. it no longer gets the ingredients or instructions. It looks like the css classname has changed in the HTML.

Would it be useful to also have test suite that can be configured to run live requests to the sites?

jayaddison commented 3 years ago

@weightwatchers-carlanderson You should be able to run the existing tests using live requests by running the test suite with the --online flag via pytest. For example: py.test tests --online.

weightwatchers-carlanderson commented 3 years ago

@jayaddison Thanks for sharing.

With cached HTML, I'm seeing 1091 test passed.

However, running online test suite, I'm seeing 157 failed, 855 passed, 79 skipped, 2 warnings which is one or more tests failing in 56 different sites. This is really masking that much of this functionality is broken. I know that scraping is super fragile and breaks all the time due to CSS and other changes in source sites but is there a way to expose which of these scrapers is functional at the current time and where we need help from the community to help fix and update these scrapers?

These live tests are slow to run but how about running the suite for each release and producing some table to expose which site are actually functional?

jayaddison commented 3 years ago

Good feedback, thanks @weightwatchers-carlanderson. Given that most of the scraper tests are checking individual properties of each scrape result, that pass rate is, if anything, a bit higher than I'd expected :)

Running the tests per-release is an interesting idea; in practice I worry we'd encounter network connectivity issues and other intermittent problems like site outages (and potentially rate limiting) often enough that it'd cause more problems than the value it would provide (especially if the results can't be guaranteed accurate).

There is one idea I can suggest which would seem to solve a lot of the problems. If we had access to an archive of versioned copies of each of the recipe pages, then we could test the scrapers against those without risk of network issues at statistics-gathering-time.

What do you think might be a sensible way ahead? (and would you be interested in helping out?)

weightwatchers-carlanderson commented 3 years ago

If we had access to an archive of versioned copies of each of the recipe pages

@jayaddison I don't understand. We do have such an archive. It is the test_data/*.testhtml and that is the whole problem here. Any archive will be out of sync from the real world and people who are trying to run scrapers are getting the impression that scraper suite is fully functional (from the non online test suite) but in reality 56 sites are broken in some way if you were to run them in the real world today. A dashboard of state once per release is out of date too but it is at least better information for a user.

I understand the concern about timeout and so on, but I was thinking of some some script that would run the online test suite and write the output to file. A python script could parse the text output and produce a super simple HTML file with a table, one row per site with: green (all test for this site passed), red (tests ran but 1 or more failed for this site) and say yellow (test could not run or was skipped). There are probably ways to make this more robust to timeout and other issues.

Anyway, this HTML document could be produced as an artifact of the release build process

jayaddison commented 3 years ago

@weightwatchers-carlanderson Producing a scraper health report like that is a good idea; the details could be useful to discuss a bit more, though. Should we break this topic out into a feature request?

weightwatchers-carlanderson commented 3 years ago

@jayaddison yes, good idea. I can kick that off later today.

jayaddison commented 2 years ago

Note-to-self: visit the remaining unsolved items here and create separate tickets for currently-broken scrapers.

hhursev commented 1 year ago

Closing due to inactivity.

Health-check step in the CI is considered though to automate this.

hhursev / recipe-scrapers

Broken Scrapers as on May 22, 2020 #162

Issues