hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Broken Scrapers as of May 22, 2020 #162

Closed. PatrickPierce closed this issue 1 year ago.

PatrickPierce commented 4 years ago

These are the issues I found with the current scrapers. I will update the list as I check the others.

Issues

- [ ] https://www.allrecipes.com
- [ ] https://www.bonappetit.com
- [x] https://cookpad.com/
- [ ] https://www.cookstr.com/
- [ ] https://copykat.com
- [ ] https://geniuskitchen.com
- [ ] https://giallozafferano.it/
- [ ] https://gonnawantseconds.com/
- [ ] https://healthyeating.nhlbi.nih.gov/
- [ ] https://heinzbrasil.com.br/
- [ ] https://hellofresh.com/
- [ ] https://justbento.com/
- [ ] https://www.matprat.no/
- [ ] https://www.seriouseats.com
- [ ] https://www.southernliving.com/
- [ ] https://steamykitchen.com/
- [ ] https://www.thespruceeats.com/
- [ ] https://thehappyfoodie.co.uk/
- [ ] https://www.twopeasandtheirpod.com/
- [ ] https://whatsgabycooking.com/
- [ ] https://www.yummly.com/


PatrickPierce commented 4 years ago

I think I tested them all. I did not test any scrapers added in the last 30 days, on the assumption that those sites are still online and have not changed their layout.

I only tested the following properties.

hhursev commented 4 years ago

Thanks for the input! I'll try to square away the majority of the issues this weekend 🤞🤞

hhursev commented 4 years ago

Sorry for the massive delay with this. I promise I'll take a look, and thank you for the time you spent working on this.

bfcarpio commented 3 years ago

I just tested cookpad.com and found it to be working. We likely fixed the issues with schema improvements. Checking the box.
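
For context on what "schema improvements" buys us: many sites embed their recipe as schema.org JSON-LD, which survives CSS changes. Below is a minimal, illustrative sketch of that extraction path; it is not recipe-scrapers' internal code, and the function name and parsing rules are assumptions.

# Illustrative only: recover recipe data from schema.org JSON-LD markup,
# with no site-specific CSS selectors. Not recipe-scrapers' internal code.
import json

import requests
from bs4 import BeautifulSoup

def find_schema_recipe(url):
    """Return the first schema.org Recipe object embedded as JSON-LD, if any."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        # JSON-LD may be a single object, a list of objects, or an @graph.
        if isinstance(data, list):
            items = data
        elif isinstance(data, dict):
            items = data.get("@graph", [data])
        else:
            continue
        for item in items:
            if isinstance(item, dict) and item.get("@type") in ("Recipe", ["Recipe"]):
                return item  # carries name, recipeIngredient, recipeInstructions, ...
    return None

A site whose JSON-LD stays intact keeps working through layout changes, which is presumably what happened with cookpad.com here.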

weightwatchers-carlanderson commented 3 years ago

@hhursev is this a good thread to bring up the issue that tests based on cached HTML can provide a false sense of security? There is a test for realsimple in this test suite, and it passes, but for that same URL, I'm seeing

from recipe_scrapers import scrape_me

scraper = scrape_me('https://www.realsimple.com/food-recipes/browse-all-recipes/vanilla-cheesecake')
print(scraper.title())
print(scraper.total_time())
print(scraper.yields())
print(scraper.ingredients())
print(scraper.instructions().split("\n"))
print(scraper.image())
print(scraper.host())

is only producing

540
9 item(s)
[]
['']
https://imagesvc.meredithcorp.io/v3/mm/image?url=https%3A%2F%2Fstatic.onecms.io%2Fwp-content%2Fuploads%2Fsites%2F23%2F2013%2F02%2F25%2Fginger-graham-crust.jpg
realsimple.com

i.e. it no longer gets the ingredients or instructions. It looks like the CSS class name has changed in the HTML.

Would it be useful to also have a test suite that can be configured to run live requests against the sites?

jayaddison commented 3 years ago

@weightwatchers-carlanderson You should be able to run the existing tests using live requests by running the test suite with the --online flag via pytest. For example: py.test tests --online.

weightwatchers-carlanderson commented 3 years ago

@jayaddison Thanks for sharing.

With cached HTML, I'm seeing 1091 tests passed.

However, running the online test suite, I'm seeing 157 failed, 855 passed, 79 skipped, 2 warnings, which amounts to one or more tests failing for 56 different sites. The cached tests are really masking that much of this functionality is broken. I know that scraping is super fragile and breaks all the time due to CSS and other changes on source sites, but is there a way to expose which of these scrapers are functional at the current time, and where we need help from the community to fix and update them?

These live tests are slow to run, but how about running the suite for each release and producing a table to show which sites are actually functional?

jayaddison commented 3 years ago

Good feedback, thanks @weightwatchers-carlanderson. Given that most of the scraper tests are checking individual properties of each scrape result, that pass rate is, if anything, a bit higher than I'd expected :)

Running the tests per-release is an interesting idea; in practice, I worry we'd encounter network connectivity issues and other intermittent problems like site outages (and potentially rate limiting) often enough that it'd cause more problems than it's worth (especially if the results can't be guaranteed accurate).

There is one idea I can suggest that would seem to solve a lot of the problems: if we had access to an archive of versioned copies of each of the recipe pages, then we could test the scrapers against those without risk of network issues at statistics-gathering time.
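
One minimal sketch of that archive idea, assuming per-site folders of datestamped snapshots; the paths and function name here are hypothetical:

# Hypothetical sketch: fetch a fresh, datestamped snapshot of a site's
# sample page so scrapers can later be scored against it offline.
import datetime
import pathlib

import requests

ARCHIVE = pathlib.Path("tests/archive")  # assumed location, not an existing dir

def snapshot(site, url):
    stamp = datetime.date.today().isoformat()
    dest = ARCHIVE / site / ("%s.testhtml" % stamp)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(requests.get(url, timeout=30).text, encoding="utf-8")
    return dest

# e.g. snapshot("allrecipes", "https://www.allrecipes.com/recipe/...")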

What do you think might be a sensible way ahead? (and would you be interested in helping out?)

weightwatchers-carlanderson commented 3 years ago

If we had access to an archive of versioned copies of each of the recipe pages

@jayaddison I don't understand. We do have such an archive: the test_data/*.testhtml files, and that is the whole problem here. Any archive will be out of sync with the real world, and people who try to run the scrapers get the impression that the scraper suite is fully functional (from the non-online test suite), when in reality 56 sites are broken in some way if you run them in the real world today. A dashboard of their state once per release would be out of date too, but it is at least better information for a user.

I understand the concern about timeouts and so on, but I was thinking of a script that would run the online test suite and write the output to a file. A Python script could then parse that output and produce a very simple HTML file with a table, one row per site: green (all tests for this site passed), red (tests ran but one or more failed for this site), and, say, yellow (tests could not run or were skipped). There are probably ways to make this more robust to timeouts and other issues.

Anyway, this HTML document could be produced as an artifact of the release build process.
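
A rough sketch of such a script, assuming the online suite is run with pytest's JUnit XML output (e.g. pytest tests --online --junitxml=report.xml); the file names and the way test classnames map to sites are assumptions:

# Hypothetical sketch: turn a pytest JUnit XML report into a per-site
# traffic-light HTML table, as proposed above.
import xml.etree.ElementTree as ET
from collections import defaultdict

COLORS = {"failed": "red", "skipped": "yellow", "passed": "green"}

def case_status(case):
    if case.find("failure") is not None or case.find("error") is not None:
        return "failed"
    if case.find("skipped") is not None:
        return "skipped"
    return "passed"

def collect(report_path="report.xml"):
    per_site = defaultdict(set)
    for case in ET.parse(report_path).getroot().iter("testcase"):
        # Assumption: the JUnit classname ends with the per-site test class,
        # e.g. "tests.test_allrecipes.TestAllRecipes" -> "TestAllRecipes".
        site = case.get("classname", "unknown").split(".")[-1]
        per_site[site].add(case_status(case))
    return per_site

def overall(statuses):
    # Red beats yellow beats green, matching the traffic-light idea above.
    for level in ("failed", "skipped", "passed"):
        if level in statuses:
            return level

def to_html(per_site):
    rows = "\n".join(
        '<tr><td>%s</td><td style="background:%s">%s</td></tr>'
        % (site, COLORS[overall(statuses)], overall(statuses))
        for site, statuses in sorted(per_site.items())
    )
    return "<table>\n<tr><th>Site</th><th>Status</th></tr>\n%s\n</table>" % rows

if __name__ == "__main__":
    with open("scraper_health.html", "w") as f:
        f.write(to_html(collect()))

Parsing pytest's machine-readable JUnit XML rather than its text output would keep the report robust to formatting changes in pytest itself.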

jayaddison commented 3 years ago

@weightwatchers-carlanderson Producing a scraper health report like that is a good idea; the details could be useful to discuss a bit more, though. Should we break this topic out into a feature request?

weightwatchers-carlanderson commented 3 years ago

@jayaddison yes, good idea. I can kick that off later today.

jayaddison commented 2 years ago

Note-to-self: visit the remaining unsolved items here and create separate tickets for currently-broken scrapers.

hhursev commented 1 year ago

Closing due to inactivity.

A health-check step in the CI is being considered, though, to automate this.