hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Feature Request: system health information #321

Open · weightwatchers-carlanderson opened this issue 3 years ago

weightwatchers-carlanderson commented 3 years ago

Test suite passing tests != system health

While there is a robust test suite (1000+ tests at the time of writing), it is offline, i.e. based on checked-in, cached HTML webpages. While those pages help ensure that refactoring doesn't break existing functionality, they don't protect against changes to websites in the real world. Web scraping is notoriously fragile: domains change content and CSS classes all the time. Thus, a test that passes on the cached HTML does not necessarily mean the scraper works on the live version of the site.

Today, I ran the test suite in the default offline mode and 1091 tests passed. I then ran it with the --online option and 157 tests failed. Importantly, this meant that scrapers for 56 different sites were broken: the code would not work as advertised when used with live requests, even though the offline tests passed. In other words, the 1091 green lights from the default offline test suite masked the fact that many of these scrapers are broken.

A suggested solution

Ideally, online tests would be used for every test suite run; this would provide a true signal of repo health. However, they are very slow and are subject to timeouts and other issues that could cause a test failure independent of the code being tested.

As a compromise, I wonder whether we could run some system health check for each release. While this might still represent stale information if a site changed its CSS after that release, it would at least provide more current information and a more accurate representation of how the code behaves in the real world, as best we can determine at the release date.

As part of the build process, I wonder if we could run a script (using a GitHub Action?) that would produce a static artifact showing system health. For instance, a simple script could run the test suite in online mode, parse the output, and produce an HTML document containing a simple table with one row per site.

(One could imagine that it could also automatically generate tickets for fixes for those broken sites.)
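As a rough illustration (not part of the original request), a script along these lines could run the suite online and emit a one-row-per-site HTML table. It assumes the suite's --online flag mentioned above and that each test id contains the site name; both would need checking against the real test layout.

```python
"""Sketch of a release-time health report generator (illustrative only)."""
import subprocess
import xml.etree.ElementTree as ET

# Run the suite against live sites; failures are expected and are the point.
subprocess.run(
    ["pytest", "--online", "--junitxml=online-results.xml"],
    check=False,
)

# Group results by site, assuming the site name appears in the test classname.
status_per_site = {}
for case in ET.parse("online-results.xml").getroot().iter("testcase"):
    site = case.get("classname", "unknown").split(".")[-1]
    failed = case.find("failure") is not None or case.find("error") is not None
    if failed:
        status_per_site[site] = "broken"
    else:
        status_per_site.setdefault(site, "ok")

rows = "\n".join(
    f"<tr><td>{site}</td><td>{status}</td></tr>"
    for site, status in sorted(status_per_site.items())
)
with open("health.html", "w") as fh:
    fh.write(f"<table>\n<tr><th>site</th><th>status</th></tr>\n{rows}\n</table>\n")
```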

I'd be curious to hear people's thoughts here.

jayaddison commented 3 years ago

Having an automatically updated report to display scraper health sounds great; it'd be useful for maintainers, users and contributors alike :+1:

The green, yellow, red mapping sounds like it'd fit nicely with test result states; so hopefully we could achieve a lot of this using widely-available Python test reporting tools, which is good news.

As you suggest, one option is that we could run the tests against the web, live, on each run. This is my main area of concern at the moment, because I think it'd be ideal if the health reports could be:

- Reliable: the healthcheck should be near-guaranteed to succeed whenever we run it, so that the team doesn't have to spend time investigating failures in producing the report itself.

- Repeatable: if the healthcheck report runs with a fixed set of inputs (particularly the timestamp for the time-of-run, and the commit reference of recipe-scrapers that we're using), then we should receive the same output healthcheck report, giving us confidence about the validity of the contents.

- Actionable: closely tied to repeatable; when someone notices a problem in a healthcheck report, they're able to dig in further and use the precise circumstances and input state to figure out what change occurred (whether it's success-to-failure, performance degradation, or any other type of change).

I don't want to prescribe any solutions here, though I do have some relatively strong (but open to debate) opinions about a way forward. HTML content is required as input for any implementation here -- either live from the web, archived from the repository, or elsewhere -- and I also believe that many system architectures work best when each (small) component is maintained by a team with niche domain expertise.

Those two ideas make me think that using the Internet Archive's Wayback Machine could be a good match here. It can provide HTML content with strong timestamping and integrity guarantees, and collecting that content and making it reliably available is their goal and expertise. In addition, there is a wayback Python library that supports timestamp-based content retrieval, so I think we could, for example, run a healthcheck that asks for the latest version of {page} no later than {timestamp} for each scraper test; that would provide a healthcheck report meeting the three criteria outlined above.
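A minimal sketch of that "latest version of {page} no later than {timestamp}" lookup, assuming the wayback library's WaybackClient search/get_memento interface (the parameter names and result ordering are worth double-checking against its documentation, and the recipe URL is a placeholder):

```python
from datetime import datetime, timezone

from wayback import WaybackClient

client = WaybackClient()
cutoff = datetime(2021, 6, 1, tzinfo=timezone.utc)  # the pinned time-of-run input
url = "https://www.example.com/recipes/pork-chops"  # placeholder recipe page

# CDX search results arrive in capture order, so the last record seen up to
# the cutoff is the latest snapshot no later than the timestamp.
latest = None
for record in client.search(url, to_date=cutoff):
    latest = record

if latest is not None:
    memento = client.get_memento(latest)
    html = memento.text  # archived HTML to feed into the scraper under test
```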


weightwatchers-carlanderson commented 3 years ago

@jayaddison great points. I just noticed that some of these failing online tests are failing because of content changes only: the website changed the wording of the title, the food prep string, or the image URL. The code extracted the right information, but it didn't match the cached reference strings. Those failures are thus not repeatable or reliable; exact string matching like this makes for brittle tests and provides false signals (in the opposite direction of the offline tests). In terms of passing tests, offline can produce false positives and online can produce false negatives.

The wayback idea sounds interesting. It provides some stability, and the no later than {timestamp} provides some tracking of changes in the source sites, but doesn't it just reflect a lagged reality? When a site changes, at some point the change will fall within that timestamp threshold, and we are back to the core mechanism of comparing the (dynamic) HTML against some static, expected extracted info. Then we are back to square one.

Perhaps we could check that it provides the right type of information: that servings returns an int 0--10, that the image URL looks like some JPG image URL, that we got a list of 10 ingredient strings, that we don't get empty strings, -1, or 'None's. This is more in line with a great-expectations view of the data. My concern here would be that bounds that are too lax would not catch coding errors. See https://github.com/hhursev/recipe-scrapers/issues/313, where the code was returning an int but was really broken by a release and returning the wrong int. Conversely, if the bounds are too tight, then we run the risk of false positives based on content changes.
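A minimal sketch of such field-level expectations, using plain assertions rather than the Great Expectations library itself; the bounds and the image-URL pattern are illustrative assumptions, and the accessor names follow the scraper interface (title, yields, image, ingredients, total_time):

```python
import re

def check_scraper_output(scraper):
    """Loose, content-agnostic expectations; thresholds here are placeholders."""
    title = scraper.title()
    assert isinstance(title, str) and title.strip(), "title should be a non-empty string"

    image = scraper.image()
    assert re.match(r"^https?://\S+\.(?:jpe?g|png|webp)", image, re.IGNORECASE), \
        "image should look like an image URL"

    ingredients = scraper.ingredients()
    assert isinstance(ingredients, list) and 1 <= len(ingredients) <= 50, \
        "expected a plausible number of ingredients"
    assert all(isinstance(item, str) and item.strip() for item in ingredients), \
        "ingredients should be non-empty strings"

    total_time = scraper.total_time()
    assert isinstance(total_time, int) and 0 < total_time <= 24 * 60, \
        "total_time (minutes) should be a positive, plausible integer"

    yields = scraper.yields()
    assert isinstance(yields, str) and yields.strip(), "yields should be a non-empty string"
```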

weightwatchers-carlanderson commented 3 years ago

One area we could reliably catch is exceptions thrown when the code doesn't find an expected set of elements, based on CSS class names, indexing of children and so on. I'm seeing some scrapers failing in the online tests with this at the moment. Those failures are reliable and repeatable: the code is crashing because the HTML structure no longer matches the scraping logic.
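A sketch of that signal in isolation (the list of method names is an assumption for illustration): call each extraction method and record only whether it raised, ignoring what it returned.

```python
def probe_scraper(scraper, fields=("title", "total_time", "yields", "image",
                                   "ingredients", "instructions")):
    """Record which extraction methods crash, independent of content wording."""
    results = {}
    for field in fields:
        try:
            getattr(scraper, field)()
            results[field] = "ok"
        except Exception as exc:  # any crash is the health signal we care about
            results[field] = f"crashed: {type(exc).__name__}: {exc}"
    return results
```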

jayaddison commented 3 years ago

> Those failures are thus not repeatable or reliable; exact string matching like this makes for brittle tests and provides false signals

In many codebases, the tests can be considered near-immutable - they shouldn't change, unless developers change their mind about how the code should function. That's not true for recipe-scrapers, because the upstream content may change and the project has little influence over that. Breakages due to that aren't a reflection of brittle tests so much as they're an expected part of catching and adapting to modified content.

> The wayback idea sounds interesting. It provides some stability, and the no later than {timestamp} provides some tracking of changes in the source sites, but doesn't it just reflect a lagged reality?

Yep, it does - that's the nature of web scraping. All the code we write will encounter a lagged reality (website content as received at runtime), until there's consensus that the providers should offer their content to the public in a standard format and save energy for everyone.

> Perhaps we could check that it provides the right type of information: that servings returns an int 0--10, that the image URL looks like some JPG image URL, that we got a list of 10 ingredient strings, that we don't get empty strings, -1, or 'None's. This is more in line with a great-expectations view of the data.

I think your concern about false-positives is well-founded here.

bfcarpio commented 3 years ago

Wayback

If we do something like this, I don't think we should run this check more than once a month; I'd even say let's make it closer to every 2 or 3 months. We might start chasing errors for scrapers that no one uses and be burned out by the time we find one that's actually popular. I'd rather have a user bring a breakage up than get that period's report and realize there are yet another 30 broken scrapers.

Great Expectations

I'm interested in seeing how to best implement this. I think this type of testing is a good skill set to have.

If we implement this, is there a good way to explain what bounds to use? Every once in a while we have people who are new to Python / programming and want to contribute. They might not know the best way to write a test, let alone what a good fuzzy bound is. I'll admit, I enjoy seeing these users and I'm willing to take the time to explain some things (regardless of how good I am at it). Keeping this repository accessible to newcomers is a secondary goal I keep in mind for this repo's design.

I wonder if we could run a script (using github action?) that would produce a static artifact that showed system health.

I'm almost positive that you can use GitHub Actions to publish a GitHub Pages page. I know people used Travis to do this previously.

weightwatchers-carlanderson commented 3 years ago

I've been thinking about great-expectations-style checks. This feels like the right approach to me, because we should be able to configure the system to tolerate some changing content, without raising too many false alarms that something is broken when only content changes, while still having firm checks that we are getting meaningful information. And I think we may be able to configure this so that we specify a dynamic bound just once per extracted field (not for each field of each site's test, but once for title, once for total time, etc.).

In fixing up some of the sites that are truly broken due to CSS/DOM changes, as opposed to online tests that are failing due to content changes, I'm seeing changes such as:

E       - https://assets.bonappetit.com/photos/59e4d7dc3279981dd6c79847/16:9/w_1000,c_limit/pork-chops-with-celery-and-almond-salad.jpg
E       + https://assets.bonappetit.com/photos/59e4d7dc3279981dd6c79847/5:7/w_1936,h_2710,c_limit/pork-chops-with-celery-and-almond-salad.jpg
E       ?

i.e. the image method worked but the URL has changed, or I'm seeing some minor changes in the recipe title (in->an). How can we handle this slop?

For each method, we should be able to define some rule for tolerating this kind of change. This could be configured once, so beginners don't have to worry about it.
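A minimal sketch of what "one rule per extracted field" could look like; the thresholds and the handful of fields below are placeholders to illustrate the shape, not agreed values:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# One tolerance rule per extracted field, shared by every site's test.
FIELD_RULES = {
    # Titles may be lightly reworded; require rough similarity to the cached value.
    "title": lambda live, cached: SequenceMatcher(None, live, cached).ratio() >= 0.8,
    # Image URLs get re-rendered with new crops/sizes; only require the same host
    # and something that still looks like an image URL.
    "image": lambda live, cached: (
        urlparse(live).netloc == urlparse(cached).netloc
        and live.lower().endswith((".jpg", ".jpeg", ".png", ".webp"))
    ),
    # Total time rarely changes much; allow a small absolute/relative drift.
    "total_time": lambda live, cached: abs(live - cached) <= max(5, 0.2 * cached),
}

def field_ok(field, live_value, cached_value):
    """Fall back to exact comparison for fields without a fuzzy rule."""
    rule = FIELD_RULES.get(field)
    return rule(live_value, cached_value) if rule else live_value == cached_value
```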

jayaddison commented 3 years ago

@weightwatchers-carlanderson Introducing those kinds of checks within a separate GitHub Actions workflow (with a separate Python requirements file for great_expectations) could be useful to prove the concept. Would you feel like trying that for a single scraper to begin with?

weightwatchers-carlanderson commented 3 years ago

@jayaddison sure. I have a day job, so I'm not sure about timing, but I can think about it some more and create a proof of concept.

jayaddison commented 3 years ago

@weightwatchers-carlanderson Sounds great - and don't feel any time pressure - we're all volunteering on our own time as well, and glad to help out with anything you're doing when we can.

jayaddison commented 1 year ago

@weightwatchers-carlanderson I've sketched out a possible basic implementation for this in #669 - it solves some of the reliability questions by using the Internet Archive's Wayback Machine as a datasource, and as a result should be repeatable and actionable (essentially: it's running the same old unit tests we already have, except with input data sourced from a different point-in-time).

Because a ton of scrapers would break if we applied this universally (meaning: lots of difficult and tedious work to fix, not practical in one changeset), it's opt-in to begin with, and 'temporal testing' can be enabled incrementally.