hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Data driven tests refactor #944

Closed strangetom closed 9 months ago

strangetom commented 9 months ago

This PR implements the data driven testing architecture proposed by @hay-kot in #807. It's based on @hay-kot's proof of concept PR in #817.

Making this change should help keep scraper quality more consistent across the library by requiring more rigour in testing, while (hopefully) not making it more difficult to develop a scraper and its associated test.

Overview

The major change this introduces is to remove the ScraperTest subclass created to test each scraper and instead define a .testjson file which contains the expected output from each function of the scraper.

The files needed to test each scraper are located in the test directory:

/tests
    __init__.py
    /test_data
        /<domain>
            - test_1.testhtml
            - test_1.testjson
        /<domain>
            - domain_test.testhtml
            - domain_test.testjson

where <domain> is the domain for the scraper. This has to be the same as the output from Scraper.host() because this is how the tests determine which scraper to use. Within each <domain> directory are pairs of .testhtml and .testjson files. The two files in each pair must have the same name, but the name can be anything. Each <domain> directory can have as many pairs of .testhtml and .testjson files as we like.

unittest is still used as the testing framework. To run all the tests:

python -m unittest

To run a single test:

python -m unittest -k <file_name>

where <file_name> is the name of the test file in the domain directory, without any file extension.

This works by defining a load_tests function in tests/__init__.py. unittest picks this up automatically and uses it to load the tests. In this load_tests function, a testing class definition is dynamically built from all the test files found in the test_data directory. A major difference from the previous approach is that we only have one test per test file, which checks all the functions, instead of one test for each function of each scraper.
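
A rough sketch of how this hook might look is below. This is illustrative only, not the exact code in the PR: scraper_for_domain is a placeholder for looking the scraper class up from the directory name and instantiating it from the .testhtml contents.

import json
import unittest
from pathlib import Path

TEST_DATA = Path(__file__).parent / "test_data"


def scraper_for_domain(domain, html):
    # Placeholder: resolve `domain` to a scraper class and build it from `html`.
    raise NotImplementedError


def make_test(html_path, json_path):
    def test_func(self):
        expect = json.loads(json_path.read_text(encoding="utf-8"))
        scraper = scraper_for_domain(json_path.parent.name, html_path.read_text(encoding="utf-8"))
        # One test per file: every key in the .testjson is checked here.
        for key, value in expect.items():
            self.assertEqual(value, getattr(scraper, key)(), msg=f".{key}() mismatch")

    return test_func


def load_tests(loader, tests, pattern):
    # unittest calls this hook automatically and uses whatever suite it returns.
    methods = {}
    for json_path in sorted(TEST_DATA.glob("*/*.testjson")):
        name = f"test_{json_path.parent.name}_{json_path.stem}".replace(".", "_").replace("-", "_")
        methods[name] = make_test(json_path.with_suffix(".testhtml"), json_path)
    RecipeTestCase = type("RecipeTestCase", (unittest.TestCase,), methods)
    tests.addTests(loader.loadTestsFromTestCase(RecipeTestCase))
    return tests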

The tests for the library/utility functions are unchanged and are included when all the tests are run.

.testjson

The .testjson file is a JSON file that defines the expected output when the scraper is run on the associated .testhtml file. The JSON keys are split into two types, which are tested in slightly different ways.

Mandatory tests

Mandatory tests cover the mandatory functions listed in docs/in-depth-guide-scraper-functions.md. These are the functions every scraper is expected to implement. The mandatory tests are:

MANDATORY_TESTS = [
    "author",
    "canonical_url",
    "host",
    "description",
    "image",
    "ingredients",
    "ingredient_groups",
    "instructions",
    "instructions_list",
    "language",
    "site_name",
    "title",
    "total_time",
    "yields",
]

If the key is not present in the .testjson file, we check that an Exception is raised if the function is called.
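
Concretely, the mandatory check amounts to something like the sketch below, where scraper is the instantiated scraper and expect is the parsed .testjson data (names assumed for illustration):

for key in MANDATORY_TESTS:
    scraper_func = getattr(scraper, key)
    if key in expect:
        # Key present: the function's output must match the expected value.
        self.assertEqual(expect[key], scraper_func())
    else:
        # Key absent: calling the function is expected to raise.
        with self.assertRaises(Exception):
            scraper_func()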

Optional tests

Optional tests are for the optional functions listed in docs/in-depth-guide-scraper-functions.md. If the key is present in the .testjson, the output of the function is checked. If the key is not present, we just skip over it. The optional tests are:

OPTIONAL_TESTS = [
    "category",
    "cook_time",
    "cuisine",
    "nutrients",
    "prep_time",
    "ratings",
    "reviews",
]
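
The optional check is the same comparison minus the exception branch; again a sketch using the same assumed scraper and expect names:

for key in OPTIONAL_TESTS:
    if key not in expect:
        continue  # optional function not covered by this test file
    self.assertEqual(expect[key], getattr(scraper, key)())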

generate.py

I've updated generate.py so that when a new scraper is created, the testhtml is saved to the appropriate directory in tests/test_data and a default .testjson is generated. The default .testjson contains the keys for the mandatory tests. The host value is populated automatically, but the other values are empty strings.
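
As a sketch of what the generated default could look like (the write_default_testjson helper and the file naming below are illustrative, not the exact generate.py code):

import json
from pathlib import Path

MANDATORY_TESTS = [  # as listed above
    "author", "canonical_url", "host", "description", "image",
    "ingredients", "ingredient_groups", "instructions", "instructions_list",
    "language", "site_name", "title", "total_time", "yields",
]


def write_default_testjson(domain: str, name: str) -> None:
    # Every mandatory key is present but empty; host is the only pre-filled value.
    defaults = {key: "" for key in MANDATORY_TESTS}
    defaults["host"] = domain
    path = Path("tests/test_data") / domain / f"{name}.testjson"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(defaults, indent=4), encoding="utf-8")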

Requested feedback

This PR is largely complete; however, there are a few areas where I would like some help.

  1. There are a few scrapers whose tests haven't been migrated to this data driven architecture:

    • gousto
    • kptncook
    • marleyspoon
    • woolworths

    The reason is that these scrapers use an additional network request to obtain the recipe information instead of extracting it from the recipe page HTML. I'm not sure how to fit them into this. My only thought so far is to keep them in their current form and put them in a tests/legacy folder.

  2. If you run all the tests in this PR, you should see 2 failures and 1 error (and 375 successes!). The error is due to a duplicate scraper that I think we can remove. One failure is due to the USAPears scraper returning NaN from .ratings(); the other is due to a recipe having inconsistent ingredients in the HTML markup and JSON-LD.

  3. The test_multiple_instructions test that was part of the ScraperTest base class is not implemented. This is because this test was intended to be overridden in the subclasses, so it could be skipped if necessary, but there isn't a way to do that here.

  4. I'd like to recommend we run a json auto-formatter on the .testjson files as part of the pre-commit hooks. We might need to change the file extension to .json for that to work.

strangetom commented 9 months ago

#945 has fixes for two of the failing tests.

I can't fix the issue with test_tasteau_2 because the website appears to be geo-restricted. I think the fix required here is to replace the testhtml for this test with a recipe that has consistent HTML and JSON-LD.

strangetom commented 9 months ago

All the test failures are resolved now that #945 has been merged. The taste.com.au test that was causing issues has also been replaced - it turns out that wasn't a geo-restriction issue I was having.

jknndy commented 9 months ago

Wow! My initial impression is that this is much cleaner to navigate. It's nice to have things neatly sorted near each other, and I can't believe how many fewer tests there are with this method (which are all passing now, btw).

I'm not too familiar with JSON autoformatters, but after a little research Prettier seems to be one of the more popular choices, and conveniently it can be run against non-standard-named JSON files.

To run locally I installed prettier, created a .prettierrc file in my base directory containing....

{
  "overrides": [
    {
      "files": ["*.testjson"],
      "options": {
        "parser": "json"
      }
    }
  ]
}

and then ran

npx prettier --config .\.prettierrc --write './tests/test_data/*/*.testjson'

Doesn't seem to have changed anything in the files on my run but it's good info regardless!

jayaddison commented 9 months ago

The reason is that these scrapers use an additional network request to obtain the recipe information instead of extracting it from the recipe page HTML. I'm not sure how to fit them into this. My only thought so far is to keep them in their current form and put them in a tests/legacy folder.

That's a good idea. They're currently removed from the v15 branch for similar reasons. I guess we won't know how much demand there is for that functionality until we remove it, but if we can separate the two decision points (1: do we support multi-network-requests, 2: switch to JSON-based tests) then I think it'll all be a lot easier to discuss and modify/revert each of those choices individually, if needed.

In other words: one thing at a time, so yep - keeping those scrapers (and their tests) around would be nice for now, in my opinion. I don't think we're in a hurry.

If you run all the tests in this PR, you should see 2 failures and 1 error (and 375 successes!). The error is due to a duplicate scraper that I think we can remove. One failure is due to the USAPears scraper returning NaN from .ratings(); the other is due to a recipe having inconsistent ingredients in the HTML markup and JSON-LD.

:+1: - resolved by your changes in #945, I think? :tada:

The test_multiple_instructions test that was part of the ScraperTest base class is not implemented. This is because this test was intended to be overridden in the subclasses, so it could be skipped if necessary, but there isn't a way to do that here.

I feel some small amount of attachment to that test because I think it's a useful sanity check, but it could be reintroduced in some other form another time.

I have a long-term thought that perhaps instructions should natively be a list (#575) -- and if so, it'd be fairly obvious to humans reading the testjson contents whether instructions was multi-item or not.

So... no strong feelings either way here, only some thoughts.

I'd like to recommend we run a json auto-formatter on the .testjson files as part of the pre-commit hooks. We might need to change the file extension to .json for that to work.

Agreed, good idea - and yep, renaming to .json seems worthwhile too - perhaps a few other code editors and tools would autodetect the file format as a result, and that could be convenient. I'm also +1 on @jknndy's suggestion of using prettier as a utility to do the lint checking.

strangetom commented 9 months ago

In other words: one thing at a time, so yep - keeping those scrapers (and their tests) around would be nice for now, in my opinion. I don't think we're in a hurry.

I've restored those tests in the tests/legacy folder, so they now run whenever python -m unittest is run.

I feel some small amount of attachment to that test because I think it's a useful sanity check, but it could be reintroduced in some other form another time.

I have a long-term thought that perhaps instructions should natively be a list (#575) -- and if so, it'd be fairly obvious to humans reading the testjson contents whether instructions was multi-item or not.

So... no strong feelings either way here, only some thoughts.

I wasn't too keen to remove that test, given how recently it was added. It might be possible to restore it, or something similar, in this data-driven approach, but I'm not sure how yet. I agree with #575 - instructions as a list by default makes sense to me because that is how I want to use them most of the time.

Agreed, good idea - and yep, renaming to .json seems worthwhile too - perhaps a few other code editors and tools would autodetect the file format as a result, and that could be convenient. I'm also +1 on @jknndy's suggestion of using prettier as a utility to do the lint checking.

It turns out that pre-commit comes with a pretty-format-json hook out of the box, so I've updated the .pre-commit-config.yaml to make use of it.

strangetom commented 9 months ago

I can't give a very accurate estimate, but I think I did the bulk of it in 4-5 hours over a weekend and then probably a couple of hours over some evenings. I did also waste a bunch of time thinking I had an alternative to @hay-kot's approach only to later find it wouldn't work.

jayaddison commented 9 months ago

Very nice work, thank you @strangetom!

jayaddison commented 7 months ago

@strangetom something I'm wondering about at the moment: if and when an assertion fails, test_func exits -- and test_func is called once per scraper.

When all tests for that scraper pass, or there's exactly one assertion failure for it, that's no problem - but if multiple failures occur for it, then the developer/tester is only provided with an error message about the first failure encountered (and may have to re-run the test after fixes/edits are applied).

If that matches your understanding too (I could be mistaken, but I think this is the behaviour at the moment), could we fix that by adapting test_func into a generator function, or something similar (yield a function for each field assertion, for example)?

strangetom commented 7 months ago

Yes, you're right, and I agree it's not very helpful behaviour.

I think there's a simpler fix:

        ...
        # Mandatory tests
        # If the key isn't present, check an exception is raised
        for key in MANDATORY_TESTS:
            with self.subTest(key): # <---- Add this context manager
                scraper_func = getattr(actual, key)
                if key in expect.keys():
                    self.assertEqual(
                        expect[key],
                        scraper_func(),
                        msg=f"The actual value for .{key}() did not match the expected value.",
                    )
                    ...

This will then output something along the lines of

FAIL: tests/test_data/pickuplimes.com/pickuplimes.json (tests.RecipeTestCase.tests/test_data/pickuplimes.com/pickuplimes.json) [instructions_list]

for each failed assertion.

I'll open a PR later today.