hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

Add Ploetzblog + configuration for unit tests #1100

Open mlduff opened 2 months ago

mlduff commented 2 months ago

This PR adds support for Ploetzblog. There was an issue when adding this scraper (as detailed here, with discussion following). This was that the assumption that ingredient groups always add up to form the ingredients list is incorrect, as it doesn't account for differing amounts of the same ingredient.

I have basically gone for the "exempt this scraper from that particular test". To do this, I could of converted it to a legacy test and implemented all the tests myself, however I figured that allowing for configuration in non-legacy unit tests could be beneficial, so I added an _options entry to the unit case which optionally specifies certain overrides for the test executor. This is done in such a way that someone would have to deliberately disable this test.

Resolves https://github.com/hhursev/recipe-scrapers/issues/1062

jayaddison commented 2 months ago

I'm in two minds about this; I agree that selectively bypassing the ingredient-consistency test could make sense for some scraper tests, and I think that the use of a key named _options in the JSON is a neat way to place that somewhere that won't conflict with the test fields. However: I'm not so sure about whether we want to support per-scraper-test options in general -- and I don't want us to add generic support for that if it makes it more likely to happen. It's an anti-pattern, basically: the JSON files are intended to be test data that we compare against, and to add configuration/conditional flags to that seems non-ideal.

mlduff commented 2 months ago

@jayaddison Those are some good points. Could another option be to have a completely separate file in each scraper folder dedicated to configuration of the test cases? In this way the configuration would apply across all tests for that scraper (if there are multiple JSON/HTML files) and it would be separated from the data. Or is this still an anti-pattern?

jayaddison commented 2 months ago

Those are all possibilities, yep! As mentioned in the issue thread, the natural way to handle this would be by marking/overriding the relevant test case and skipping it in Python code. Re-implementing that for our custom data format -- whether in an option, a config file, or elsewhere -- doesn't seem great, because it could be the start of a path down which we end up wanting to re-implement other existing features of unittest / pytest / etc.

mlduff commented 2 months ago

@jayaddison They are good points again - if we wanted to avoid re-inventing unittest, then I think the only way would be to take Ploetzblog out of the data driven approach and convert to a legacy test (I could perhaps still have it load from the custom data format?). My preference would still be to go down the configuration path, as it allows classes to leverage the ease-of-use that the custom data format brings, but I am happy to do whatever you wish.

jayaddison commented 2 months ago

Thinking about it. For the scraper tests, there are currently two categories:

What you're suggesting might be a third category - it's a test case for a scraper where for some reason the data-driven tests aren't applicable.

We probably wouldn't need to add those very often, and you could copy it more-or-less from the existing legacy tests (you probably wouldn't need the expected_requests though).

If we do that, perhaps we should try to find a better name than 'legacy tests' for the tests-that-use-multiple-HTTP-requests (and a better name than that, too).

mlduff commented 2 months ago

@jayaddison Is there some kind of way to have a base class, which represents all the data-driven test cases. This could then be inherited and overridden by the test class (Ploetzblog) with custom tests. My goal is to have as little duplication required as possible when there are minor deviations from the expected format.

This isn't something I know how to do off the top of my head, but if you're happy with that approach I could do some more research into the specifics. Otherwise, I would be happy to just create another legacy test (where I would override the test_consistent_ingredients_lists).

jayaddison commented 2 months ago

@mlduff I think that's possible, yep, but I have to check if I understand the idea: is what you're suggesting somehow similar to the way that we dynamically create a scraper class for some websites when the wild_mode flag is enabled?

jayaddison commented 2 months ago

Also, something to bear in mind: a benefit of data-driven tests is that it's possible for people to read the .json files and to check -- without having to know code for any specific scraper -- whether the data matches the corresponding test HTML file. Making it possible to override that on a per-scraper basis would begin to make it harder to reason about what the .json represents -- because the person would first need to check for existence of, and if necessary read, the override behaviours. I don't think that'd be a good process or system design.

mlduff commented 2 months ago

@jayaddison I have refactored a lot of the data driven test code into separate functions inside a newly created file data_utils.py. I then use these functions to define my custom test, under the new data_driven folder (more than happy to think of new names).

This allows for the flexibility of creating custom tests where the assumptions in the generic test are wrong, whilst still encouraging the test case to be written predominantly in JSON (greatly increasing how easy it is to understand, as you mentioned).

jayaddison commented 1 month ago

@mlduff I'd like to try to keep things as simple as possible for our developers and code reviewers, and after weighing up some of the options you've presented.. it feels like cleanest approach -- if it's possible! -- would be to parse the quantities when testing each scraper and comparing ingredients to ingredient_groups.

I've added a comment about that to #1062. I'll try to find some time for a sample implementation of it this week, but don't yet know whether I'll get around to that.

mlduff commented 1 month ago

@jayaddison If you're happy for me to I would be happy to have a go implementing it - if you'd rather me leave it for you that is fine too, just let me know.

mlduff commented 1 month ago

Worth pointing out there is some discussion here on the topic @jayaddison: https://github.com/hhursev/recipe-scrapers/pull/733