hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.6k stars 505 forks source link

Cleanup: Reorders all test json files to have same order #1124

Closed jknndy closed 4 weeks ago

jknndy commented 1 month ago

Opened as a result of recently opened & merged #1123

Reorders all test json files to fit the following order, completely automated so open to suggestions in order!

Will update the json auto creation and any docs (if applicable) once an order is finalized

KEYS = [
    "author",
    "canonical_url",
    "site_name",
    "host",
    "language",
    "title",
    "ingredients",
    "ingredient_groups",
    "instructions",
    "instructions_list",
    "category",
    "yields",
    "description",
    "total_time",
    "cook_time",
    "prep_time",
    "cuisine",
    "cooking_method",
    "ratings",
    "ratings_count",
    "equipment",
    "reviews",
    "nutrients",
    "image",
    "keywords",
]

cc. @mlduff related to past discussion

strangetom commented 1 month ago

Hi @jknndy

What is the thinking behind the order you've proposed? Is it ordered by keys for mandatory tests, then keys for optional tests?

Is this something we can use a pre-commit hook to enforce? The current hook we have for formatting json has --no-sort-keys set, but nothing to enforce a particular order.

jayaddison commented 1 month ago

As a code reviewer, I'd find reliable ordering of the JSON files useful; there tend to be certain fields that I check more rigorously than others.

As a developer, it would be nice if the templated JSON produced by generate.py used the same key ordering, so that I wouldn't have to know/remember to reformat the JSON file after entering the expected test results.

jknndy commented 1 month ago

What is the thinking behind the order you've proposed? Is it ordered by keys for mandatory tests, then keys for optional tests?

No particular order, I structured in a way that seemed like it'd be most consumable. Grouping related outputs & moving things that return the most relevant info towards the start.

Is this something we can use a pre-commit hook to enforce? The current hook we have for formatting json has --no-sort-keys set, but nothing to enforce a particular order.

I would think so, I didn't test anything here but can't see why it couldn't be implemented

As a code reviewer, I'd find reliable ordering of the JSON files useful; there tend to be certain fields that I check more rigorously than others.

Exactly my thought, was getting harder & harder to quickly spot certain fields as previously missed coverages were appended to the bottom.

As a developer, it would be nice if the templated JSON produced by generate.py used the same key ordering, so that I wouldn't have to know/remember to reformat the JSON file after entering the expected test results.

I am planning to correct this once an order is finalized

jayaddison commented 1 month ago

@jknndy I found it difficult to write down a recommended ordering for the field list until first categorizing the fields into a few groups.

I also found it difficult to determine whether fields are mandatory in the sense of scraper-should-implement-them as compared to my understanding of a legal definition of a recipe (which doesn't include images or yields for example -- but in practice I think these are valuable for many users).

Here's what I've got so far:

Implementation Details

  * host (required)

Recipe Publication Details

  * canonical_url (required)
  * site_name (required)
  * author (required)
  * language (required)

Recipe Contents

  * title (required)
  * ingredients (required)
    * ingredient_groups
  * instructions (required)
  * total_time (strongly-recommended)
    * prep_time
    * cook_time
  * yields (strongly-recommended)

Recipe Metadata (all optional)
  * image (strongly-recommended)
  * description
  * nutrients
  * cuisine
  * category
  * keywords
  * equipment
  * cooking_method
  * ratings
    * ratings_count

Nested fields indicate that the top-level field is required when the nested field is implemented.

I've omitted ingredients_list because I think of it as an implementation quirk/detail.

jknndy commented 1 month ago

This looks good to me, I forgot to include dietary_restriction in my original list so I'll add it into yours above keywords.

Are we good to move forward with this list?

jayaddison commented 1 month ago

No objections from me!

jknndy commented 1 month ago

@jayaddison / @strangetom , thoughts on pre-populating all possible fields in the json files now and letting the developer remove the coverages that aren't available?

strangetom commented 1 month ago

I'd be in favour of that. It makes in clearer what the possible fields are and if the developer accidentally left a field in that the scraper didn't implement, there would be an exception when the tests are run.

jayaddison commented 1 month ago

Mostly-agree; one possible counter-argument: it could confuse some developers by making them think that they have to enter data and then implement all of the json fields. Would we have a way to make it clear that some can be removed?

strangetom commented 1 month ago

An option would be to catch the exception raised if a scraper doesn't implement one of the json keys and output a message saying that the entry should be removed from the testjson if the scraper doesn't implement that function.

for key in OPTIONAL_TESTS:
    with self.subTest(key):
        scraper_func = getattr(actual, key)
        if key in expect.keys():
            try:
                self.assertEqual(
                    expect[key],
                    scraper_func(),
                    msg=f"The actual value for .{key}() did not match the expected value.",
                )
            except Exception as e:
                print(f"If the scraper does not implement .{key}(), then remove the entry from the .testjson file.")
                raise e
jayaddison commented 3 weeks ago

Do we have a way to check and/or auto-reorder newly added JSON test data? (it's OK if not to begin with, although I expect we might want those at some point if we don't already)