hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Add README inclusion test #1105

Closed - mlduff closed this 3 weeks ago

mlduff commented 2 months ago

This PR adds a new test that checks the available-scrapers list in README.rst against all of the supported hosts in SCRAPERS. I think this is useful because I often forget to add the README entry when creating a scraper, and this will force developers to add their new scrapers to the list. Currently the comparison ignores a leading www. prefix.

After creating the test I noticed a number of scrapers which are not in the list, so I went through and added all of these so that the test would pass.

If people are happy with this PR, I will extend the test to do the comparison the other way as well (checking that every listed entry has an associated scraper, so that if scrapers are removed they are also removed from the list).
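
A minimal sketch of the check described above, assuming SCRAPERS is the host-to-scraper-class mapping in recipe_scrapers and that each supported host appears verbatim somewhere in README.rst; the test class and normalisation helper here are illustrative, not the PR's actual code:

import unittest
from pathlib import Path

from recipe_scrapers import SCRAPERS


class TestReadmeIncludesScrapers(unittest.TestCase):
    def test_all_hosts_listed(self):
        readme = Path("README.rst").read_text(encoding="utf-8")

        def normalise(host):
            # Compare without a leading "www." so both spellings count as listed.
            return host[4:] if host.startswith("www.") else host

        missing = [host for host in SCRAPERS if normalise(host) not in readme]
        # assertFalse with a message produces the "... is not false : ..." output
        # quoted later in this thread.
        self.assertFalse(missing, f"Missing scrapers in README.rst: {missing}")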

jknndy commented 2 months ago

Nice feature! Any ideas on how we could cover multiple top level domains in one listing? (Hellofresh/others)

It may be best to leave it this way but it just feels a little redundant. Thoughts?

mlduff commented 2 months ago

@jknndy I've accepted your feedback and made a few changes - thanks heaps for the review. Is there a way to make a completely custom error message? To me this seems a little confusing (mainly the "... is not false" part):

AssertionError: ['davidlebovitz.com'] is not false : Missing scrapers in README.rst: ['davidlebovitz.com']

Would you also be happy for me to extend this test to enforce alphabetical order, and also ensure that no scrapers are included in the list that shouldn't be?
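
For what it's worth, two standard unittest options avoid the confusing "... is not false" prefix: self.fail(msg) reports only the custom text, and self.assertEqual(missing, [], msg) reports a list comparison plus the message. A sketch of the first option, with a made-up test name and an example value taken from the failure quoted above:

import unittest


class ReadmeListTest(unittest.TestCase):
    def test_missing_scrapers(self):
        missing = ["davidlebovitz.com"]  # example value from the failure above
        if missing:
            # Only the custom text appears in the failure output.
            self.fail(f"Missing scrapers in README.rst: {missing}")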

mlduff commented 2 months ago

Nice feature! Any ideas on how we could cover multiple top level domains in one listing? (Hellofresh/others)

It may be best to leave it this way but it just feels a little redundant. Thoughts?

To me it makes the most sense to leave it as is, because it most accurately describes which domains are going to work without wild_mode=True. I'd be interested to hear the alternatives (would you include http://hellofresh.*/ in the README instead?).

mlduff commented 2 months ago

@jknndy I was also wondering what you think about moving the list into its own file, rather than having everything sitting in the initial README. Would just having a "Click here to see the list of available scrapers" link work better? At the moment you have to scroll past the entire list to see the bottom part of the README.

jayaddison commented 2 months ago

It could be worth validating both the name part and also the hyperlink part of each entry; sometimes they can go out of sync during editing (for example: https://github.com/hhursev/recipe-scrapers/pull/1104/commits/03ff4f32dcaac087172722b7352f32b7a1dc05df)

mlduff commented 2 months ago

It could be worth validating both the name part and also the hyperlink part of each entry; sometimes they can go out of sync during editing (for example: 03ff4f3)

@jayaddison Yep, that's a good idea - could this even be done as a separate test case, responsible for checking that the domains match on each line?

jayaddison commented 2 months ago

It could be worth validating both the name part and also the hyperlink part of each entry; sometimes they can go out of sync during editing (for example: 03ff4f3)

@jayaddison Yep, that's a good idea - could this even be done as a separate test case, responsible for checking that the domains match on each line?

Two assertions within a single test case, I think - one that checks that the hyperlink name is accurate, and one that checks that the href is correct.
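
As a rough illustration of those two assertions, assuming each README entry is a reStructuredText link of the form `name <url>`_ (the actual markup in README.rst may differ, and the example entry is hypothetical):

import re
from urllib.parse import urlparse

# Hypothetical README entry; the real markup may differ.
entry = "`thewoksoflife.com <https://thewoksoflife.com/>`_"

match = re.match(r"`(?P<name>[^<]+?)\s*<(?P<href>[^>]+)>`_", entry)
name = match.group("name")
host = urlparse(match.group("href")).hostname
host = host[4:] if host.startswith("www.") else host

# Assertion 1: the visible link text matches the linked host.
assert name == host, f"Link text {name!r} does not match host {host!r}"
# Assertion 2 would compare the href's host against the scraper's host() value.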

jknndy commented 2 months ago

@jknndy I was also wondering what you think about moving the list into its own file, rather than having everything sitting in the initial README. Would just having a "Click here to see the list of available scrapers" link work better?

While I agree there could be benefits to moving the coverage list into its own file, I think the extra step adds some accessibility loss, and someone quickly skimming the README could miss it. On the other hand, this would allow us some more flexibility in the testing / auto-generation of README entries. Check out #820 for some old thoughts on this (most likely outdated by now).

At the moment you have to scroll past the entire list to see the bottom part of the README.

What do you think about adding a hyperlink above the scraper list that says something like "contribution & development information" and jumps past the list to the contribution header? I'm not sold on either idea being an improvement, but would like to hear what you / others think.

mlduff commented 2 months ago

While I agree there could be benefits to moving the coverage list into its own file, I think the extra step adds some accessibility loss, and someone quickly skimming the README could miss it

@jknndy I personally feel it wouldn't be unreasonable to expect someone to notice a link placed near the top, under a level 1 heading saying "Supported Websites", but I get what you are saying (it is certainly quite hard to miss the massive list 😂). I also understand that I am new here, so I'm more than happy to go with your preference.

What do you think about adding a hyperlink above the scraper list that says something like "contribution & development information" and jumps past the list to the contribution header?

I think this could work, though to be honest it's probably not required - I'd be interested to hear some other opinions. Would it impact accessibility to just move this information above the list?

mlduff commented 2 months ago

@jayaddison Just saw your previous message - that is a good idea; disregard my idea about the sub-level domains. To confirm, would you rather I proceed with this than with the verbose list-everything approach I mentioned?

mlduff commented 2 months ago

@jayaddison @jknndy I'm working through adding the new format now - should each of the sub-TLDs (.com.au, .co.uk, .at, etc.) be a hyperlink to its respective site? So clicking on .com.au would take you to https://www.hellofresh.com.au/

mlduff commented 2 months ago

Also, to keep you both updated - the version I am working on will enforce alphabetical order based on the full domain name (as determined by Python's sort/sorted), as I think this will greatly improve the readability of the list.
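
The ordering check itself can be very small - a sketch, assuming the domain names have already been extracted from the README in document order (the values below are illustrative):

readme_domains = ["allrecipes.com", "bbcgoodfood.com", "cookpad.com"]  # illustrative values

# sorted() gives the plain lexicographic order mentioned above.
assert readme_domains == sorted(readme_domains), "README scraper list is not in alphabetical order"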

jknndy commented 2 months ago

should each of the sub-TLDs (.com.au, .co.uk, .at, etc.) be a hyperlink to its respective site? So clicking on .com.au would take you to https://www.hellofresh.com.au/

Yep! That's exactly what I had in mind.

I agree with your comment about enforcing an alphabetical list as well

mlduff commented 2 months ago

I've pushed what I have so far, but I have a bit of work to do on it still, so other than the general approach please don't worry about the specifics too much at the moment. I hope to have it in a reviewable state tomorrow sometime.

mlduff commented 2 months ago

@jknndy I'm a bit stuck with PracticalSelfReliance - its host definition takes an entire domain (rather than just a suffix), so my logic falls over because I can't tell that creativecanning.com and practicalselfreliance.com are both full domains. Do you have any suggestions on how to proceed here?

I was thinking of comparing the first section (after splitting on .): if they are different, then I know they are both full domain names in their own right. This however would fall over with something like recipes.creativecanning.com and recipes.practicalselfreliance.com. Would it be possible to change the host() signature to specify sub-level/full domain vs just top-level domain? Or do you have any other ideas?

I'm really keen to proceed with getting this test in, as I have already uncovered numerous instances of no-longer supported sites being included in the README, as well as instances of sites being out of order or being omitted completely.

mlduff commented 2 months ago

@jknndy My most recent commit has what I think is a good solution - checking the output of host when "" is passed as the argument. If "" is returned, then we know the whole domain name is defined by the domain argument. It isn't perfect, as it won't catch cases where the returned values are a mix of TLD variations and complete domain changes, but I think it is good enough? My test also won't handle cases where the country/language code precedes the domain, as I think that would just become too complicated (and I think it is best to have everything in alphabetical order regardless).
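
A sketch of that heuristic, assuming each scraper class exposes host() accepting a domain keyword argument; the helper name is made up:

def whole_domain_in_argument(scraper_cls):
    # If host("") collapses to an empty string, the domain argument supplies
    # the entire domain name rather than just a TLD suffix, so each configured
    # domain is treated as a primary entry in its own right.
    return scraper_cls.host(domain="") == ""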

mlduff commented 2 months ago

@jknndy @jayaddison This should be ready for another review now. I've changed the README to support the new format, and have tried to have helpful error messages.

I had to make a change to the way host() works: you either pass a different top-level domain, or pass an entire string if the changes to the host are more than just a different top-level domain. I was thinking that if you aren't happy with this, we could have a different approach where host can take the following keyword arguments:

jayaddison commented 2 months ago

I had to make a change to the way host() works: you either pass a different top-level domain, or pass an entire string if the changes to the host are more than just a different top-level domain. I was thinking that if you aren't happy with this, we could have a different approach where host can take the following keyword arguments:

Mmm, I'm not too keen on this alternative; I'll add a comment explaining some of the reasons for that soon. I think it'd be possible to use an alternative approach where we scan through SCRAPERS, group by host(), and extract common prefixes -- but I'd also like to check that that's feasible, so don't take my word for it (perhaps I've not anticipated some obstacle related to that).
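
A rough sketch of that grouping idea, assuming SCRAPERS maps host strings to scraper classes; os.path.commonprefix works on any list of strings, not just filesystem paths:

from collections import defaultdict
from os.path import commonprefix

from recipe_scrapers import SCRAPERS

# Group every registered host by its scraper class.
hosts_by_scraper = defaultdict(list)
for host, scraper_cls in SCRAPERS.items():
    hosts_by_scraper[scraper_cls].append(host)

# For each class the shared prefix (e.g. "hellofresh.") identifies the primary
# entry, and the remainders ("com", "co.uk", ...) are the TLD variants.
# Single-host scrapers and hosts with no shared prefix would need special-casing.
for scraper_cls, hosts in hosts_by_scraper.items():
    shared = commonprefix(hosts)
    variants = [host[len(shared):] for host in hosts]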

mlduff commented 2 months ago

@jayaddison I figured out a way to allow subdomains to be configured through host() without having to set the whole domain each time - I think that covers all the cases now.

If you are happy with the current format I'll add some documentation?

jayaddison commented 2 months ago

I think the format of the README presented here is good, yep :+1: - thanks @mlduff

Additional documentation would be useful, yep.

Even though this functionality currently all exists within the test suite, I'm beginning to think of it as get_scraper_index -- an indexed directory of scrapers and their supported domain names.

I'm surprised at the number of tricky scenarios to handle when attempting an alternative implementation. Even so, I might suggest some refactoring -- either as a pull request on your repo, or possibly after we merge this.

jayaddison commented 2 months ago

@mlduff could you add a test to check that the total count of primary domains in the index matches the total number of scrapers? (I think that should be true?)

mlduff commented 2 months ago

@mlduff could you add a test to check that the total count of primary domains in the index matches the total number of scrapers? (I think that should be true?)

@jayaddison If by the number of scrapers you mean the number of unique scraper classes included in the list, I don't think so. The reason is that for scraper classes with multiple associated domains, where the variance is more than the top-level domain, each domain is treated as primary (as it would be confusing to start grouping on things other than TLDs). An example of this is PracticalSelfReliance:

PracticalSelfReliance.host(): PracticalSelfReliance,
PracticalSelfReliance.host(domain="creativecanning.com"): PracticalSelfReliance,

The default domain is practicalselfreliance.com, meaning you end up with two totally different domains. The same applies to KptnCook, where the subdomains vary (not the TLDs), so both are treated as primary.

mlduff commented 2 months ago

Even though this functionality currently all exists within the test suite, I'm beginning to think of it as get_scraper_index -- an indexed directory of scrapers and their supported domain names.

@jayaddison So would this be a function available to import from recipe_scrapers? And would the output be the same as what get_supported_scrapers() currently outputs in the test? If so, I'd be happy to make this refactor.

I was also planning some changes to improve the error messaging, to help users tell whether a domain name is spelt wrong or just out of order. I will write up the documentation too - I was thinking of adding a dedicated section between "Create the test" and "Open a pull request" in how-to-develop-scraper.md, called "Update the README", with in-depth instructions on how to do so.

jayaddison commented 1 month ago

@mlduff could you add a test to check that the total count of primary domains in the index matches the total number of scrapers? (I think that should be true?)

@jayaddison If by the number of scrapers you mean the number of unique scraper classes included in the list, I don't think so. The reason is that for scraper classes with multiple associated domains, where the variance is more than the top-level domain, each domain is treated as primary (as it would be confusing to start grouping on things other than TLDs). An example of this is PracticalSelfReliance:

Agreed, there were some obstacles to this idea in the existing host mapping. I made some adjustments in #1118 - does that unblock the idea, or is it still impractical?

Even though this functionality currently all exists within the test suite, I'm beginning to think of it as get_scraper_index -- an indexed directory of scrapers and their supported domain names.

@jayaddison So would this be a function available to import from recipe_scrapers? And would the output be the same as what get_supported_scrapers() currently outputs in the test? If so, I'd be happy to make this refactor.

Maybe in future, yep - let's begin with it as an internal utility function until we're confident that it works correctly, and also while we think (or wait for requests) about what kind of functionality it would provide to users. At the moment it'd be useful for checking the README scraper list, but I'm not sure what else yet.

The reason to be a bit cautious is that things we add to the public interface of the library can be difficult to adjust or remove - and if we don't know that it's going to be used, then let's reduce the work for ourselves.

jayaddison commented 1 month ago

@mlduff I've pushed a commit here with a reimplementation / refactor that I attempted locally. The main item I'd draw attention to is the common-longest-prefix code.

Feel free to revert the commit; it's partly a way for me to demonstrate some alternative implementation ideas I'd had. I'm not really sure it is more concise, so my commit message might have been a bit misleading.

mlduff commented 1 month ago

@jayaddison I like your changes on the whole, although I'm a bit confused about a few aspects:

What exactly do we want the return value of get_scraper_index to be? From what I have seen, your implementation currently returns a dictionary where the keys are the primary domains and the values are tuples of the scraper class itself and all the subdomains. I'm not sure this is what you were going for, as your comment says "Index the primary domain and include their secondary domains minus the shared prefix". To me, this indicates that you aren't intending to include the primary domain's suffix, so I'm not sure what your desired output is.

I'm assuming that the current output of get_scraper_index is what you want and the comment is just slightly off, or I misunderstood it. To that end, I'm currently tidying up and fixing the now-broken tests.

mlduff commented 1 month ago

@jayaddison I have just pushed a commit which fixes the tests, incorporating your changes. One thing I would like to point out is the handling of the (mobkitchen.co.uk, mob.co.uk) and (sharing.kptncook.com, mobile.kptncook.com) cases. For the Mob case, I followed the same approach you did for CreativeCanning and PracticalSelfReliance, as they are different domains. However, for the KptnCook case, I kept them with the same class and added handling to check whether they actually have a shared prefix - if they don't, both are treated as primary domains (this is the behavior I had before your change).

What do you think of the above? I have also brought the branch up to date with the latest in this repository, to incorporate your changes and resolve merge conflicts. Side note - not sure why the weight watchers tests are failing?

jknndy commented 1 month ago

Side note - not sure why the weight watchers tests are failing?

See my commit above 33c69ae

Haven't had time to look over all the changes yet, but I did take a look at this. I'm not sure exactly what's causing the change, but the newly expected output is closer to matching the actual information on the site - so a good change in the end?

One error that is still coming up after the most recent changes can be seen below; I will probably look into this myself after work.

======================================================================
ERROR: test_includes (tests.library.test_readme.TestReadme)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/recipe-scrapers/recipe-scrapers/tests/library/test_readme.py", line 118, in test_includes
    scraper_index = get_scraper_index()
  File "/home/runner/work/recipe-scrapers/recipe-scrapers/tests/library/test_readme.py", line 40, in get_scraper_index
    [
  File "/home/runner/work/recipe-scrapers/recipe-scrapers/tests/library/test_readme.py", line 41, in <listcomp>
    domain.removeprefix(shared_prefix)
AttributeError: 'str' object has no attribute 'removeprefix'

======================================================================
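
For context, str.removeprefix was only added in Python 3.9, so an older interpreter in CI raises this AttributeError; a version-agnostic sketch of the same operation:

def strip_prefix(domain, shared_prefix):
    # Equivalent to str.removeprefix, which only exists on Python 3.9+.
    if domain.startswith(shared_prefix):
        return domain[len(shared_prefix):]
    return domain
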
mlduff commented 1 month ago

@jayaddison To recap, is there anything else you want me to do to get this PR ready to merge? Keen to get it in to avoid more merge conflicts.