hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.6k stars 505 forks source link

Updates to AmericasTestKitchen scraper #1116

Closed smilerz closed 2 months ago

smilerz commented 2 months ago

Some recipes are generate exceptions with default scraper.instructions() changing to retrieving data from _get_additional_details()

Strip leading \n when headnote doesn't exist.

Update instructions_group() to return a list instead of a string.

jayaddison commented 2 months ago

@smilerz while you're making adjustments here, I've got a change request that I'm thinking about, but want to ask you first.

I'd prefer if we could use a parent class for the ATK domains, and then implement each of the alternate domains by subclassing that and overriding host(). It could help to make the SCRAPERS entries a bit clearer I think, and may reduce potential conflict with #1105.

It's basically an information architecture thought - trying to keep a clearer link for various cases such as grep, Python interpreter help(...) calls, etc between the brand name, domain name, and implementation. Let me know what you think.

smilerz commented 2 months ago

So there would be a class CooksIllustrated(AmericasTestKitchen) with an override on the host() method?

Am I understanding correctly?

jayaddison commented 2 months ago

So there would be a class CooksIllustrated(AmericasTestKitchen) with an override on the host() method?

Am I understanding correctly?

Yep, exactly :+1:

smilerz commented 2 months ago

I don't have strong feelings about it, consistency is probably more important than the details. But in terms of that PR doesn't that still leave some of the sites with wildcards without a solution?

jayaddison commented 2 months ago

I don't have strong feelings about it, consistency is probably more important than the details. But in terms of that PR doesn't that still leave some of the sites with wildcards without a solution?

Possibly, although I'm not sure I completely understand the wildcard problem yet. My sense is that using per-domainname classes for ATK would allow it to appear more like any other pair of named, fixed-domain-name scrapers -- doing so would not help to resolve #1105, but it would avoid introducing an edge case for that code to handle.

Whether attempting to keep a structured domain-name-to-classname mapping is viable longer-term is a question that I expect reality to challenge us on, in any case.

If you're curious, see also #1069 where I've proposed something like cross-domain web request filtering based on scraper classnames (only relevant for v14).

smilerz commented 2 months ago

Possibly, although I'm not sure I completely understand the wildcard problem yet. My sense is that using per-domainname classes for ATK would allow it to appear more like any other pair of named, fixed-domain-name scrapers -- doing so would not help to resolve #1105, but it would avoid introducing an edge case for that code to handle.

I started to subclass those two domains and am now wondering if it is worth it. ATK is behind a paywall, so the only way to parse the recipes is to copy the page source after the site has loaded. For cooksillustrated.com and cookscountry.com the page is always ATK (at least for every page I've tried, perhaps there are outliers?).

Subclassing is trivial and would offer coverage for the odd exception, if they exist, but in any unittests I create the output of canonical_url() is going to be americastestkitchen.com/.../... which seems weird. What do you think?

jayaddison commented 2 months ago

Thanks @smilerz - yep, it may seem strange, but I think that is OK.

Things could become particularly complicated if a recipe website sometimes redirects to another domain, with a different HTML format, and sometimes does not, and we want to support both domains. I'm not aware of that happening at the moment, but will check.

smilerz commented 2 months ago

OK, made several changes:

jayaddison commented 2 months ago

Thanks @smilerz - this looks good to me, and thanks for re-using the HTML filtering code. I'm confused by the Codacy static analysis errors, because we've removed the domain= argument from the ATK base class host classmethod.

smilerz commented 2 months ago

Thanks @smilerz - this looks good to me, and thanks for re-using the HTML filtering code. I'm confused by the Codacy static analysis errors, because we've removed the domain= argument from the ATK base class host classmethod.

I looked at that too. I got it wrong initially, but went back and fixed it - not sure why it is still complaining.

jayaddison commented 2 months ago

Could you try merging the latest commits from v15 into this branch? Let's find out whether that resolves the problem.

smilerz commented 2 months ago

Could you try merging the latest commits from v15 into this branch? Let's find out whether that resolves the problem.

nope - so weird.

jayaddison commented 2 months ago

Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).

smilerz commented 2 months ago

Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).

I can change it to use 2 variables if you'd prefer.

jayaddison commented 2 months ago

Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).

I can change it to use 2 variables if you'd prefer.

Thanks for offering, but in this case it's OK - I'm confident it's not a regression/problem.