Closed smilerz closed 2 months ago
@smilerz while you're making adjustments here, I've got a change request that I'm thinking about, but want to ask you first.
I'd prefer if we could use a parent class for the ATK domains, and then implement each of the alternate domains by subclassing that and overriding host()
. It could help to make the SCRAPERS
entries a bit clearer I think, and may reduce potential conflict with #1105.
It's basically an information architecture thought - trying to keep a clearer link for various cases such as grep
, Python interpreter help(...)
calls, etc between the brand name, domain name, and implementation. Let me know what you think.
So there would be a class CooksIllustrated(AmericasTestKitchen)
with an override on the host()
method?
Am I understanding correctly?
So there would be a
class CooksIllustrated(AmericasTestKitchen)
with an override on thehost()
method?Am I understanding correctly?
Yep, exactly :+1:
I don't have strong feelings about it, consistency is probably more important than the details. But in terms of that PR doesn't that still leave some of the sites with wildcards without a solution?
I don't have strong feelings about it, consistency is probably more important than the details. But in terms of that PR doesn't that still leave some of the sites with wildcards without a solution?
Possibly, although I'm not sure I completely understand the wildcard problem yet. My sense is that using per-domainname classes for ATK would allow it to appear more like any other pair of named, fixed-domain-name scrapers -- doing so would not help to resolve #1105, but it would avoid introducing an edge case for that code to handle.
Whether attempting to keep a structured domain-name-to-classname mapping is viable longer-term is a question that I expect reality to challenge us on, in any case.
If you're curious, see also #1069 where I've proposed something like cross-domain web request filtering based on scraper classnames (only relevant for v14
).
Possibly, although I'm not sure I completely understand the wildcard problem yet. My sense is that using per-domainname classes for ATK would allow it to appear more like any other pair of named, fixed-domain-name scrapers -- doing so would not help to resolve #1105, but it would avoid introducing an edge case for that code to handle.
I started to subclass those two domains and am now wondering if it is worth it. ATK is behind a paywall, so the only way to parse the recipes is to copy the page source after the site has loaded. For cooksillustrated.com and cookscountry.com the page is always ATK (at least for every page I've tried, perhaps there are outliers?).
Subclassing is trivial and would offer coverage for the odd exception, if they exist, but in any unittests I create the output of canonical_url()
is going to be americastestkitchen.com/.../...
which seems weird. What do you think?
Thanks @smilerz - yep, it may seem strange, but I think that is OK.
host()
/ test directory name is where a recipe can be retrieved from (HTTP GET ...
).hellofresh
, cookingcircle
and practicalselfreliance
).Things could become particularly complicated if a recipe website sometimes redirects to another domain, with a different HTML format, and sometimes does not, and we want to support both domains. I'm not aware of that happening at the moment, but will check.
OK, made several changes:
normalize_string()
self.ingredients()
Thanks @smilerz - this looks good to me, and thanks for re-using the HTML filtering code. I'm confused by the Codacy static analysis errors, because we've removed the domain=
argument from the ATK base class host
classmethod.
Thanks @smilerz - this looks good to me, and thanks for re-using the HTML filtering code. I'm confused by the Codacy static analysis errors, because we've removed the
domain=
argument from the ATK base classhost
classmethod.
I looked at that too. I got it wrong initially, but went back and fixed it - not sure why it is still complaining.
Could you try merging the latest commits from v15
into this branch? Let's find out whether that resolves the problem.
Could you try merging the latest commits from
v15
into this branch? Let's find out whether that resolves the problem.
nope - so weird.
Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).
Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).
I can change it to use 2 variables if you'd prefer.
Ok - perhaps it is comparing against the base branch, and not within the current commit (it's arguably a backwards-incompatible change, although it only exists within release-candidate releases so far).
I can change it to use 2 variables if you'd prefer.
Thanks for offering, but in this case it's OK - I'm confident it's not a regression/problem.
Some recipes are generate exceptions with default
scraper.instructions()
changing to retrieving data from_get_additional_details()
Strip leading
\n
when headnote doesn't exist.Update
instructions_group()
to return a list instead of a string.