Closed: jayaddison closed this 1 month ago
After considering the implications of a comment I wrote at https://github.com/hhursev/recipe-scrapers/pull/1064#issuecomment-2066792405, I thought it would be worth checking our existing scrapers for the potential that requests could be made to unexpected domains. `MarleySpoon` appeared as a possibility, so this changeset adds a safety measure against requests to API hosts that seem unrelated.
cc @jknndy @hhursev @strangetom for review
One more note: it should be possible to generalize this check so that it applies to other scrapers too; I've attempted to write it in a way that would allow for that. It's only written for `MarleySpoon` because that's the only place where I felt this problem could occur; in the other cases where we make multiple requests, as far as I can tell, we always use fully-qualified literal strings (with some allowance for templating) to define the request URL.
In addition, this problem can only currently affect legacy scrapers in the v14 branch of the codebase. That's not intended to be an argument to move to v15! We do lose functionality during that transition. But it helps to narrow the scrapers that should be checked for problems.
Am I correct in thinking that for marleyspoon, the API calls are always to `api.marleyspoon.com`, even if the recipe URL is (for example) `marleyspoon.de`?
@strangetom that does appear to be the case, yep. However, I'd be slightly reluctant to hard-code it, given that they've intentionally made it a variable in the page data. There are situations where doing that can allow for load-balancing / migrations / temporary maintenance by sending a portion of traffic to a different API endpoint, and it'd be nice to (safely) continue to respect that if we can.
Implementing Cross-Origin Resource Sharing adherence in the scraper could be another way to do this, in a more standards-compliant manner. I wasn't able to find any HTTP-client-side Python CORS libraries from a quick search (plenty of server-side ones), but perhaps there are some out there (or it might not be too onerous to implement basic support).
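A client-side CORS check might look roughly like the sketch below: treat the recipe page's origin as the requesting origin, and only accept an API response whose `Access-Control-Allow-Origin` header permits it. This is an illustrative sketch only (the function names are hypothetical, and it ignores credentials, preflight requests, `Vary`, and other edge cases a real implementation would need):

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Reduce a URL to its origin: scheme://host[:port]."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def cors_allows(page_url, allow_origin_header):
    """Rough client-side CORS decision: would a browser expose the response
    to page_url's origin, given the Access-Control-Allow-Origin value?
    """
    if allow_origin_header == "*":
        return True
    return allow_origin_header == origin_of(page_url)
```

The scraper would then refuse to parse an API response whose header does not allow the originating recipe page's origin.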
Alternatively, perhaps we could check whether the URL found in the JavaScript configuration corresponds to an entry in the `SCRAPERS` map for the same scraper instance?
Explained alternatively:

1. We receive `html` and `org_url` as usual.
2. `org_url` maps to a `ScraperCls` from `SCRAPERS` as usual.
3. `ScraperCls` wants to make an additional HTTP request: `next_url`.
4. `next_url` maps to a `NextScraperCls` from `SCRAPERS` (as in step 2).
5. Is `NextScraperCls` the same scraper as `ScraperCls`?
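That cross-check could be sketched roughly as follows. Note this is a hypothetical illustration, not the library's actual code: the `SCRAPERS` contents here are stand-ins, and a real lookup would need to handle subdomains (e.g. `api.marleyspoon.com`) and unknown hosts:

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the real SCRAPERS registry: host -> scraper class.
SCRAPERS = {
    "marleyspoon.com": "MarleySpoonScraper",
    "marleyspoon.de": "MarleySpoonScraper",
    "someothersite.example": "OtherScraper",
}


def host_of(url):
    """Return a URL's host, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_scraper(org_url, next_url):
    """True if next_url resolves to the same scraper class as org_url."""
    org_cls = SCRAPERS.get(host_of(org_url))
    next_cls = SCRAPERS.get(host_of(next_url))
    return org_cls is not None and org_cls == next_cls
```

If `same_scraper(org_url, next_url)` is false, the additional request would be refused.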
I'd like to include this in the next `v14` release, probably early next week, unless anyone has concerns about it. It would restrict the HTTP requests that the `MarleySpoon` scrapers could make, but only by limiting those to domains we've configured `MarleySpoon` scraper mappings for.
Adds a sense-checking step to ensure that the API URL returned in the `MarleySpoon` script elements refers to a second-level domain containing `marleyspoon`.

As far as I know, there's currently no standardized, machine-readable way to declare inter-related ownership of a group of distinct Internet domain names. I could be mistaken, though; if so, perhaps we could use that as a better alternative to checking for the brand name as found in the scraper name.