hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.61k stars 505 forks source link

MarleySpoon: add precautionary check for unexpected API URLs. #1069

Closed jayaddison closed 1 month ago

jayaddison commented 2 months ago

Adds a sense-checking step to ensure that the API URL returned in the MarleySpoon script elements refers to a second-level-domain containing marleyspoon.

As far as I know, there's currently no standardized and machine-readable way to declare inter-related ownership of a group of distinct Internet domain names. I could be mistaken though; and if so, perhaps we could use that as a better alternative than checking for the brand name, as found in the scraper name.

jayaddison commented 2 months ago

After considering the implications of a comment I wrote at https://github.com/hhursev/recipe-scrapers/pull/1064#issuecomment-2066792405 I thought it'd be worth checking our existing scrapers for the potential that requests could be made to unexpected domains, and MarleySpoon appeared as a possibility, so this changeset adds a safety measure against requests to API hosts that seem unrelated.

cc @jknndy @hhursev @strangetom for review

jayaddison commented 2 months ago

One more note: it should be possible to generalize this check so that it could apply to other scrapers too; I've attempted to write it in a way that would allow for that.

It's only written for MarleySpoon because that's the only place that I found where I felt that this problem could occur; in the other cases where we make multiple requests, as far as I can tell, we always use fully-qualified literal strings (with some allowance for templating) to define the request URL.

In addition, this problem can only currently affect legacy scrapers in the v14 branch of the codebase. That's not intended to be an argument to move to v15! We do lose functionality during that transition. But it helps to narrow the scrapers that should be checked for problems.

strangetom commented 2 months ago

Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

jayaddison commented 2 months ago

Am I correct in thinking that for marleyspoon, the API calls are always to api.marleyspoon.com, even if the recipe URL is (for example) marleyspoon.de?

@strangetom that does appear to be the case, yep. However, I'd be slightly reluctant to hard-code it, given that they've intentionally made it a variable in the page data. There are situations where doing that can allow for load-balancing / migrations / temporary maintenance by sending a portion of traffic to a different API endpoint, and it'd be nice to (safely) continue to respect that if we can.

jayaddison commented 2 months ago

Implementing Cross-Origin Resource Sharing adherence in the scraper could be another way to do this, in a more standards-compliant manner. I wasn't able to find any HTTP-client-side Python CORS libraries from a quick search (plenty of server-side ones), but perhaps there are some out there (or it might not be too onerous to implement basic support).

jayaddison commented 2 months ago

Alternatively perhaps we could check wheter the URL found in the JavaScript configuration corresponds to an entry in the SCRAPERS map for the same scraper instance?

Explained alternatively:

jayaddison commented 2 months ago

I'd like to include this in the next v14 release, probably early next week, unless anyone has concerns about it. It would restrict the HTTP requests that the MarleySpoon scrapers could make, but only by limiting those to domains we've configured MarleySpoon scraper mappings for.