hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

foodnetwork.com not scraping #1206

Open disconn3ct opened 3 months ago

disconn3ct commented 3 months ago

Pre-filing checks

recipe-scrapers version: v15.0.0

The URL of the recipe(s) that are not being scraped correctly: https://www.foodnetwork.com/recipes/food-network-kitchen/asparagus-fries-3908446

The results you expect to see: Asparagus fries recipe import (Mealie) ...

The results (including any Python error messages) that you are seeing:

Error: Request failed with status code 400
    NuxtJS 54
...
        _wrapper
[ad26c94.js:1:294004](https://recipes.foobar/_nuxt/ad26c94.js)
mealie INFO     2024-08-05T18:16:39 - HTTP Request: GET https://www.foodnetwork.com/recipes/food-network-kitchen/asparagus-fries-3908446 "HTTP/1.1 403 Forbidden"

...

thewolfman56 commented 3 months ago

I mentioned the same issue in #1119. I notice the scraping issues in Mealie, and then test the scraper itself using PowerShell.

jayaddison commented 3 months ago

Hi @disconn3ct - thanks for the bug report. I think the HTTP 403 response code is a clue here - for some reason the site denied access to the recipe.

I'm not sure what to suggest about this; are you able to open their homepage in a web browser and to open recipes from there?

williamkray commented 3 months ago

i'm also here due to this issue, coming from the mealie project. the related issue for their fallback behavior is here, along with further notes i've added during my debugging. it's entirely user-agent related; updating the user-agent string should fix it.

williamkray commented 3 months ago

i thought it was just mealie's fallback user-agent string, but they set that to match the header set in recipe_scrapers, which is what causes foodnetwork.com to return a 403.

root@0f0411e5a883:/# python
Python 3.12.5 (main, Aug  7 2024, 19:13:43) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0"})
<Response [403]>
>>> requests.get("https://www.foodnetwork.com/recipes/ina-garten/garlic-roasted-potatoes-recipe-1913067", headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"})
<Response [200]>
>>> 
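for anyone who just needs their imports working again, here's a rough workaround sketch -- it assumes recipe-scrapers >= 15 (for the scrape_html entry point) and that this particular firefox user-agent string keeps being accepted, neither of which is guaranteed:

import requests
from recipe_scrapers import scrape_html

url = "https://www.foodnetwork.com/recipes/food-network-kitchen/asparagus-fries-3908446"
# a user-agent string that foodnetwork.com accepted at the time of writing
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raises if the site still returns 403/400

scraper = scrape_html(response.text, org_url=url)
print(scraper.title())
print(scraper.ingredients())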
jayaddison commented 3 months ago

I'll allow this issue to stay open for a little while in case anyone has suggestions, but I have to admit that, based on my current understanding of the world wide web and HTTP, any solution we have here would likely be in response to specific hidden workings of foodnetwork.com -- and that's not the kind of protocol game that I think recipe-scrapers should get into.

jayaddison commented 3 months ago

Perhaps we could detect blockage and redirect to archived pages on archive.org or similar when available? But that would, essentially, be silently redirecting our users into the past. I don't think that's a progressive approach (archived HTML could be useful to check backwards compatibility, or for other scraper healthcheck processes - but it's less useful for everyday usage, in my opinion).
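Purely to illustrate the idea being weighed here (this is a sketch, not something the library does; it uses archive.org's public Wayback availability API, and the helper name is hypothetical):

import requests

def fetch_with_archive_fallback(url, headers=None):
    # Hypothetical helper: try the live site first, and only if it blocks us
    # (HTTP 403), look up an archived snapshot via the Wayback availability API.
    response = requests.get(url, headers=headers)
    if response.status_code == 403:
        lookup = requests.get(
            "https://archive.org/wayback/available", params={"url": url}
        ).json()
        snapshot = lookup.get("archived_snapshots", {}).get("closest")
        if snapshot and snapshot.get("available"):
            # Note: this silently serves the user a page from the past.
            response = requests.get(snapshot["url"], headers=headers)
    response.raise_for_status()
    return response.text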

williamkray commented 3 months ago

would it make sense to have a small array of user-agent strings to randomly pick from, and include some light retry logic? without telemetry it would be trickier to find the ones that aren't working (unless you implemented some kind of regular testing), but it would help alleviate the headache somewhat...
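something along these lines, purely as a sketch (the strings and retry count here are made up, and none of this exists in the library):

import random
import requests

# made-up pool of browser-like user-agent strings to rotate through
CANDIDATE_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
]

def fetch_html(url, attempts=3):
    # try a few user-agents in random order; return the first non-blocked response
    response = None
    for user_agent in random.sample(CANDIDATE_USER_AGENTS, k=min(attempts, len(CANDIDATE_USER_AGENTS))):
        response = requests.get(url, headers={"User-Agent": user_agent})
        if response.ok:
            return response.text
    response.raise_for_status()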

jayaddison commented 3 months ago

Personally, I think that path could take us into a series of likely-to-be-unsolvable technical and privacy challenges (how to confirm consent? how to confirm the authenticity of telemetry? where to store it?) that, in the worst case, could put the library at some risk, and therefore may not be the most respectful use of developers' and users' time.

I wouldn't necessarily block pull requests to move in that direction, but I'd much prefer for the library to attempt to be fair and consistent for all users, while avoiding techniques that could be seen as evasive by recipe websites (randomization of user-agent I think would fall into that category).

williamkray commented 3 months ago

oh yeah i'm not suggesting telemetry be added.

in a perfect world the library should be fair and consistent, but the fact of the matter is that its purpose is to bridge a gap between humans and websites, and those websites are funky and have weird rules that may be arbitrary. there will always be some level of stop-gap solution in place to make this middleware work. however, i recognize that treating it as a problem to be solved in this library is a sledgehammer for a very fine nail.

perhaps it is up to the mealie project (and others that use this library) to decide whether to leverage the headers provided by this library, or to instead introduce some of those workarounds for arbitrary rules set by various sites to provide a better user experience.

jayaddison commented 3 months ago

Thanks @williamkray. I'd misunderstood your previous message as suggesting that telemetry was necessary - after re-reading, I get that you were only highlighting the challenge of proceeding without it (I agree it's a bit of a challenge, but I also think it's preferable in some ways to rely on direct user-reported feedback; so let's continue as-is).

To recap and refresh the context a bit: recent user feedback seems to indicate that the headers from the library do help to retrieve content for a number of recipe sites; however, if I understand the behaviour in this thread correctly, this is a site where the built-in headers are explicitly blocked (the reverse of what we find elsewhere).

Although this was reported for version 15.0.0, the same problem seems to occur in previous versions (for example, 14.58.2 when using scrape_me, a deprecated API that gathers HTML itself and uses the built-in headers).
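For reference, that deprecated path looks roughly like this (assuming a 14.x install; scrape_me performs the HTTP request itself with the library's built-in headers, so the 403 surfaces inside the library):

# recipe-scrapers 14.x (deprecated API): the library fetches the page itself,
# using its built-in headers -- which foodnetwork.com rejects with a 403.
from recipe_scrapers import scrape_me

scraper = scrape_me("https://www.foodnetwork.com/recipes/food-network-kitchen/asparagus-fries-3908446")
print(scraper.title())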

I'll spend some time to consider what options we have here; I'm beginning to wonder whether we might want to somehow indicate that foodnetwork.com is less-supported -- as in: we theoretically have support, but online retrieval using the built-in headers won't work.

That wouldn't be ideal for us (reducing supported scrapers), but if the outcome is to reduce (denied) requests to their site, then I think that would be an improvement, and it would highlight to our users that help would be required to restore complete support.

Again: I'm still thinking through the options here, those are just some ideas so far.

williamkray commented 3 months ago

i appreciate the transparency into your thought processes.

in my opinion, as someone coming from the outside and seeing these projects for the first time, i'm tempted to suggest more separation in the core functionality of these projects. i would suggest that this library focus on being a parsing library and avoid the "online" topic altogether, leaving that as an implementation detail for any project that wants to fetch recipe site content to be parsed.

based on the example test code in the readme, this feels like what it was originally intended for (it's up to the user to fetch the site data with the requests library in this example) but perhaps somewhere along the line this project pulled in additional functionality for standalone use... i'm not clear on the history but i'm definitely familiar with "scope creep" :sweat_smile:

this would shift site support toward a straightforward question: either this library can read and parse the site content and turn it into a recipe, or it cannot (yet). how the application using the library gets that site data is up to the application developer, whether it's by forging UA headers in a get request, having users right-click and "save website as html" into a folder, etc.
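as a rough sketch of that model (assuming recipe-scrapers >= 15; the filename here is made up):

from pathlib import Path
from recipe_scrapers import scrape_html

# the application (or the user, via "save website as html") obtained this file;
# the library itself never touches the network.
url = "https://www.foodnetwork.com/recipes/food-network-kitchen/asparagus-fries-3908446"
html = Path("asparagus-fries.html").read_text(encoding="utf-8")

scraper = scrape_html(html, org_url=url)
print(scraper.title())
print(scraper.instructions())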

this change would be somewhat draconian, and would require announcements of deprecation and coordination among projects using the library (easily handled with strict versioning of course), but would potentially help eliminate confusion around duplicated functionality like this... this is currently both a mealie problem, and a recipe_scrapers problem because both potentially attempt to handle the online http request (i think), so there are duplicate bug reports.

jayaddison commented 3 months ago

Yep, that's mostly accurate @williamkray, and your perspective is fairly similar to the way I've thought about the library. In particular, the version 15 branch / release line moves gradually in the direction of removing online retrieval.

Regarding the history of the library though, I think simplicity has traditionally been the goal - allowing people to access recipe information with the minimum of fuss, and providing quite an impressive demonstration of what a small software library can do in the process (load a dependency, write a line of code, input a URL -- and you can explore a recipe). I think that's particularly beneficial for newcomers and people using this in small projects.

There's some tension/conflict in moving to pure-HTML based scraping because it naturally adds a bit of complexity to the tutorial/example code -- in fact, quite a lot of complexity, in my opinion, relative to previous versions. Sites are also more likely to block HTTP clients configured with default/vanilla settings (presumably because traffic from those tends to be unusual, spammy and automated -- even though not all usage of them is malign). So without some care, the learning/experimentation phase could also become unnecessarily frustrating.

I'd like to find a way to navigate and balance all of the above concerns.