hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Error scraping recipe from... #1214

Open · JosephAlanLane opened 3 months ago

JosephAlanLane commented 3 months ago

I tried the tutorial in the readme, and then went through some sites on the list. Only one worked: https://thinlicious.com/low-carb-pumpkin-cream-cheese-swirl-muffins/

These did not:

```
Error scraping recipe from https://www.budgetbytes.com/honey-garlic-chicken/: This should be implemented.
Error scraping recipe from https://allthehealthythings.com/healthy-slow-cooker-chili/: This should be implemented.
Error scraping recipe from https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/: This should be implemented.
Error scraping recipe from https://www.ambitiouskitchen.com/gochujang-chicken-sandwiches/: This should be implemented.
Error scraping recipe from https://www.yummly.com/recipe/Sinigang-na-Bangus-sa-Bayabas-9345315: This should be implemented.
```
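
For context, a minimal sketch of the kind of call that produces these errors, assuming the v15.0.0 tutorial's requests-based flow; the next comment traces the failures to the HTTP request headers used there:

```python
import requests
from recipe_scrapers import scrape_html

url = "https://www.budgetbytes.com/honey-garlic-chicken/"

# A plain fetch without browser-like headers: many sites respond with
# an anti-bot or consent page, so no recipe data is present in the
# HTML and the scraper's methods fail with "This should be implemented."
html = requests.get(url).content
scraper = scrape_html(html=html, org_url=url)
scraper.ingredients()  # raises: This should be implemented.
```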

jayaddison commented 3 months ago

Thank you @JosephAlanLane for the report. This seems to be due to the HTTP request headers used in the tutorials for version 15.0.0 of the library - I can replicate the same errors, and I find that using scrape_me from version 14.58.2 of the library works fine.
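
For reference, the working 14.x pattern is the classic scrape_me entry point; a minimal sketch, assuming the 14.58.2 API:

```python
from recipe_scrapers import scrape_me

# scrape_me performs the HTTP request itself in the 14.x releases.
scraper = scrape_me("https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/")
print(scraper.title())
print(scraper.ingredients())
```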

Please could you let me know whether the following works for you; if so, I'll update the documentation accordingly:

```python
>>> from recipe_scrapers import HEADERS, scrape_html
>>> import requests
>>> url = 'https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/'
>>> html = requests.get(url, headers=HEADERS).content
>>> scraper = scrape_html(html=html, org_url=url)
>>> scraper.ingredients()
```

JosephAlanLane commented 3 months ago

Thank you for the quick follow up @jayaddison, happy to see this project is still going!

As for the code that you provided, I received this error: ImportError: cannot import name 'HEADERS' from 'recipe_scrapers' (/opt/anaconda3/lib/python3.12/site-packages/recipe_scrapers/__init__.py)

I also tried version 14.58.2, reloaded my Jupyter kernel, and it works! Thank you for the help.
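
If anyone else hits the ImportError, one possible cause is a stale environment or kernel importing an older install; a quick standard-library check of what the interpreter actually sees (a sketch using importlib.metadata):

```python
from importlib.metadata import version

import recipe_scrapers

print(version("recipe-scrapers"))  # version of the installed distribution
print(recipe_scrapers.__file__)    # path of the copy actually imported
```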

jknndy commented 3 months ago

Tested locally on a fresh install and couldn't replicate the issue regarding the HEADERS import

full run:

```python
Python 3.12.4
Type "help", "copyright", "credits" or "license" for more information.
>>> from recipe_scrapers import HEADERS, scrape_html
>>> import requests
>>> url = 'https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/'
>>> html = requests.get(url, headers=HEADERS).content
>>> scraper = scrape_html(html=html, org_url=url)
>>> scraper.ingredients()
['1 (1 1/4-inch thick) porterhouse steak, preferably grass-fed', '1 teaspoon kosher salt', 'vegetable oil (for spritzing the newspaper)']
>>> scraper.title()
'Dry-Aged Chimney Porterhouse'
```

jayaddison commented 3 months ago

Although it may work, on further thought I have to admit to feeling slightly uneasy about placing an example that uses the somewhat-internal (although importable) HEADERS into the README examples. Under a strict reading of anti-circumvention rules, it could seem like we'd be encouraging a way to retrieve recipes that bypasses a spam-prevention measure.

The alternatives I've considered so far are:

  • Do nothing - people would have to learn the reasons why these HTTP requests are denied on their own.

  • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.

  • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

The second and third ideas aren't necessarily mutually exclusive.

jayaddison commented 3 months ago

> The alternatives I've considered so far are:
>
> • Do nothing - people would have to learn the reasons why these HTTP requests are denied on their own.
>
> • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.
>
> • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

One more idea: retain the use of an example/test user-agent header in the sample code in the README, but use a more standards-compliant format.

Perhaps there is already a well-known test/example user-agent header in a technical specification somewhere.

It wouldn't be a guarantee that sites would accept requests using that format -- but a non-standard format seems like it might be rejected due to parsing/validation failures. I don't know if that's what is happening in the cases reported here, but we could check.
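
For what it's worth, RFC 9110 defines User-Agent as one or more product tokens, each optionally followed by a parenthesized comment. A standards-compliant value for this library might look something like the sketch below; the exact string is hypothetical, not a decided format:

```python
import requests

# Hypothetical product-token form per the RFC 9110 User-Agent grammar:
# "name/version" followed by a comment carrying a contact URL.
headers = {"User-Agent": "recipe-scrapers/15.0.0 (+https://github.com/hhursev/recipe-scrapers)"}
response = requests.get("https://example.com/", headers=headers)
```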

jayaddison commented 3 months ago

> Perhaps there is already a well-known test/example user-agent header in a technical specification somewhere.

I haven't been able to locate a standard, generic user-agent of this kind. I am aware that a few years ago, some browsers suggested dropping the user-agent HTTP header entirely -- but to the best of my knowledge, that idea seems not to have gained traction.

> One more idea: retain the use of an example/test user-agent header in the sample code in the README, but use a more standards-compliant format.

This wouldn't really solve my original concern about the provision of a known-widely-accepted user-agent in example code potentially being a circumvention measure.

It would also increase the length of the example code quite a lot and reduce its comprehensibility. When learning code, I think it's very useful to be introduced to a small number of ideas and code tokens, and to be able to ask about and get useful explanations for all of them. The user-agent header, in my opinion, opens a raft of complicated and largely irrelevant learning conversations. Admittedly some of them are interesting in terms of web history, protocol/client negotiation, standards-setting and so on -- but if the teaching goal is to retrieve a recipe and parse it with a couple of lines of code, then the balance tips towards omitting the header from the tutorials.

My current preference is to adopt options two and three:

> • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.

This should provide a good initial developer and teaching experience with the library.
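
For illustration, the README example under option two might look like the sketch below, assuming v15's scrape_html accepts html=None together with online=True (the exact signature should be checked before documenting this):

```python
from recipe_scrapers import scrape_html

url = "https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/"

# online=True asks the library to perform the HTTP request itself, so
# the example never mentions headers at all. Sketch only: verify the
# v15 signature before adopting this in the README.
scraper = scrape_html(html=None, org_url=url, online=True)
print(scraper.title())
print(scraper.ingredients())
```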

> • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

Done carefully -- by evaluating an acceptable level of uniqueness and compatibility for the user-agent -- I think this should allow retaining good initial user experience while also allowing recipe sites to identify traffic from this library if they want to.
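
Roughly, option three might amount to something like the following; the header value is hypothetical and shown only to illustrate the (recipe-scrapers) marker idea:

```python
import requests
from recipe_scrapers import scrape_html

url = "https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/"

# Hypothetical: a browser-compatible user-agent carrying a
# "(recipe-scrapers)" marker, so sites can identify (and, if they
# choose, selectively block) traffic from this library.
headers = {"User-Agent": "Mozilla/5.0 (recipe-scrapers; +https://github.com/hhursev/recipe-scrapers)"}
html = requests.get(url, headers=headers).content
scraper = scrape_html(html=html, org_url=url)
print(scraper.title())
```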

Currently this is prototyped in #1221 (evaluation of the updated user-agent string is pending).