hhursev / recipe-scrapers

Python package for scraping recipe data
MIT License

Error scraping recipe from... #1214

Open · JosephAlanLane opened 3 months ago

JosephAlanLane commented 3 months ago

I tried the tutorial in the readme, and then went through some sites on the list. Only one worked: https://thinlicious.com/low-carb-pumpkin-cream-cheese-swirl-muffins/

These did not:

```
Error scraping recipe from https://www.budgetbytes.com/honey-garlic-chicken/: This should be implemented.
Error scraping recipe from https://allthehealthythings.com/healthy-slow-cooker-chili/: This should be implemented.
Error scraping recipe from https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/: This should be implemented.
Error scraping recipe from https://www.ambitiouskitchen.com/gochujang-chicken-sandwiches/: This should be implemented.
Error scraping recipe from https://www.yummly.com/recipe/Sinigang-na-Bangus-sa-Bayabas-9345315: This should be implemented.
```
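
For context, a minimal sketch of the kind of call that produces these errors, assuming the v15.0.0 tutorial's requests-based flow; the next comment traces the failures to the HTTP request headers used there:

```python
import requests
from recipe_scrapers import scrape_html

url = "https://www.budgetbytes.com/honey-garlic-chicken/"

# A plain fetch without browser-like headers: many sites respond with
# an anti-bot or consent page, so no recipe data is present in the
# HTML and the scraper's methods fail with "This should be implemented."
html = requests.get(url).content
scraper = scrape_html(html=html, org_url=url)
scraper.ingredients()  # raises: This should be implemented.
```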

jayaddison commented 3 months ago

Thank you @JosephAlanLane for the report. This seems to be due to the HTTP request headers used in the tutorials for version 15.0.0 of the library - I can replicate the same errors, and I find that using scrape_me from version 14.58.2 of the library works fine.
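
For reference, the working 14.x pattern is the classic scrape_me entry point; a minimal sketch, assuming the 14.58.2 API:

```python
from recipe_scrapers import scrape_me

# scrape_me performs the HTTP request itself in the 14.x releases.
scraper = scrape_me("https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/")
print(scraper.title())
print(scraper.ingredients())
```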

Please could you let me know whether the following works for you; if so, I'll update the documentation accordingly:

```python
>>> from recipe_scrapers import HEADERS, scrape_html
>>> import requests
>>> url = 'https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/'
>>> html = requests.get(url, headers=HEADERS).content
>>> scraper = scrape_html(html=html, org_url=url)
>>> scraper.ingredients()
```

JosephAlanLane commented 3 months ago

Thank you for the quick follow up @jayaddison, happy to see this project is still going!

As for the code that you provided, I received this error: ImportError: cannot import name 'HEADERS' from 'recipe_scrapers' (/opt/anaconda3/lib/python3.12/site-packages/recipe_scrapers/__init__.py)

I also tried version 14.58.2, reloaded my Jupyter kernel, and it works! Thank you for the help.
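
If anyone else hits the ImportError, one possible cause is a stale environment or kernel importing an older install; a quick standard-library check of what the interpreter actually sees (a sketch using importlib.metadata):

```python
from importlib.metadata import version

import recipe_scrapers

print(version("recipe-scrapers"))  # version of the installed distribution
print(recipe_scrapers.__file__)    # path of the copy actually imported
```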

jknndy commented 3 months ago

Tested locally on a fresh install and couldn't replicate the issue regarding the HEADERS import

full run:

```python
Python 3.12.4
Type "help", "copyright", "credits" or "license" for more information.
>>> from recipe_scrapers import HEADERS, scrape_html
>>> import requests
>>> url = 'https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/'
>>> html = requests.get(url, headers=HEADERS).content
>>> scraper = scrape_html(html=html, org_url=url)
>>> scraper.ingredients()
['1 (1 1/4-inch thick) porterhouse steak, preferably grass-fed', '1 teaspoon kosher salt', 'vegetable oil (for spritzing the newspaper)']
>>> scraper.title()
'Dry-Aged Chimney Porterhouse'
```

jayaddison commented 3 months ago

Although it may work, on further thought I have to admit to feeling slightly uneasy about placing an example that uses the somewhat-internal (although importable) HEADERS into the README examples. Under a strict reading of anti-circumvention rules, it could seem like we'd be encouraging a way to retrieve recipes that bypasses a spam-prevention measure.

The alternatives I've considered so far are:

  • Do nothing - people would have to learn the reasons why these HTTP requests are denied on their own.

  • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.

  • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

The second and third ideas aren't necessarily mutually exclusive.

jayaddison commented 3 months ago

> The alternatives I've considered so far are:
>
> • Do nothing - people would have to learn the reasons why these HTTP requests are denied on their own.
>
> • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.
>
> • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

One more idea: retain the use of an example/test user-agent header in the sample code in the README, but use a more standards-compliant format.

Perhaps there is already a well-known test/example user-agent header in a technical specification somewhere.

It wouldn't be a guarantee that sites would accept requests using that format -- but a non-standard format seems like it might be rejected due to parsing/validation failures. I don't know if that's what is happening in the cases reported here, but we could check.
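
For what it's worth, RFC 9110 defines User-Agent as one or more product tokens, each optionally followed by a parenthesized comment. A standards-compliant value for this library might look something like the sketch below; the exact string is hypothetical, not a decided format:

```python
import requests

# Hypothetical product-token form per the RFC 9110 User-Agent grammar:
# "name/version" followed by a comment carrying a contact URL.
headers = {"User-Agent": "recipe-scrapers/15.0.0 (+https://github.com/hhursev/recipe-scrapers)"}
response = requests.get("https://example.com/", headers=headers)
```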

jayaddison commented 3 months ago

> Perhaps there is already a well-known test/example user-agent header in a technical specification somewhere.

I haven't been able to locate a standard, generic user-agent of this kind. I am aware that a few years ago, some browsers suggested dropping the user-agent HTTP header entirely -- but to the best of my knowledge, that idea seems not to have gained traction.

> One more idea: retain the use of an example/test user-agent header in the sample code in the README, but use a more standards-compliant format.

This wouldn't really solve my original concern about the provision of a known-widely-accepted user-agent in example code potentially being a circumvention measure.

It would also increase the length of the example code quite a lot and reduce its comprehensibility. When learning code, I think it's very useful to be introduced to a small number of ideas and code tokens, and to be able to ask about and get useful explanations for all of them. The user-agent header, in my opinion, opens a raft of complicated and largely irrelevant learning conversations. Admittedly some of them are interesting in terms of web history, protocol/client negotiation, standards-setting and so on -- but if the teaching goal is to retrieve a recipe and parse it with a couple of lines of code, then the balance tips towards omitting the header from the tutorials.

My current preference is to adopt options two and three:

> • Use the online=True flag in the README.rst examples - people would implicitly use the HEADERS, but we wouldn't be encouraging their adoption in external code/practices.

This should provide a good initial developer and teaching experience with the library.
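
For illustration, the README example under option two might look like the sketch below, assuming v15's scrape_html accepts html=None together with online=True (the exact signature should be checked before documenting this):

```python
from recipe_scrapers import scrape_html

url = "https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/"

# online=True asks the library to perform the HTTP request itself, so
# the example never mentions headers at all. Sketch only: verify the
# v15 signature before adopting this in the README.
scraper = scrape_html(html=None, org_url=url, online=True)
print(scraper.title())
print(scraper.ingredients())
```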

> • Continue to use the HEADERS in the example, but add a (recipe-scrapers) substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

Done carefully -- by evaluating an acceptable level of uniqueness and compatibility for the user-agent -- I think this should allow retaining good initial user experience while also allowing recipe sites to identify traffic from this library if they want to.
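
Roughly, option three might amount to something like the following; the header value is hypothetical and shown only to illustrate the (recipe-scrapers) marker idea:

```python
import requests
from recipe_scrapers import scrape_html

url = "https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/"

# Hypothetical: a browser-compatible user-agent carrying a
# "(recipe-scrapers)" marker, so sites can identify (and, if they
# choose, selectively block) traffic from this library.
headers = {"User-Agent": "Mozilla/5.0 (recipe-scrapers; +https://github.com/hhursev/recipe-scrapers)"}
html = requests.get(url, headers=headers).content
scraper = scrape_html(html=html, org_url=url)
print(scraper.title())
```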

Currently this is prototyped in #1221 (evaluation of the updated user-agent string is pending).