JosephAlanLane opened 3 months ago
Thank you @JosephAlanLane for the report. This seems to be due to the HTTP request headers used in the tutorials for version 15.0.0 of the library - I can replicate the same errors, and I find that using `scrape_me` from version 14.58.2 of the library works fine.
Please could you let me know whether the following works for you; if so, I'll update the documentation accordingly:
```python
>>> from recipe_scrapers import HEADERS, scrape_html
>>> import requests
>>> url = 'https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/'
>>> html = requests.get(url, headers=HEADERS).content
>>> scraper = scrape_html(html=html, org_url=url)
>>> scraper.ingredients()
```
Thank you for the quick follow-up @jayaddison, happy to see this project is still going!

As for the code that you provided, I received this error: `ImportError: cannot import name 'HEADERS' from 'recipe_scrapers' (/opt/anaconda3/lib/python3.12/site-packages/recipe_scrapers/__init__.py)`

I also tried version 14.58.2, reloaded my Jupyter kernel, and it works! Thank you for the help.
Tested locally on a fresh install and couldn't replicate the issue regarding the `HEADERS` import.
Although it may work, after further thought I have to admit to feeling slightly uneasy about placing an example that uses the somewhat-internal (although available) `HEADERS` into the `README` file examples. I think that's because I feel that, under a strict reading of anti-circumvention techniques, it could seem like we'd be encouraging a way to retrieve recipes that bypasses a spam-prevention measure.
The alternatives I've considered so far are:

1. Do nothing - people would have to learn the reasons why these HTTP requests are denied on their own.
2. Use the `online=True` flag in the `README.rst` examples - people would implicitly use the `HEADERS`, but we wouldn't be encouraging their adoption in external code/practices.
3. Continue to use the `HEADERS` in the example, but add a `(recipe-scrapers)` substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

The second and third ideas aren't necessarily mutually exclusive.
One more idea: retain the use of an example/test `user-agent` header in the sample code in the `README`, but use a more standards-compliant format. Perhaps there is already a well-known test/example `user-agent` header in a technical specification somewhere.

It wouldn't be a guarantee that sites would accept requests using that format -- but a non-standard format seems like it might be rejected due to parsing/validation failures. I don't know if that's what is happening in the cases reported here, but we could check.
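As a reference point, RFC 9110 defines `User-Agent` as one or more product tokens (`name/version`) with optional parenthesised comments. A minimal sketch of a shape-check against a simplified version of that grammar (the regex below is an approximation for illustration, not a full parser; the example strings are hypothetical, not the library's actual values):

```python
import re

# Simplified shape of RFC 9110's User-Agent grammar:
#   User-Agent = product *( RWS ( product / comment ) )
#   product    = token [ "/" product-version ]
# Approximations: plain spaces stand in for RWS, and nested
# comments are not supported.
TOKEN = r"[!#$%&'*+.^_`|~0-9A-Za-z-]+"
PRODUCT = rf"{TOKEN}(?:/{TOKEN})?"
COMMENT = r"\([^()]*\)"
UA_SHAPE = re.compile(rf"{PRODUCT}(?: +(?:{PRODUCT}|{COMMENT}))*")

def looks_standards_compliant(user_agent: str) -> bool:
    """True when the string fits the simplified product/comment shape."""
    return UA_SHAPE.fullmatch(user_agent) is not None
```

For example, `looks_standards_compliant("recipe-scrapers/15.0.0 (+https://github.com/hhursev/recipe-scrapers)")` passes the check, while a string containing characters outside the token grammar does not. Whether any given site's validation is this strict is exactly the open question above.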
> Perhaps there is already a well-known test/example `user-agent` header in a technical specification somewhere.

I haven't been able to locate a standard, generic `user-agent` of this kind. I am aware that a few years ago, some browsers suggested dropping the `User-Agent` HTTP header entirely -- but to the best of my knowledge, that idea seems not to have gained traction.
> One more idea: retain the use of an example/test `user-agent` header in the sample code in the `README`, but use a more standards-compliant format.

This wouldn't really solve my original concern about the provision of a known-widely-accepted user-agent in example code potentially being a circumvention measure.

It would also increase the length of the example code quite a lot, and reduce its comprehensibility. When learning code, I think it's very useful to be introduced to a small number of ideas and code tokens, and to be able to ask about and get useful explanations for all of them. The user-agent header, in my opinion, opens a raft of complicated and irrelevant learning conversations. Admittedly some of them are interesting in terms of web history, protocol/client negotiation, standards-setting and so on -- but if the teaching goal is to retrieve a recipe and parse it using a couple of lines of code, then the balance moves towards omitting the header from the tutorials.
My preferred options currently are to accept options two and three:

- Use the `online=True` flag in the `README.rst` examples - people would implicitly use the `HEADERS`, but we wouldn't be encouraging their adoption in external code/practices.

  This should provide a good initial developer and teaching experience with the library.

- Continue to use the `HEADERS` in the example, but add a `(recipe-scrapers)` substring into the user-agent header. Maybe many sites would still accept that, but it would allow sites to block this library more selectively if they really want to.

  Done carefully -- by evaluating an acceptable level of uniqueness and compatibility for the user-agent -- I think this should allow retaining good initial user experience while also allowing recipe sites to identify traffic from this library if they want to.

Currently this is prototyped in #1221 (evaluation of the updated user-agent string is pending).
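The tagging in the second preferred option could be as small as appending the comment to whatever base value the library ships. A minimal sketch, assuming the tag is appended verbatim (the base `User-Agent` below is a placeholder, not the library's actual `HEADERS` content, and the exact format used in #1221 may differ):

```python
# Sketch of the "(recipe-scrapers)" tagging idea: append a comment to a
# base User-Agent so that sites can identify (and, if they choose, block)
# traffic from this library. The base value is a placeholder -- in the
# library itself it would come from recipe_scrapers.HEADERS.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # placeholder value
}

def tag_user_agent(headers: dict, tag: str = "(recipe-scrapers)") -> dict:
    """Return a copy of ``headers`` with ``tag`` appended to the User-Agent."""
    tagged = dict(headers)  # avoid mutating the caller's dict
    tagged["User-Agent"] = f"{tagged['User-Agent']} {tag}"
    return tagged

headers = tag_user_agent(BASE_HEADERS)
# headers["User-Agent"] == "Mozilla/5.0 (X11; Linux x86_64) (recipe-scrapers)"
```

A parenthesised comment like this also stays within the RFC 9110 `User-Agent` grammar, which should keep the compatibility risk low for sites that validate the header strictly.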
I tried the tutorial in the readme, and then went through some sites on the list. Only one worked: https://thinlicious.com/low-carb-pumpkin-cream-cheese-swirl-muffins/
These did not:

- Error scraping recipe from https://www.budgetbytes.com/honey-garlic-chicken/: This should be implemented.
- Error scraping recipe from https://allthehealthythings.com/healthy-slow-cooker-chili/: This should be implemented.
- Error scraping recipe from https://altonbrown.com/recipes/dry-aged-chimney-porterhouse/: This should be implemented.
- Error scraping recipe from https://www.ambitiouskitchen.com/gochujang-chicken-sandwiches/: This should be implemented.
- Error scraping recipe from https://www.yummly.com/recipe/Sinigang-na-Bangus-sa-Bayabas-9345315: This should be implemented.
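For anyone wanting to reproduce this, here is a sketch of the kind of per-site loop that could produce a report like the one above. The shape is an assumption (the actual script wasn't shared), and the fetch/scrape steps are injected as callables - in practice they would be something like a `requests.get` call and `scrape_html` - so the loop itself stays library-agnostic:

```python
def check_urls(urls, fetch, scrape):
    """Try each URL with the injected fetch/scrape callables.

    Returns {url: None} on success, or {url: "Error scraping recipe from ..."}
    mirroring the messages quoted above. ``fetch(url)`` should return page
    HTML, and ``scrape(html, url)`` should return a scraper object with an
    ``ingredients()`` method (assumed interface).
    """
    results = {}
    for url in urls:
        try:
            scraper = scrape(fetch(url), url)
            scraper.ingredients()  # force parsing so failures surface here
            results[url] = None
        except Exception as exc:  # e.g. "This should be implemented."
            results[url] = f"Error scraping recipe from {url}: {exc}"
    return results
```

Substituting a stub `scrape` that raises `Exception("This should be implemented.")` reproduces the exact messages quoted above, which suggests the reported failures are exception messages surfaced verbatim rather than HTTP errors.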