mealie-recipes / mealie

Mealie is a self-hosted recipe manager and meal planner with a REST API backend and a reactive frontend application built in Vue for a pleasant user experience for the whole family. Easily add recipes to your database by providing the URL, and Mealie will automatically import the relevant data, or add a family recipe with the UI editor.
https://docs.mealie.io
GNU Affero General Public License v3.0

[SCRAPER] - No way to scrape sites with required login (such as Blue Apron) #4027

Closed. zeekaran closed this issue 1 month ago

zeekaran commented 2 months ago

Please provide 1-5 example URLs that are having errors

https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers

Please provide your logs for the Mealie container: docker logs <container-id> > mealie.logs

INFO 2024-08-12T18:20:14 - HTTP Request: GET https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers "HTTP/1.1 403 Forbidden"
INFO 2024-08-12T18:20:14 - [192.168.1.1:0] 400 Bad Request "POST /api/recipes/create-url HTTP/1.1"

Deployment

Docker (Linux)
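
The two log lines in this report carry different information: the first is the upstream request Mealie made to the recipe site (403 Forbidden), and the second is Mealie's own API answering the import call (400 Bad Request). A minimal sketch for pulling the upstream status out of such a line is shown here. The names `HTTPX_LINE`, `parse_upstream_request`, and `describe` are hypothetical helpers written for this illustration and are not part of Mealie; the regular expression assumes the `HTTP Request: ...` line format seen in the logs above.

```python
import re

# Matches the upstream request line as it appears in the logs above, e.g.
#   HTTP Request: GET https://example.com/x "HTTP/1.1 403 Forbidden"
# (assumed format; adjust the pattern if your log lines differ)
HTTPX_LINE = re.compile(
    r'HTTP Request: (?P<method>[A-Z]+) (?P<url>\S+) '
    r'"HTTP/(?P<version>[\d.]+) (?P<status>\d{3}) (?P<reason>[^"]*)"'
)


def parse_upstream_request(line):
    """Return (method, url, status) for an upstream request line, else None."""
    match = HTTPX_LINE.search(line)
    if match is None:
        return None
    return match["method"], match["url"], int(match["status"])


def describe(status):
    """Give a rough, non-authoritative reading of an upstream status code."""
    if status == 401:
        return "the site asked for credentials"
    if status == 403:
        # The status alone cannot tell a login wall from bot blocking.
        return "the site refused the request"
    if 200 <= status < 300:
        return "the page was fetched; any failure is in the parsing step"
    return "other upstream response"
```

Run against the report's first log line, `parse_upstream_request` yields the method, the Blue Apron URL, and 403; the second line (Mealie's own 400) does not match and yields `None`, which keeps the two failures separate.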

broyuken commented 2 months ago

The odd thing is these recipes don't need you to be logged in to scrape. It just can't grab them. I was trying to import this one as well, but it fails too.

https://www.blueapron.com/recipes/roasted-sweet-potato-caramelized-onion-pizza-with-creamy-bechamel-fontina-cheese-arugula-salad

The debugger just says "recipe_scrapers was unable to scrape this URL"

Edit: Adding my logs

[DEBUG|locale|L140] 2024-08-15T15:38:47: Language set to en
[DEBUG|_config|L80] 2024-08-15T15:38:47: load_ssl_context verify=True cert=None trust_env=True http2=False
[DEBUG|_config|L146] 2024-08-15T15:38:47: load_verify_locations cafile='/opt/pysetup/.venv/lib/python3.10/site-packages/certifi/cacert.pem'
[DEBUG|_trace|L85] 2024-08-15T15:38:47: connect_tcp.started host='www.blueapron.com' port=443 local_address=None timeout=15 socket_options=None
[DEBUG|_trace|L85] 2024-08-15T15:38:47: connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f137b364b80>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: start_tls.started ssl_context=<ssl.SSLContext object at 0x7f137937e440> server_hostname='www.blueapron.com' timeout=15
[DEBUG|_trace|L85] 2024-08-15T15:38:47: start_tls.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x7f137b367bb0>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: send_request_headers.started request=<Request [b'GET']>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: send_request_headers.complete
[DEBUG|_trace|L85] 2024-08-15T15:38:47: send_request_body.started request=<Request [b'GET']>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: send_request_body.complete
[DEBUG|_trace|L85] 2024-08-15T15:38:47: receive_response_headers.started request=<Request [b'GET']>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: receive_response_headers.complete return_value=(b'HTTP/1.1', 403, b'Forbidden', [(b'Connection', b'keep-alive'), (b'Content-Length', b'583'), (b'Content-Type', b'text/ht
[INFO|_client|L1773] 2024-08-15T15:38:47: HTTP Request: GET https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers "HTTP/1.1 403 Forbidden"
[DEBUG|_trace|L85] 2024-08-15T15:38:47: receive_response_body.started request=<Request [b'GET']>
[DEBUG|_trace|L85] 2024-08-15T15:38:47: receive_response_body.complete
[DEBUG|_trace|L85] 2024-08-15T15:38:47: response_closed.started
[DEBUG|_trace|L85] 2024-08-15T15:38:47: response_closed.complete
[DEBUG|_trace|L85] 2024-08-15T15:38:47: close.started
[DEBUG|_trace|L85] 2024-08-15T15:38:47: close.complete
[DEBUG|scraper_strategies|L226] 2024-08-15T15:38:47: Recipe Scraper [Package] was unable to extract a recipe from https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers
[DEBUG|scraper_strategies|L226] 2024-08-15T15:38:47: Recipe Scraper [Package] was unable to extract a recipe from https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers
[INFO|httptools_impl|L466] 2024-08-15T15:38:47: 10.0.86.92:0 - "POST /api/recipes/create-url HTTP/1.1" 400
[INFO|httptools_impl|L466] 2024-08-15T15:39:05: 127.0.0.1:40346 - "GET /api/app/about HTTP/1.1" 200
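
The trace above shows the request failing at the HTTP layer (403 Forbidden) before any recipe parsing happens. One way to reproduce that outside Mealie is to request the same URL from a plain script and compare the status code with and without extra headers. The sketch below uses only the Python standard library; `fetch_status` is a hypothetical helper, the 15 second timeout simply mirrors the `timeout=15` in the trace, and the idea that the site treats clients differently based on their headers is an assumption, not something the logs confirm.

```python
import urllib.error
import urllib.request


def fetch_status(url, headers=None, timeout=15.0):
    """Return the HTTP status code for a GET request to url.

    headers is an optional dict of extra request headers. Error responses
    such as 403 are returned as codes instead of being raised.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is what we want to inspect.
        return err.code


# Example usage (performs real network requests, so left commented out).
# The User-Agent value is purely illustrative:
# url = "https://www.blueapron.com/recipes/romesco-lasagna-with-mozzarella-spinach-roasted-red-peppers"
# print(fetch_status(url))
# print(fetch_status(url, {"User-Agent": "Mozilla/5.0"}))
```

If both calls return 403, the block is probably not header-based and a login or other server-side check may be involved; if they differ, the failure is more likely tied to how the client identifies itself. Either result is only a hint about where to look next.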

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.