hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

https://akispetretzikis.com stopped working #1235

Open giannis371 opened 1 month ago

giannis371 commented 1 month ago

3 months ago I added a recipe to the Tandoor app, but now it's not working anymore. Other websites work. Example recipe link: https://akispetretzikis.com/en/recipe/1564/pagwto-fistiki

jayaddison commented 1 month ago

Hi @giannis371 - thanks for the bug report! Could you confirm the version of recipe-scrapers in use in your Tandoor instance, and, if possible, any error output that occurs when attempting to read a recipe from that URL?

giannis371 commented 1 month ago

Hi @jayaddison, thanks for the fast reply. The version is recipe-scrapers==15.0.0. The error I get is: "Failure there was an error importing the recipe!" @smilerz from Tandoor found out that the page is a loading page without any recipe data.

smilerz commented 1 month ago

It appears the recipe is loaded asynchronously.

From Tandoor:

    scrape = scrape_html(org_url=url, html=html, supported_only=False)
  File "/opt/recipes/venv/lib/python3.12/site-packages/recipe_scrapers/__init__.py", line 844, in scrape_html
    return SCRAPERS[host_name](html=html, url=org_url)
  File "/opt/recipes/venv/lib/python3.12/site-packages/recipe_scrapers/akispetretzikis.py", line 10, in __init__
    self.soup.find("script", {"id": "__NEXT_DATA__"}).get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

jayaddison commented 1 month ago

Thanks @giannis371 @smilerz.

I was able to scrape the same recipe URL successfully using a standalone checkout of recipe-scrapers v15.0.0 (without Tandoor).

Note: recipe-scrapers doesn't contain any asynchronous request code, as far as I'm aware.
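The traceback above fails because the fetched page has no script element with the id `__NEXT_DATA__`. One way to compare environments is to check the HTML for that element before handing it to the scraper. The following is a minimal sketch using only the standard library; `NextDataFinder` and `has_next_data` are hypothetical names for illustration and are not part of recipe-scrapers or Tandoor.

```python
from html.parser import HTMLParser


class NextDataFinder(HTMLParser):
    """Records whether a <script id="__NEXT_DATA__"> element was seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the opening tag.
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self.found = True


def has_next_data(html: str) -> bool:
    """Return True if the HTML contains the script element named in the traceback."""
    finder = NextDataFinder()
    finder.feed(html)
    return finder.found
```

If the HTML was collected as bytes (for example via `.content` from requests), it would need to be decoded to a string first. A `False` result would suggest the server returned a page without the embedded recipe data, which points at the response rather than at the parsing step.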

jayaddison commented 1 month ago

However.. I did not set supported_only=False during my repro attempt. I will inspect the code and/or attempt again related to that soon.

smilerz commented 1 month ago

> However.. I did not set supported_only=False during my repro attempt. I will inspect the code and/or attempt again related to that soon.

weird - I get the same error running both
scrape = scrape_html(org_url=url, html=html, supported_only=False) and scrape = scrape_html(org_url=url, html=html)

What are you using to collect the html? Tandoor is using:

html = requests.get(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
).content

jayaddison commented 1 month ago

> What are you using to collect the html? Tandoor is using: [ ... snip ... ]

I was using the built-in HTTP headers from recipe-scrapers v15.0.0 (online=True) -- and so I think this is probably a case where we have a site that produces different results based on the HTTP client / browser in use.

Note / reminder that more-recent versions of recipe-scrapers now declare a custom user-agent header (ref #1219) -- I have not tested that one yet, but will do so as well, to compare.
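To test the idea that the site responds differently depending on the HTTP client, the same URL could be fetched with several User-Agent values and each response checked for the `__NEXT_DATA__` marker. This is only a sketch under assumptions: `fetch_html` and `compare_user_agents` are hypothetical helpers, the marker check is a plain substring match (so it would miss a differently quoted or ordered attribute), and it covers just one of the possible outside factors.

```python
import urllib.request
from typing import Callable, Dict, Iterable

# Assumption: a simple substring heuristic for the element the scraper reads.
MARKER = 'id="__NEXT_DATA__"'


def fetch_html(url: str, user_agent: str) -> str:
    """Fetch a URL with the given User-Agent header and return the decoded body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def compare_user_agents(
    url: str,
    user_agents: Iterable[str],
    fetch: Callable[[str, str], str] = fetch_html,
) -> Dict[str, bool]:
    """Map each User-Agent string to whether its response contains the marker."""
    return {ua: MARKER in fetch(url, ua) for ua in user_agents}
```

For example, `compare_user_agents(url, [firefox_ua, other_ua])` could be run with the Firefox string Tandoor sends and a second value of your choice. If every entry comes back the same, the User-Agent is probably not the deciding factor.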

smilerz commented 1 month ago

Now I'm really confused because scrape_html(org_url=url, html=None, online=True) generates the exact same error.

jayaddison commented 1 month ago

I'm not sure what to suggest here; I'm able to access the recipe using both 15.0.0 and 15.1.0 -- so to me it would seem that there is at least one factor outside of this codebase that determines whether the recipe HTML is provided at request-time.

giannis371 commented 1 month ago

Indeed that's very confusing.

smilerz commented 1 month ago

> I'm not sure what to suggest here; I'm able to access the recipe using both 15.0.0 and 15.1.0 -- so to me it would seem that there is at least one factor outside of this codebase that determines whether the recipe HTML is provided at request-time.

Agreed - I don't think this is software related.

giannis371 commented 1 month ago

So it means we just have to close the issue and remove the website from the scrapers list, or is there any chance to find a solution?

smilerz commented 1 month ago

FWIW, once loaded, the recipe metadata is still in Greek, so even if it worked, I'm not sure you would get what you were looking for.

jayaddison commented 1 month ago

> So it means we just have to close the issue and remove the website from the scrapers list, or is there any chance to find a solution?

We don't have to close the issue, but at the moment it seems unclear whether there is a technical solution. What benefits would removing the recipe scraper provide?

giannis371 commented 1 month ago

> FWIW, once loaded, the recipe metadata is still in Greek, so even if it worked, I'm not sure you would get what you were looking for.

No, you can just change the language to English, and the metadata is then also in English. Like I said, 5 months ago I added a recipe from this website and it's still in my Tandoor, working correctly. akispetretzikis.com/en

giannis371 commented 1 month ago

> > So it means we just have to close the issue and remove the website from the scrapers list, or is there any chance to find a solution?
>
> We don't have to close the issue, but at the moment it seems unclear whether there is a technical solution. What benefits would removing the recipe scraper provide?

Yes, you are right. You said before that it's working fine on your system, so the problem is not the scraper. That means other people can still use this website.

smilerz commented 1 month ago

> No, you can just change the language. Like I said, 5 months ago I added a recipe from this website and it's still in my Tandoor, working correctly.

They've obviously made a change to how the website works. I loaded the recipe using the bookmarklet and it's in Greek.