Open giannis371 opened 1 month ago
Hi @giannis371 - thanks for the bug report! Could you confirm the version of recipe-scrapers in use in your Tandoor instance and, if possible, share any error output that occurs when attempting to read a recipe from that URL?
Hi @jayaddison, thanks for the fast reply. The version is recipe-scrapers==15.0.0. The error I get is: "Failure there was an error importing the recipe!" @smilerz from Tandoor found out that the page is a loading page without any recipe data.
It appears the recipe is loaded asynchronously.
From Tandoor:
    scrape = scrape_html(org_url=url, html=html, supported_only=False)
  File "/opt/recipes/venv/lib/python3.12/site-packages/recipe_scrapers/__init__.py", line 844, in scrape_html
    return SCRAPERS[host_name](html=html, url=org_url)
  File "/opt/recipes/venv/lib/python3.12/site-packages/recipe_scrapers/akispetretzikis.py", line 10, in __init__
    self.soup.find("script", {"id": "__NEXT_DATA__"}).get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
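For anyone trying to reproduce this: the traceback suggests that the fetched HTML has no `<script id="__NEXT_DATA__">` element, so the lookup returns `None` before `.get_text()` is called. Below is a minimal sketch for checking a saved copy of the HTML for that element. It uses only the standard library rather than the BeautifulSoup call shown in the traceback, and the names `NextDataFinder` and `extract_next_data` are hypothetical helpers, not part of recipe-scrapers.

```python
from html.parser import HTMLParser


class NextDataFinder(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several pieces, so collect them all
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the raw __NEXT_DATA__ payload, or None if the tag is absent."""
    if isinstance(html, bytes):
        # e.g. the .content of a requests response; assumes UTF-8
        html = html.decode("utf-8", errors="replace")
    finder = NextDataFinder()
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks) if finder.found else None
```

If `extract_next_data(html)` returns `None` for the HTML that Tandoor fetched, the server most likely returned the loading page, and the `AttributeError` above is the expected consequence.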
Thanks @giannis371 @smilerz.
I was able to scrape the same recipe URL successfully using a standalone checkout of recipe-scrapers v15.0.0 (without Tandoor).
Note: recipe-scrapers doesn't contain any asynchronous request code, as far as I'm aware.
However, I did not set supported_only=False during my repro attempt. I will inspect the code and/or retry with that setting soon.
weird - I get the same error running both
scrape = scrape_html(org_url=url, html=html, supported_only=False)
and
scrape = scrape_html(org_url=url, html=html)
What are you using to collect the html? Tandoor is using:
html = requests.get(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
).content
What are you using to collect the html? Tandoor is using: [ ... snip ... ]
I was using the built-in HTTP headers from recipe-scrapers v15.0.0 (online=True) -- and so I think this is probably a case where the site produces different results depending on the HTTP client / browser in use.
Note / reminder that more-recent versions of recipe-scrapers now declare a custom user-agent header (ref #1219) -- I have not tested that one yet, but will do so as well for comparison.
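One way to test the "different results per HTTP client" theory is to fetch the same URL with different User-Agent headers and check whether the response contains the `__NEXT_DATA__` script. This is only a rough sketch under stated assumptions: `fetch` and `compare_user_agents` are hypothetical helpers, the substring check is deliberately crude, and the header labels are made up for illustration (the Firefox string is the one quoted from Tandoor above).

```python
import urllib.request

# Labels are illustrative; None means "let urllib send its default User-Agent".
USER_AGENTS = {
    "tandoor": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0",
    "python-default": None,
}

# Crude marker: assumes the attribute is written exactly like this in the page.
MARKER = 'id="__NEXT_DATA__"'


def fetch(url, user_agent=None, timeout=10):
    """Fetch a URL and return the body as text, optionally overriding the User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def compare_user_agents(url, user_agents=USER_AGENTS, fetcher=fetch):
    """Map each label to True/False: did that response contain the marker?"""
    return {label: MARKER in fetcher(url, ua) for label, ua in user_agents.items()}


# Example (performs real HTTP requests):
# print(compare_user_agents("https://akispetretzikis.com/en/recipe/1564/pagwto-fistiki"))
```

If the results differ between labels, the User-Agent is a likely factor. If they are identical but still differ between machines, something else (IP address, region, cookies) may be involved, which this sketch does not cover.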
Now I'm really confused because scrape_html(org_url=url, html=None, online=True) generates the exact same error.
I'm not sure what to suggest here; I'm able to access the recipe using both 15.0.0 and 15.1.0 -- so to me it would seem that there is at least one factor outside of this codebase that determines whether the recipe HTML is provided at request-time.
Indeed that's very confusing.
Agreed - I don't think this is software related.
So does that mean we just have to close the issue and remove the website from the scrapers list, or is there any chance of finding a solution?
FWIW, once loaded, the recipe metadata is still in Greek, so even if it worked, I'm not sure you would get what you were looking for.
We don't have to close the issue, but at the moment it seems unclear whether there is a technical solution. What benefits would removing the recipe scraper provide?
No, you can just change the language to English, and the metadata is then also in English. Like I said 5 months ago I added a recipe from this website and it's still working correctly in my Tandoor. akispetretzikis.com/en
Yes, you are right. You said before that it's working fine on your system, so the problem is not the scraper. That means other people can still use this website.
They've obviously made a change to how the website works. I loaded the recipe using the bookmarklet and it's in Greek.
3 months ago I added a recipe to the Tandoor app, but now it's not working anymore. Other websites work. A recipe link: https://akispetretzikis.com/en/recipe/1564/pagwto-fistiki