hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.62k stars, 508 forks

AlbertHeijn's scraper no longer working #990

Closed helmerzNL closed 2 months ago

helmerzNL commented 5 months ago

Pre-filing checks

The URL of the recipe(s) that are not being scraped correctly

...

The results you expect to see

The recipe being scraped successfully. ...

The results (including any Python error messages) that you are seeing

When trying to scrape the recipe, I got this error: 'Looks Like We Couldn't Find Anything' ...

Noxeus commented 5 months ago

It seems that Albert Heijn is blocking robots from accessing their site. A simple test:

>>> requests.get("https://www.ah.nl/allerhande/recept/R-R1198673/eenpanspasta-al-limone-met-hazelnootkruim")
<Response [403]>

I tried adding a useragent header, but it was blocked all the same. Only after adding all the headers my browser would send did I get a status 200.

>>> headers = {"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8","Accept-Language":"nl,en-CA;q=0.8,en-US;q=0.5,en;q=0.3","Accept-Encoding":"gzip, deflate, br","DNT":"1","Upgrade-Insecure-Requests":"1","Sec-Fetch-Dest":"document","Sec-Fetch-Mode":"navigate","Sec-Fetch-Site":"cross-site","Sec-GPC":"1","Pragma":"no-cache","Cache-Control":"no-cache", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0"}
>>> requests.get("https://www.ah.nl/allerhande/recept/R-R1198673/eenpanspasta-al-limone-met-hazelnootkruim", headers=headers)
<Response [200]>

The returned HTML is parseable by recipe_scrapers:

>>> recipe = recipe_scrapers.scrape_html(_.text)  # _ holds the requests response from above
>>> recipe.title()
'Eenpanspasta al limone met hazelnootkruim'
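
The session above can be folded into a small helper. This is only a sketch: `BROWSER_HEADERS`, `fetch_html` and `scrape_with_headers` are names made up here, not part of recipe_scrapers, and the trimmed header set is an assumption about what ah.nl accepts (the session above sent the full browser set). It passes `org_url` to `scrape_html` so that the site-specific scraper can be selected from the URL.

```python
from typing import Dict, Optional

# Values copied from the browser headers in the session above. That this
# trimmed subset is enough for ah.nl is an assumption; add more if needed.
BROWSER_HEADERS: Dict[str, str] = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "nl,en-CA;q=0.8,en-US;q=0.5,en;q=0.3",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) "
        "Gecko/20100101 Firefox/122.0"
    ),
}


def fetch_html(url: str, headers: Optional[Dict[str, str]] = None,
               timeout: float = 10.0) -> str:
    """Download a page with browser-like headers and return its HTML."""
    # Imported here so the header table can be reused without requests installed.
    import requests

    response = requests.get(url, headers=headers or BROWSER_HEADERS,
                            timeout=timeout)
    response.raise_for_status()  # a 403 surfaces as requests.HTTPError
    return response.text


def scrape_with_headers(url: str):
    """Fetch the page ourselves, then let recipe_scrapers parse the HTML."""
    import recipe_scrapers

    html = fetch_html(url)
    return recipe_scrapers.scrape_html(html, org_url=url)
```

Calling `scrape_with_headers(url).title()` with the recipe URL above would then mirror the manual steps, provided the site still answers those headers with a 200.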

helmerzNL commented 5 months ago

Is there a way to add a more 'real' header to the scraper? Or is that difficult (or not preferable) to do?

Noxeus commented 5 months ago

I did some further digging and it seems to need this extra header: {"Accept-Language":"nl"}. I'll see what I can do to edit the ah.nl scraper.
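
For anyone who wants to repeat this kind of digging, one possible approach is to start from the full browser header set and drop headers one at a time, keeping only those whose removal turns the 200 into something else. The sketch below is not necessarily how the header was found here; `find_required_headers` and `fake_probe` are hypothetical names, and `fake_probe` is an offline stand-in that assumes the site only checks for a Dutch `Accept-Language`.

```python
from typing import Callable, Dict

Headers = Dict[str, str]


def find_required_headers(headers: Headers,
                          probe: Callable[[Headers], int]) -> Headers:
    """Greedily remove headers, keeping those the probe cannot do without.

    `probe` takes a header dict and returns an HTTP status code.
    """
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if probe(trial) == 200:
            # Still accepted without this header, so leave it out.
            required = trial
    return required


def fake_probe(headers: Headers) -> int:
    """Offline stand-in for the site (assumed behaviour, not measured)."""
    language = headers.get("Accept-Language", "")
    return 200 if language.startswith("nl") else 403


full = {
    "Accept": "text/html",
    "Accept-Language": "nl,en;q=0.3",
    "User-Agent": "Mozilla/5.0",
}
print(find_required_headers(full, fake_probe))
```

Against the real site, the probe could be something like `lambda h: requests.get(url, headers=h).status_code`. Keep in mind that requests adds a few default headers of its own (such as `User-Agent` and `Accept-Encoding`), so the result describes what is needed on top of those.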

helmerzNL commented 5 months ago

@Noxeus that would be awesome!

Noxeus commented 5 months ago

@helmerzNL if you're in a hurry you can pip install git+https://github.com/Noxeus/recipe-scrapers.git@fix/issue-990/add-headers-for-request

helmerzNL commented 5 months ago

> @helmerzNL if you're in a hurry you can pip install git+https://github.com/Noxeus/recipe-scrapers.git@fix/issue-990/add-headers-for-request

Does this also work from within the Mealie Docker container?

helmerzNL commented 4 months ago

Issue is still there. Any idea when this will be resolved?

helmerzNL commented 3 months ago

So, this issue cannot be resolved? ;-(

jayaddison commented 2 months ago

Thanks for your patience on this @helmerzNL - glad to hear it seems to be resolved, although as far as I know, there weren't any changes in recipe-scrapers itself that fixed it! However: it's good news, and I'll close this bug report. Please re-open if the problem reappears.