hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Fixes #990: ah.nl no longer working (#992)

Closed Noxeus closed 2 months ago

Noxeus commented 5 months ago

Guess I'll just copy the whole __init__ from super then :) Also, for future readers: it seems ah.nl actively blocks some servers. My mealie runs on Oracle and the site won't accept my GET, whatever headers I throw at it. From home/Windows it works fine, but from WSL it doesn't. What more can we do to mask our approach?

helmerzNL commented 5 months ago

Just a question (as a noob), any idea when this PR will be eh...part of the release?

jayaddison commented 5 months ago

@Noxeus can you confirm whether this approach still works for you? I received an 'access denied' result when attempting to retrieve a recipe with this code and HTTP headers.

Noxeus commented 5 months ago
```python
>>> r = scrape_me("https://www.ah.nl/allerhande/recept/R-R1198673/eenpanspasta-al-limone-met-hazelnootkruim")
>>> r.title()
'Eenpanspasta al limone met hazelnootkruim'
```

But like I said here: on WSL it doesn't work (the code is the same). Very strange.

helmerzNL commented 5 months ago

Any progress? :-)

Noxeus commented 5 months ago

> Any progress? :-)

For me, it still works... Like I said, I can't figure out the differences between my Windows Python (works) and WSL2 Python (doesn't)

helmerzNL commented 5 months ago

> Any progress? :-)
>
> For me, it still works... Like I said, I can't figure out the differences between my Windows Python (works) and WSL2 Python (doesn't)

But the check failed, so it won't be merged into the branch?

helmerzNL commented 4 months ago

Any updates on when this will be committed? :-) @jayaddison?

helmerzNL commented 4 months ago

Any progress?

Noxeus commented 4 months ago

I'm going to be totally honest: I don't think I can fix this. I tried every angle that I know, but the server just blocks my requests on some platforms and on some it doesn't, even though I use the same IP and headers. So for example, I can use

```shell
curl 'https://www.ah.nl/allerhande/recept/R-R1197770/koffie-panna-cotta-met-koffiegelei-en-chocola' --compressed \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:123.0) Gecko/20100101 Firefox/123.0' \
  -H 'Accept-Language: nl'
```

on my Mac, but in my mealie Docker shell it fails with a 403 status (again, using the same IP).

I can't fix it :(

helmerzNL commented 4 months ago

That is indeed strange. I tested it myself, and from the unraid terminal it works, but from inside the Docker container it fails :-(

Not sure if it does work with Tandoor btw, can't get curl installed in that container :-(

tverlaan commented 3 months ago

There are some clues here on how to fix it: https://github.com/mealie-recipes/mealie/pull/3384. I've worked around the issue for now using a simple script.

cathelijne commented 3 months ago

Leaving this here because I see I'm not the only one with this issue. I managed to get it to work, somewhat, by hacking together a few bits and bobs. The thing is, I can easily grab the HTML with a curl request from my Mac and import that into mealie, but not being able to easily add recipes from Appie is devastating for the adoption rate of mealie in my family: 67% has opted out of using it, and we can't have that.

After some testing, I found out that most HTTP clients run from Linux servers are blocked (curl, Python requests, even eLinks...). They run fine from my Mac. The issue is with TLS fingerprinting, as mentioned in the corresponding Mealie issue 2888. Suggested there is the use of JA3Proxy, which I couldn't get to work. Somewhere in that repo's issues, someone mentioned having built a PoC which handles things a bit differently (I can't find that discussion anymore - sorry): https://github.com/rosahaj/tlsproxy

Using that proxy, I can scrape recipes from Albert Heijn.

Getting it to work with mealie was a bit of a thing, because the mealie request to the scraper never hits the `if html is None` branch of the AlbertHeijn class. So I fiddled around a bit and came up with something that works. It's hacky, ugly, and not a real solution at all, but if you are willing to experiment a little, it might work. Use at your own risk.

```python
import re
from typing import Dict, Optional, Tuple, Union

from requests import Session

from ._abstract import AbstractScraper
from ._utils import normalize_string

# The User-Agent string in the headers should match the one you're running tlsproxy with.
HEADERS = {
    "Accept-Language": "nl",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:55.0) Gecko/20100101 Firefox/55.0",
}

# Running on port 3128, because 8080 is already taken by mealie.
PROXIES = {
    "https": "http://localhost:3128",
}


class AlbertHeijn(AbstractScraper):
    def __init__(
        self,
        url: str,
        proxies: Optional[Dict[str, str]] = None,  # optional proxy server
        timeout: Optional[Union[float, Tuple[float, float], Tuple[float, None]]] = None,
        wild_mode: Optional[bool] = False,
        html=None,
    ):
        with Session() as session:
            session.proxies.update(proxies or PROXIES)
            session.headers.update(HEADERS)
            # tlsproxy creates certs at startup. These obviously aren't trusted
            # by the mealie container by default. Since this is a personal
            # project, I've not bothered to add them to my cert store, and run
            # requests to the proxy unverified.
            session.verify = False
            html = session.get(url, timeout=timeout).content

        # Because the html content is provided, the parent will not re-query the page.
        super().__init__(url, proxies, timeout, wild_mode, html)

    @classmethod
    def host(cls):
        return "ah.nl"
```
The rest of the file is unchanged. Because of the way it's accessed, mealie will first try to contact Albert Heijn directly, and only then make the request through the proxy. You'll see that in your mealie logs.

I have added the tlsproxy container to my mealie pod, and am accessing it from mealie on localhost.

Upon startup of the pod, I overwrite the albertheijn.py file in the container with my edited version that lives in the mealie data volume before starting the python app (I did say it was all very hacky...).

To sum it up:

bilhert commented 3 months ago

You can use curl_cffi instead of httpx or regular curl; it supports mimicking a browser's TLS fingerprint. Then you do not need a proxy.

I already built a fix for mealie where this is put into action, but it conflicted with recent security changes, which made the fix less of a drop-in replacement, so it did not make it into the main branch.
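As a rough sketch of that idea (this assumes the third-party curl_cffi package is installed; the function name and the impersonation target are my own illustration, not the actual mealie fix):

```python
# Sketch: fetch a page while presenting a real browser's TLS fingerprint,
# so the request is not blocked at the TLS layer the way plain requests/curl is.
# Requires the third-party curl_cffi package (pip install curl_cffi).
try:
    from curl_cffi import requests  # near drop-in replacement for `requests`
except ImportError:  # keep the sketch importable even without the package
    requests = None


def fetch_recipe_html(url: str) -> str:
    """Fetch `url` while impersonating a Chrome TLS/JA3 fingerprint."""
    if requests is None:
        raise RuntimeError("curl_cffi is not installed")
    response = requests.get(
        url,
        impersonate="chrome",  # mimic Chrome's TLS handshake
        headers={"Accept-Language": "nl"},
        timeout=10,
    )
    response.raise_for_status()
    return response.text
```

In environments where plain `requests` gets a 403 from ah.nl, a call like `fetch_recipe_html("https://www.ah.nl/allerhande/recept/R-R1198673/eenpanspasta-al-limone-met-hazelnootkruim")` should return the page HTML, since the impersonation addresses the TLS fingerprint rather than the HTTP headers.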

cathelijne commented 2 months ago

> You can use curl_cffi instead of httpx or regular curl; it supports mimicking a browser's TLS fingerprint. Then you do not need a proxy.
>
> I already built a fix for mealie where this is put into action, but it conflicted with recent security changes, which made the fix less of a drop-in replacement, so it did not make it into the main branch.

Ah yes, thank you! I remember seeing that conversation and couldn't find it again. I'll look into curl_cffi; that sounds promising, although I'm more ops than dev, so I'm more of a scripter than a programmer really.

mobiledude commented 2 months ago

Running today's nightly suddenly made it possible to scrape from the Allerhande.nl (AH) website :-). AH app links (shortlinks) via the app are not working. Really nice this is working!!!

Version I am running now: (screenshot, 2024-04-30)

https://github.com/mealie-recipes/mealie/commit/c23660007eb3acc17a323d707b4abea9953b7a19 https://github.com/hhursev/recipe-scrapers/releases/tag/14.56.0

helmerzNL commented 2 months ago

> Running today's nightly suddenly made it possible to scrape from the Allerhande.nl (AH) website :-). AH app links (shortlinks) via the app are not working. Really nice this is working!!!

That sounds great! It also works here. Finally!

jayaddison commented 2 months ago

Given that the problem seems to have been resolved, I'm going to close this. I don't think we made any changes in the library to fix it, so I wouldn't be entirely surprised if it reappears... but let's hope not. Thanks, all, for your time investigating and developing potential solutions.