Closed Noxeus closed 2 months ago
Just a question (as a noob), any idea when this PR will be eh...part of the release?
@Noxeus can you confirm whether this approach still works for you? I received an 'access denied' result when attempting to retrieve a recipe with this code and HTTP headers.
r = scrape_me("https://www.ah.nl/allerhande/recept/R-R1198673/eenpanspasta-al-limone-met-hazelnootkruim")
r.title()
'Eenpanspasta al limone met hazelnootkruim'
But like I said here: on WSL it doesn't work (the code is the same). Very strange.
Any progress? :-)
For me, it still works... Like I said, I can't figure out the differences between my Windows Python (works) and WSL2 Python (doesn't)
Any progress? :-)
But the check failed, so it will not be committed to the branch?
Any updates on when this will be committed? :-) @jayaddison ?
Any progress?
I'm going to be totally honest: I don't think I can fix this. I tried every angle that I know, but the server just blocks my requests on some platforms and on some it doesn't, even though I use the same IP and headers. So for example, I can use
curl 'https://www.ah.nl/allerhande/recept/R-R1197770/koffie-panna-cotta-met-koffiegelei-en-chocola' --compressed -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:123.0) Gecko/20100101 Firefox/123.0' -H 'Accept-Language: nl'
on Mac
but in my mealie docker bash it fails with a 403 status (again, same IP).
I can't fix it :(
That is indeed strange. Tested it myself, and from the unraid terminal it works, but from inside the docker it fails :-(
Not sure if it does work with Tandoor btw, can't get curl installed in that container :-(
There are some clues here on how to fix it: https://github.com/mealie-recipes/mealie/pull/3384. I've worked around the issue for now using a simple script.
Leaving this here because I see I'm not the only one with this issue. I managed to get it to work, somewhat, by hacking together a few bits and bobs. The thing is, I can easily grab the HTML with a curl request from my Mac and import that into mealie, but not being able to easily add recipes from Appie is devastating for the adoption rate of mealie in my family: 67% have opted out of using it, and we can't have that.
After some testing, I found out that most HTTP clients run from Linux servers are blocked (curl, python requests, even ELinks...). They run fine from my Mac. The issue is TLS fingerprinting, as mentioned in the corresponding Mealie issue 2888. Suggested there is the use of JA3Proxy, which I couldn't get to work. Somewhere in that repo's issues, someone mentioned having built a PoC which handles things a bit differently (I can't find that discussion anymore, sorry): https://github.com/rosahaj/tlsproxy
Using that, I can scrape recipes from Albert Heijn.
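As background on why this blocking is even possible (not from the thread, just an illustration): a JA3 fingerprint is nothing more than an MD5 hash over fields of the TLS ClientHello, so two clients with the same IP and HTTP headers but different TLS stacks hash differently. A toy sketch, with made-up cipher/extension codes:

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style fingerprint: MD5 over ClientHello fields.

    Each list argument holds the integer codes offered in the ClientHello;
    lists are joined with '-' and the five fields with ','.
    """
    parts = [str(tls_version)] + [
        "-".join(str(code) for code in field)
        for field in (ciphers, extensions, curves, point_formats)
    ]
    return hashlib.md5(",".join(parts).encode()).hexdigest()


# Two clients offering the same ciphers in a different order already produce
# different fingerprints, which is enough for a server to block one of them:
curl_like = ja3_fingerprint(771, [4866, 4865], [0, 11, 10], [29, 23], [0])
browser_like = ja3_fingerprint(771, [4865, 4866], [0, 11, 10], [29, 23], [0])
print(curl_like != browser_like)  # prints True
```

This is why changing User-Agent headers alone never helps: the fingerprint is computed below the HTTP layer.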
Getting it to work with mealie was a bit of a thing, because the mealie request to the scraper never hits the `if html is None` branch in the AlbertHeijn class. So I fiddled around a bit and came up with something that works. It's hacky, ugly, and not a real solution at all, but if you are willing to experiment a little, it might work. Use at your own risk.
import re
from typing import Dict, Optional, Tuple, Union

from requests import Session

from ._abstract import AbstractScraper
from ._utils import normalize_string

# The User-Agent string in the headers should match the one you're running tlsproxy with.
HEADERS = {
    'Accept-Language': 'nl',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:55.0) Gecko/20100101 Firefox/55.0',
}

# Running the proxy on port 3128, because 8080 is already taken by mealie.
proxies = {
    'https': 'http://localhost:3128',
}

# tlsproxy creates certs at startup. These obviously aren't trusted by the mealie
# container by default. Since this is a personal project, I've not bothered to add
# them to my cert store, and run requests to the proxy unverified (session.verify = False).


class AlbertHeijn(AbstractScraper):
    def __init__(
        self,
        url: str,
        proxies: Optional[Dict[str, str]] = proxies,  # allows us to specify an optional proxy server
        timeout: Optional[Union[float, Tuple[float, float], Tuple[float, None]]] = None,  # optional request timeout
        wild_mode: Optional[bool] = False,
        html=None,
    ):
        with Session() as session:
            session.proxies.update(proxies or {})
            session.headers.update(HEADERS)
            session.verify = False  # don't verify the proxy's self-signed certs
            session.get(url, timeout=timeout)
            html = session.get(url, timeout=timeout).content  # reload the page
        # As the html content is provided, the parent will not query the page.
        super().__init__(url, proxies, timeout, wild_mode, html)

    @classmethod
    def host(cls):
        return "ah.nl"
The rest of the file is unchanged. Because of the way it's accessed, it will first try and contact Albert Heijn, and only then make the request through the proxy. You'll see that in your mealie logs.
I have added the tlsproxy container to my mealie pod, and am accessing it from mealie on localhost.
Upon startup of the pod, I overwrite the albertheijn.py file in the container with my edited version that lives in the mealie data volume before starting the python app (I did say it was all very hacky...).
To sum it up:
you can use curl_cffi instead of httpx or regular curl, which supports mimicking a browser's TLS fingerprint. Then you do not need a proxy.
I already built a fix for mealie where this is put into action, but it conflicted with recent security changes, which made the fix less of a drop-in replacement, so it did not make it into the main branch.
Ah yes, thank you! I remember seeing that conversation and was looking for it and couldn't find it again. I'll look into curl_cffi, that sounds promising, although I'm more ops than dev, so I'm more of a scripter than a programmer really.
Running today's nightly suddenly made it possible to scrape from the Allerhande.nl (AH) website :-). AH app links (shortlinks) via the app are not working. Really nice that this is working!!!
version I am running now:
https://github.com/mealie-recipes/mealie/commit/c23660007eb3acc17a323d707b4abea9953b7a19 https://github.com/hhursev/recipe-scrapers/releases/tag/14.56.0
That sounds great! It also works here. Finally!
Given that the problem seems to have been resolved, I'm going to close this. I don't think we made any changes in the library to fix it, so I wouldn't be entirely surprised if it reappears... but let's hope not. Thanks, all, for your time investigating and developing potential solutions.
Guess I'll just copy the whole __init__ from super then :) Also, for future readers: it seems ah.nl actively blocks some servers. My mealie runs on Oracle and won't accept my GET, whatever headers I throw at it. From home/Windows it works fine. But then from WSL it doesn't work. What more can we do to mask our approach?