Closed ataylor32 closed 1 year ago
@ataylor32 Thanks for the suggestion. Default headers are now implemented, but your example still hits a TimeoutError, because TikTok uses HTTP/2, which is not supported by Requests, unless you set the x-requested-with: XMLHttpRequest header.
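As a rough illustration of the "default headers" idea mentioned above, here is a minimal sketch of merging a set of default headers with caller-supplied overrides. The names `DEFAULT_HEADERS` and `merge_headers` are hypothetical and are not taken from linkpreview; the header values are the ones mentioned in this thread.

```python
# Hypothetical defaults; linkpreview's real defaults may differ.
DEFAULT_HEADERS = {
    "user-agent": "Mozilla/5.0",
    "x-requested-with": "XMLHttpRequest",
}


def merge_headers(overrides=None):
    """Return a copy of the defaults with caller overrides applied.

    Header names are lower-cased so that "User-Agent" and "user-agent"
    refer to the same entry.
    """
    merged = dict(DEFAULT_HEADERS)
    for name, value in (overrides or {}).items():
        merged[name.lower()] = value
    return merged
```

A caller could then pass the result of `merge_headers({"User-Agent": "my-bot/1.0"})` as the headers of whatever HTTP client is in use.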
Thank you! I just ran my example script (the one with the TikTok URL) using linkpreview 0.6.0 and it worked. I'm not sure why you got a TimeoutError and I didn't.
Am I misunderstanding something here? I get …
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.gak.co.uk/blog/5-tips-to-master-the-roland-s-1-tweak-synth/#three
I switched to using …
grabber = LinkGrabber(
    initial_timeout=20,
    maxsize=1048576,
    receive_timeout=10,
    chunk_size=1024,
)
content, URL = grabber.get_content(URL)
link = Link(URL, content)
preview = LinkPreview(link, parser="lxml")
print("title:", Fore.GREEN + preview.title + Fore.WHITE)
I can see that a header is being used by default (?) in grabber.py, yet I still get the error. Am I missing something?
@Michael-Z-Freeman
> Yet I still get the error. Am I missing something ?

No, the 403 comes from Cloudflare.
v0.9.0 is now released with better headers support. In your case, use this:
content, URL = grabber.get_content(URL, headers="imessagebot")
OK, thanks. However, I found that headers alone do not solve 403s. I ended up using Microsoft Playwright to do the grabber part and it works great! See https://github.com/Michael-Z-Freeman/word-link-preview
Hello! This may be a separate issue/question, but is it possible to parse something behind Cloudflare?
This is an example of such a URL.
$ curl -I https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdev.14129
HTTP/2 403
...
server: cloudflare
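The same check as the curl output above can be sketched in Python: a 403 status combined with a `server: cloudflare` response header suggests the block comes from Cloudflare rather than from the origin site. The function name `looks_like_cloudflare_block` is hypothetical, and this is only a heuristic under that assumption, not a reliable detector.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristic: 403 status plus a Server header mentioning Cloudflare.

    `headers` is any mapping of response header names to values; names
    are compared case-insensitively.
    """
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value.lower()
    return status == 403 and "cloudflare" in server
```

For the response shown above, `looks_like_cloudflare_block(403, {"server": "cloudflare"})` returns `True`, which would be a hint to try a browser-based approach instead of tweaking headers.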
@pothitos Hi, maybe you should try something like https://github.com/FlareSolverr/FlareSolverr.
This is the test script I'm using:
When I run it, it sits there for about 20 seconds and then raises a TimeoutError. But if you were to provide a User-Agent header such as Mozilla/5.0 by default, then running the script would output the following:

I realize that I can do this myself by following the "Advanced" section of the README, but I think this would be a good thing to have built into linkpreview, since there are a lot of sites set up to reject requests that have a User-Agent header with a value like python-requests/2.28.1.
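For illustration, here is a minimal sketch of sending a browser-like User-Agent using only the Python standard library. It is not linkpreview's implementation; `build_request` and `fetch` are hypothetical helper names, and the 20-second timeout simply mirrors the delay described above.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=20):
    """Fetch the page body as bytes; raises on timeout or on an HTTP error."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Some sites will still reject such requests (for example when Cloudflare is in front of them), so a custom User-Agent is a first thing to try rather than a guaranteed fix.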