iJohnMaged opened this issue 3 years ago
Getting the same error on Windows.
Two days ago it was fine, but now I'm getting the same error.
thanks
There's a new push. Follow this: #1061
I just reinstalled twint.
I already have this update.
Hello, I did the following:
$ pip3 uninstall twint
$ pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
It worked for me
Hi, I'd like to update that I solved this issue.
The problem was in Twitter not sending a token to any AWS IP, I had to manually implement a proxy in token.py for it and then it worked like a charm! :)
Thanks a lot! After that, does it just need reinstalling?
@iJohnMaged Can you please elaborate on how to set the proxy in token.py? I am also facing the same issue on an AWS server.
@iJohnMaged I guess this could use some elaboration.
@dipenpatel235 @innocentius Sorry for the late reply! I'll post an explanation today.
@iJohnMaged @dipenpatel235 Currently the program is designed to NOT use a proxy to retrieve the guest token. This means that if we are scraping at large scale our IP will be exposed rather quickly. Yeah, quite a good thought there to implement a proxy. I just don't see a good way to do it for now...
Hello all, @innocentius @dipenpatel235 I'm very sorry for the delay, I had some health issues and I couldn't update the thread.
What I did in token.py was adding the following to the __init__ method of the Token class:
self.proxies = {
    "http": "YOUR_HTTP_PROXY",
    "https": "YOUR_HTTPS_PROXY",
}
and in the request method, I changed the request call to this:
r = self._session.send(
    req,
    allow_redirects=True,
    timeout=self._timeout,
    proxies=self.proxies,
    verify=False,
)
This will make use of your proxies to request tokens, which is the only way I found to work on AWS instances (still working as of now).
Note that this will show a warning about unverified requests; you can disable it by adding the following to the top of the file:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
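Putting the pieces above together, here is a minimal sketch of what the patched Token class could look like. The _session, _timeout, and proxy placeholders are taken from the snippets in this thread; the rest of the class is abbreviated, so treat this as an illustration rather than twint's actual token.py:

```python
import requests
import urllib3

# Suppress the warning triggered by verify=False on every request
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

class Token:
    def __init__(self):
        self._session = requests.Session()
        self._timeout = 10
        # Route token requests through a proxy so Twitter sees the
        # proxy's IP instead of the (blocked) AWS instance IP.
        self.proxies = {
            "http": "YOUR_HTTP_PROXY",
            "https": "YOUR_HTTPS_PROXY",
        }

    def request(self, url="https://twitter.com"):
        req = self._session.prepare_request(requests.Request("GET", url))
        r = self._session.send(
            req,
            allow_redirects=True,
            timeout=self._timeout,
            proxies=self.proxies,
            verify=False,  # the proxy may intercept TLS; skip cert checks
        )
        return r
```

Replace the placeholder proxy URLs with your own before running; without a working proxy the request will fail.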
@iJohnMaged I hope the best for your health. I took the liberty to implement my own methods in async ways already, but your code seems much more efficient.
I think the problem with TWINT is that it asks for a fresh guest token for every query run, which is quite expensive if you are trying to, say, get tweet data from 100,000 different users. However, if we don't ask for tokens frequently, Twitter could easily detect scraping attempts based on the token.
Would there be a middle ground solution about this?
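One possible middle ground, sketched below, would be to cache the guest token and refresh it only after a TTL or when Twitter rejects it. This is my own illustration, not anything in twint itself; the fetch_token callable and the CachedToken wrapper are hypothetical, and the 3-hour TTL is taken from the Max-Age=10800 cookie seen later in this thread:

```python
import time

class CachedToken:
    """Hypothetical wrapper: reuse one guest token across queries,
    refreshing only when it expires or is rejected."""

    def __init__(self, fetch_token, ttl_seconds=3 * 60 * 60):
        self._fetch_token = fetch_token  # callable that fetches a fresh token
        self._ttl = ttl_seconds          # guest tokens carry Max-Age=10800 (~3h)
        self._token = None
        self._fetched_at = 0.0

    def get(self):
        # Refresh only when there is no token or the TTL has elapsed
        if self._token is None or time.time() - self._fetched_at > self._ttl:
            self._token = self._fetch_token()
            self._fetched_at = time.time()
        return self._token

    def invalidate(self):
        # Call this when Twitter rejects the cached token (e.g. HTTP 403)
        self._token = None
```

This trades fewer token requests against the detection risk mentioned above, so the TTL would need tuning against how aggressively Twitter invalidates tokens.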
I haven't looked much into that, but for my use case, I was scraping a year's worth of mentions for about ~1000 different users by handle and name, and I didn't run into any issues using a rotating proxy. With that many requests, though, I think it's better to refresh the token regardless..
Also, in my scraping script I'm using multi-threading to scrape 40 accounts at a time through my rotating proxy.
Yeah, we are definitely using a similar method of scraping. I use rotating proxies to scrape more than 100,000 accounts for all historical data... Twitter stops sending tokens to my IP after about an hour (if I don't use a proxy for the token). So I predict that if you use a small enough proxy pool to get the token, you will eventually run into the same issue.
@innocentius @iJohnMaged I'm facing similar issues on AWS. I've been using Torpy based on @himanshudabas's branch (twint-fixes) but it's unstable, as sometimes it fails to get the Tor session.
Can you elaborate more on your rotating proxies setup? Where do you get the proxy pool from?
Thanks!
@karabi You can use one of the proxy-pool repos on GitHub and try the free proxies. Although it is rumored that only about 10% of free proxies are usable, so I wouldn't count on them. I'm using paid proxy services, and it is not cheap tbh.
Thanks @innocentius , yeah I also read that free ones are unreliable. Do you have any recommendations on a paid one?
@karabi There are many services available. I personally use the datacenter rotating-IP proxies from luminati.io; it is comparably cheap (0.6 USD/GB), and datacenter IPs are good enough to get past Twitter's defenses. Many other options are also available, so browse and compare.
@innocentius thanks a lot, I will try that
@karabi I'm using crawlera, it's not free, but it works great for my use-case.
Issue Template
Initial Check
pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
Command Ran
twint -s test
Description of Issue
Getting
Could not find the Guest token in HTML
whenever I run the command under Linux; however, it works perfectly on Windows with the same Python version and requirements installed. I printed the response in token.py and it's identical on both, except that the last script tag
</script><script nonce="xyz">document.cookie = decodeURIComponent("gt=xyz; Max-Age=10800; Domain=.twitter.com; Path=/; Secure");</script>
is missing on Linux.
Environment Details
Windows 10 and Debian 10
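For context, the "Could not find the Guest token in HTML" error is raised when no gt= cookie can be found in that inline script. A minimal sketch of that kind of extraction, using sample HTML rather than a real Twitter response:

```python
import re

# Sample of the inline script Twitter embeds when it grants a guest token
html = (
    '<script nonce="xyz">document.cookie = decodeURIComponent('
    '"gt=1234567890; Max-Age=10800; Domain=.twitter.com; Path=/; Secure");'
    '</script>'
)

# Look for the gt= cookie value; if the script tag is missing
# (as on the Linux box above), this search finds nothing.
match = re.search(r"gt=(\d+)", html)
token = match.group(1) if match else None
```

When Twitter withholds the script tag, match is None and the scraper has no token to use, which matches the behavior reported above.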