twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

Getting `Could not find the Guest token in HTML` under Linux. #1084

Open iJohnMaged opened 3 years ago

iJohnMaged commented 3 years ago

Issue Template

Please use this template!

Initial Check

If the issue is a request please specify that it is a request in the title (Example: [REQUEST] more features). If this is a question regarding 'twint' please specify that it's a question in the title (Example: [QUESTION] What is x?). Please only submit issues related to 'twint'. Thanks.

Make sure you've checked the following:

Command Ran

Please provide the exact command ran including the username/search/code so I may reproduce the issue.

twint -s test

Description of Issue

Please use as much detail as possible.

Getting `Could not find the Guest token in HTML` whenever I run the command under Linux, but it works perfectly on Windows with the same Python version and the same requirements installed. I printed the response in token.py, and it's identical on both systems except that the last script tag, `</script><script nonce="xyz">document.cookie = decodeURIComponent("gt=xyz; Max-Age=10800; Domain=.twitter.com; Path=/; Secure");</script>`, is missing on Linux.
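
For what it's worth, a quick way to compare the two machines is to fetch the page directly and check whether the gt cookie appears in the HTML at all. This is a minimal standalone sketch, not twint's token.py; the URL and the regex are assumptions based on the script tag quoted above:

# Quick check: does twitter.com include the gt cookie script tag for this machine/IP?
import re
import requests

html = requests.get(
    "https://twitter.com/",
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
).text
match = re.search(r'gt=(\d+)', html)
print("guest token:", match.group(1) if match else "missing from HTML")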

Environment Details

Using Windows, Linux? What OS version? Running this in Anaconda? Jupyter Notebook? Terminal?

Windows 10 and Debian 10

senanabs commented 3 years ago

Getting the same error on Windows.

mnwato commented 3 years ago

Two days ago it was fine, but now I'm getting the same error.

Thanks.

senanabs commented 3 years ago

There's a new push. Follow this: #1061

I just reinstalled twint.

iJohnMaged commented 3 years ago

> There's a new push. Follow this: #1061
>
> I just reinstalled twint.

I already have this update.

alexfrancow commented 3 years ago

Hello, I did the following:

$ pip3 uninstall twint
$ pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

It worked for me

iJohnMaged commented 3 years ago

Hi, I'd like to update that I solved this issue.

The problem was that Twitter was not sending a guest token to any AWS IP. I had to manually implement a proxy in token.py, and then it worked like a charm! :)

mnwato commented 3 years ago

> Hi, I'd like to update that I solved this issue.
>
> The problem was that Twitter was not sending a guest token to any AWS IP. I had to manually implement a proxy in token.py, and then it worked like a charm! :)

Thanks a lot. After that, does it just need reinstalling?

dipenpatel235 commented 3 years ago

@iJohnMaged Can you please elaborate on how to set the proxy in token.py? I am also facing the same issue on an AWS server.

innocentius commented 3 years ago

@iJohnMaged I guess this could use some elaboration.

iJohnMaged commented 3 years ago

@dipenpatel235 @innocentius Sorry for the late reply! I'll post an explanation today.

innocentius commented 3 years ago

@iJohnMaged @dipenpatel235 Currently the program is designed NOT to use a proxy when retrieving the guest token. This means that if we are scraping at a large scale, our IP will be exposed rather quickly. Yeah, implementing a proxy there is a good idea; I just haven't found a good way to do it for now...

iJohnMaged commented 3 years ago

Hello all, @innocentius @dipenpatel235 I'm very sorry for the delay, I had some health issues and I couldn't update the thread.

What I did in token.py was add this to the __init__ method of the Token class:

self.proxies = {
    "http": "YOUR_HTTP_PROXY",
    "https": "YOUR_HTTPS_PROXY",
}

and in the request method, I changed the request sent to this:

r = self._session.send(
    req,
    allow_redirects=True,
    timeout=self._timeout,
    proxies=self.proxies,
    verify=False,
)

This will make use of your proxies to request tokens, which is the only way I found to make it work on AWS instances (still working as of now).

Note that this will trigger a warning about unverified requests; you can disable it by adding the following to the top of the file:

import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
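
To sanity-check a proxy before patching token.py, the same pattern can be reproduced standalone with requests. This is a rough sketch, not the actual token.py code; the proxy URL is a placeholder and the gt regex is an assumption based on the cookie string quoted earlier in this thread:

# Standalone sanity check: can this proxy fetch a guest token from twitter.com?
import re
import urllib3
import requests

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",   # placeholder
    "https": "http://user:pass@proxy.example.com:8080",  # placeholder
}

session = requests.Session()
req = session.prepare_request(requests.Request("GET", "https://twitter.com/"))
r = session.send(req, allow_redirects=True, timeout=10, proxies=proxies, verify=False)

match = re.search(r'gt=(\d+)', r.text)
print("guest token:", match.group(1) if match else "not found")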

innocentius commented 3 years ago

@iJohnMaged I hope the best for your health. I took the liberty of implementing my own methods in an async way already, but your code seems much more efficient.

I think the problem with TWINT is that it asks for a fresh guest token for every query run, which is quite expensive if you are trying to, say, get tweet data from 100,000 different users. However, if we are not frequently asking for tokens, Twitter could easily detect scraping attempts based on the token.

Would there be a middle-ground solution for this?
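
One shape such a middle ground could take is caching the guest token and only refreshing it after a fixed number of queries, or whenever Twitter rejects it. A rough standalone sketch of that idea (not twint code; the refresh threshold, the requests-based fetch, and the regex are all assumptions):

import re
import requests

class GuestTokenCache:
    """Reuse one guest token for up to max_uses queries, then refresh."""

    def __init__(self, max_uses=100):
        self.max_uses = max_uses  # assumption: tune to your own query rate
        self._token = None
        self._uses = 0

    def _fetch(self):
        # Assumption: the token still appears as gt=<digits> in the landing page HTML.
        html = requests.get(
            "https://twitter.com/",
            headers={"User-Agent": "Mozilla/5.0"},
            timeout=10,
        ).text
        match = re.search(r'gt=(\d+)', html)
        if not match:
            raise RuntimeError("Could not find the Guest token in HTML")
        return match.group(1)

    def get(self):
        if self._token is None or self._uses >= self.max_uses:
            self._token = self._fetch()
            self._uses = 0
        self._uses += 1
        return self._token

    def invalidate(self):
        # Call this when Twitter answers 401/403/429 so the next get() refreshes.
        self._token = None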

iJohnMaged commented 3 years ago

> @iJohnMaged I hope the best for your health. I took the liberty of implementing my own methods in an async way already, but your code seems much more efficient.
>
> I think the problem with TWINT is that it asks for a fresh guest token for every query run, which is quite expensive if you are trying to, say, get tweet data from 100,000 different users. However, if we are not frequently asking for tokens, Twitter could easily detect scraping attempts based on the token.
>
> Would there be a middle-ground solution for this?

I haven't looked much into that, but for my use case I was scraping a year's worth of mentions for about ~1,000 different users by their handles and names, and I didn't run into any issues using a rotating proxy. However, with that many requests, I think it's better to refresh the token regardless.

Also, in my scraping script I'm using multi-threading to scrape 40 accounts at a time through my rotating proxy.
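
For anyone wanting to reproduce that kind of setup, the pattern is roughly a thread pool in which each worker builds its own twint config pointed at the proxy. A rough sketch under assumptions: the proxy host/port are placeholders, the date range is an example, and each worker thread is given its own asyncio event loop because twint runs on asyncio:

import asyncio
from concurrent.futures import ThreadPoolExecutor

import twint

PROXY_HOST = "proxy.example.com"  # placeholder rotating-proxy endpoint
PROXY_PORT = 8080                 # placeholder

def scrape_mentions(handle):
    # twint drives an asyncio loop internally; give each thread its own loop.
    asyncio.set_event_loop(asyncio.new_event_loop())
    c = twint.Config()
    c.Search = f"@{handle}"   # mentions of the account
    c.Since = "2020-01-01"    # example one-year window
    c.Until = "2021-01-01"
    c.Proxy_host = PROXY_HOST
    c.Proxy_port = PROXY_PORT
    c.Proxy_type = "http"
    c.Hide_output = True
    c.Store_csv = True
    c.Output = f"{handle}_mentions.csv"
    twint.run.Search(c)

handles = ["user_one", "user_two", "user_three"]  # your list of handles
with ThreadPoolExecutor(max_workers=40) as pool:
    list(pool.map(scrape_mentions, handles))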

innocentius commented 3 years ago

> @iJohnMaged I hope the best for your health. I took the liberty of implementing my own methods in an async way already, but your code seems much more efficient. I think the problem with TWINT is that it asks for a fresh guest token for every query run, which is quite expensive if you are trying to, say, get tweet data from 100,000 different users. However, if we are not frequently asking for tokens, Twitter could easily detect scraping attempts based on the token. Would there be a middle-ground solution for this?
>
> I haven't looked much into that, but for my use case I was scraping a year's worth of mentions for about ~1,000 different users by their handles and names, and I didn't run into any issues using a rotating proxy. However, with that many requests, I think it's better to refresh the token regardless.
>
> Also, in my scraping script I'm using multi-threading to scrape 40 accounts at a time through my rotating proxy.

Yeah, we are definitely using a similar scraping method. I use rotating proxies to scrape more than 100,000 accounts for all of their historical data... Twitter stops sending tokens to my IP after about an hour (if I don't use a proxy for the token). So I predict that if you use a small enough proxy pool to get the token, you will eventually run into the same issue.

rgb-panda commented 3 years ago

@innocentius @iJohnMaged I'm facing similar issues on AWS. I've been using Torpy based on @himanshudabas's branch (twint-fixes) but it's unstable, as sometimes it fails to get the Tor session.

Can you elaborate more on your rotating proxies setup? Where do you get the proxy pool from?

Thanks!

innocentius commented 3 years ago

@karabi You can use one of the proxy pool repos on GitHub and try the free proxies, although it is rumored that only about 10% of free proxies are usable, so I wouldn't count on them. I'm using paid proxy services, and it is not cheap, to be honest.

rgb-panda commented 3 years ago

Thanks @innocentius, yeah, I also read that the free ones are unreliable. Do you have any recommendations for a paid one?

innocentius commented 3 years ago

@karabi There are many services available. I personally use the datacenter rotating-IP proxies from luminati.io; they are comparatively cheap (0.6 USD/GB), and datacenter IPs are good enough to get past Twitter's defenses. Many other options are also available, so browse around and compare.

rgb-panda commented 3 years ago

@innocentius thanks a lot, I will try that

iJohnMaged commented 3 years ago

@karabi I'm using Crawlera; it's not free, but it works great for my use case.