twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

Twitter gives 503 Error for large queries #141

Closed · Nestor75 closed this 6 years ago

Nestor75 commented 6 years ago


Command Ran

import twint
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql1_en"

c.Search = "elonmusk%20OR%20testla%20OR%20electric%20OR%20car%20OR%20apple%20OR%20volvo%20OR%20uber"

c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c) 

It only returns 25831 tweets, and only from these non-consecutive days: 2017-10-27, 2017-10-26 and 2017-09-27.

It makes me think there is a problem with how the system handles the dates.

If I use the same parameters but change the since and until values to c.Since = "2017-10-23" and c.Until = "2017-10-24",

then I get 81557 tweets from 2017-10-24 and 2017-10-23. Why didn't these appear in the previous search? That time frame is included in the previous one.

And to make it weirder... if I use the first time frame but search for something uncommon:

import twint
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql3_en"
c.Search = "gimeno"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c) 

I get 670 tweets from all these days: '2017-08-28' '2017-08-29' '2017-08-30' '2017-08-31' '2017-09-01' '2017-09-02' '2017-09-03' '2017-09-04' '2017-09-05' '2017-09-06' '2017-09-07' '2017-09-08' '2017-09-09' '2017-09-10' '2017-09-11' '2017-09-12' '2017-09-13' '2017-09-14' '2017-09-15' '2017-09-16' '2017-09-17' '2017-09-18' '2017-09-19' '2017-09-20' '2017-09-21' '2017-09-22' '2017-09-23' '2017-09-24' '2017-09-25' '2017-09-26' '2017-09-27' '2017-09-28' '2017-09-29' '2017-09-30' '2017-10-01' '2017-10-02' '2017-10-03' '2017-10-04' '2017-10-05' '2017-10-06' '2017-10-07' '2017-10-08' '2017-10-09' '2017-10-10' '2017-10-11' '2017-10-12' '2017-10-13' '2017-10-14' '2017-10-15' '2017-10-16' '2017-10-17' '2017-10-18' '2017-10-19' '2017-10-20' '2017-10-21' '2017-10-22' '2017-10-23' '2017-10-24' '2017-10-25' '2017-10-26' '2017-10-27'

Description of Issue

It only gets tweets from some non-consecutive days when you search over a period of more than a month for common words with tons of results.

Environment Details

Using Windows, Linux? Linux
What OS version? Ubuntu
Running this in Anaconda? No
Jupyter Notebook? No
Terminal? Yes

haccer commented 6 years ago

I'll look into this while I look at https://github.com/haccer/twint/issues/106

haccer commented 6 years ago

I just noticed Twitter has added a new "Quality Filter"... maybe that has something to do with it; I just explicitly turned it off.

Nestor75 commented 6 years ago

I read #106 and it could be related, but even though the problem looks the same, the way of doing the search and the parameters were different, which is why I opened a new issue. Please feel free to close it, as I am also following #106.

Nestor75 commented 6 years ago

After applying your last commit, I ran the first query again and the result was not really much better. This is the code:

import twint
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql1_en"

c.Search = "elonmusk%20OR%20testla%20OR%20electric%20OR%20car%20OR%20apple%20OR%20volvo%20OR%20uber"

c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)

Before the change, it only returned 25831 tweets, and only from these non-consecutive days: 2017-09-27, 2017-10-26, 2017-10-27.

After the change: it returned 181,687 tweets, which is a much bigger number than before, but still only for four days: '2017-09-26' '2017-09-27' '2017-10-26' '2017-10-27'.

And as I checked previously, if I use c.Since = "2017-10-23" and c.Until = "2017-10-24" with the same query, I get 81557 tweets, so... it seems there is something else :(

haccer commented 6 years ago

Since I suspect this might be related to #106, I'm running the first script above with both the master branch on GitHub and the current stable release on PyPI.

haccer commented 6 years ago

The pip script stopped at 19234 tweets.

The master branch script is still going.

haccer commented 6 years ago

It might take me a while to debug this... Large queries weren't necessarily a problem in the past, as I collected around 650k Tweets from the #metoo search...

Nestor75 commented 6 years ago

In the past I also got more than 500K tweets without any problem, and I always download the code from GitHub; I don't use pip to get twint.

haccer commented 6 years ago

Right now the master run has reached the point where the pip version ended, which is significantly more tweets... I suspect the quality filter option I recently added has something to do with that.

I've currently collected 263932 tweets using the version of Twint on the master branch. The script I'm using is the same as your first one, which initially had the problem:

import twint
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql1_en"

c.Search = "elonmusk%20OR%20testla%20OR%20electric%20OR%20car%20OR%20apple%20OR%20volvo%20OR%20uber"

c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c) 

Nestor75 commented 6 years ago

I used the version with the quality-filter update and got more tweets, but only from 4 days, while the time range was almost 2 months :(

Nestor75 commented 6 years ago

I have performed another test with the quality filter off.

The query was exactly the same for the four tests I did; I only changed the time frames:

1st test (testmysql1_en): c.Since = "2017-09-01", c.Until = "2017-10-27"
2nd test (testmysql2_en): c.Since = "2017-10-20", c.Until = "2017-10-27"
3rd test (testmysql3_en): c.Since = "2017-10-10", c.Until = "2017-10-19"
4th test (testmysql4_en): c.Since = "2017-10-01", c.Until = "2017-10-09"

The query was:

import twint
# common info
c = twint.Config()
c.Count = True
# output database

c.search_name = "testmysql1_en"
c.Search = "elonmusk OR testla OR electri OR car OR apple OR volvo OR uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)

The number of tweets obtained in each test:

COUNT(`tweets`.`id`),  Search
'26552', 'testmysql1_en'
'105321', 'testmysql2_en'
'25661', 'testmysql3_en'
'25852', 'testmysql4_en'

The tweets obtained per day and per test:

count(*), date, Search
'279', '2017-09-27', 'testmysql1_en'
'10404', '2017-10-26', 'testmysql1_en'
'15869', '2017-10-27', 'testmysql1_en'
'89451', '2017-10-26', 'testmysql2_en'
'15870', '2017-10-27', 'testmysql2_en'
'11519', '2017-10-18', 'testmysql3_en'
'14142', '2017-10-19', 'testmysql3_en'
'12944', '2017-10-08', 'testmysql4_en'
'12908', '2017-10-09', 'testmysql4_en'

I got tweets for only two or three days per test, and the shortest time frame was a week. In tests 1 and 2 the Until parameter is the same, yet test 1 obtained 10,404 tweets on 2017-10-26 while test 2 obtained 89,451 for the same day. It is very weird :( These numbers don't make sense.

haccer commented 6 years ago

The script finished with:

912305785987837952 2017-09-25 13:20:23 UTC <bellestarr48> @JoyAnnReid @nycsouthpaw This is from May....has anyone noticed that American citizens are dying....no food, water or electric !!! It's an island underwater!!!
[+] Finished: Successfully collected 346581 Tweets.

haccer commented 6 years ago

I think it's time to implement some logging to better debug the last request of each run, because this run stopped early, given that Since was c.Since = "2017-09-01".

Nestor75 commented 6 years ago

There is a debug parameter, isn't there?

Nestor75 commented 6 years ago

I closed it by mistake; I am reopening it.

haccer commented 6 years ago

I removed the debug param a while ago because it needed to be redone

Nestor75 commented 6 years ago

There is a debug parameter, isn't there? How does it work? What kind of log does it generate?

Nestor75 commented 6 years ago

Any guidance on how I can debug it?

haccer commented 6 years ago

Yeah, if you want to debug on your own:

After line 23 in search.py, before self.feed = [], put:

print(response, file=open("twint.log", "w", encoding="utf-8"))

Then when the script stops, we can look at twint.log which will contain the last response so we can try to figure out why it stopped early.
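For orientation, a rough sketch of where that debug line ends up inside the Feed coroutine (the surrounding layout is taken from the search.py snippets quoted later in this thread, so treat it as illustrative rather than the exact release code):

async def Feed(self):
    response = await get.RequestUrl(self.config, self.init)
    # proposed debug line: dump the raw response so the last one
    # received before an early stop can be inspected in twint.log
    print(response, file=open("twint.log", "w", encoding="utf-8"))
    self.feed = []
    try:
        self.feed, self.init = feed.Json(response)
    except:
        pass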

Nestor75 commented 6 years ago

I will do it, many Many thanks The point is that all the test finished successfully, with no error... Let's see what the log will register

haccer commented 6 years ago
          <div class="errorpage-canvas">
            <img class= "errorpage-illustration errorpage-robot" src= "https://abs.twimg.com/errors/robot.png" >
          </div>
          <h1 id="title">Something is technically wrong.</h1>
          <p id="desc">Thanks for noticing&mdash;we're going to fix it up and have things back to normal soon.</p>

This showed up in twint.log when the run stopped... This is likely due to the number of requests being sent, lol... So what I can do is write something that detects this error, waits a little bit (in 10-second increments), and then tries again to see if it works. Sound good?
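A minimal sketch of that retry idea (a hypothetical helper, not the fix that actually landed; it reuses the get.RequestUrl coroutine from search.py and keys off the error-page text shown above):

import asyncio

from twint import get  # twint's internal request module used by search.py

async def request_with_retry(config, init, max_attempts=5):
    # Re-issue the request whenever Twitter serves its error page,
    # waiting in growing 10-second increments before each retry.
    response = await get.RequestUrl(config, init)
    attempt = 1
    while "Something is technically wrong" in response and attempt < max_attempts:
        await asyncio.sleep(10 * attempt)
        response = await get.RequestUrl(config, init)
        attempt += 1
    return response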

haccer commented 6 years ago

This is likely the cause of the issue in #106 too.

Nestor75 commented 6 years ago

I ran the queries again last night and attach the final logs. These are the numbers of tweets per test:

COUNT(tweets.id), search
'26552', 'testmysql1_en'
'22992', 'testmysql1_en_log'
'105321', 'testmysql2_en'
'20312', 'testmysql2_en_log'
'25661', 'testmysql3_en'
'19991', 'testmysql3_en_log'
'25852', 'testmysql4_en'
'20357', 'testmysql4_en_log'

And per day:

count(*), date, Search
'279', '2017-09-27', 'testmysql1_en'
'10404', '2017-10-26', 'testmysql1_en'
'15869', '2017-10-27', 'testmysql1_en'
'7159', '2017-10-26', 'testmysql1_en_log'
'15833', '2017-10-27', 'testmysql1_en_log'
'89451', '2017-10-26', 'testmysql2_en'
'15870', '2017-10-27', 'testmysql2_en'
'4479', '2017-10-26', 'testmysql2_en_log'
'15833', '2017-10-27', 'testmysql2_en_log'
'11519', '2017-10-18', 'testmysql3_en'
'14142', '2017-10-19', 'testmysql3_en'
'5856', '2017-10-18', 'testmysql3_en_log'
'14135', '2017-10-19', 'testmysql3_en_log'
'12944', '2017-10-08', 'testmysql4_en'
'12908', '2017-10-09', 'testmysql4_en'
'7472', '2017-10-08', 'testmysql4_en_log'
'12885', '2017-10-09', 'testmysql4_en_log'

It finished with an error, and it does not always fail at the same point even though the query is exactly the same.

testmysql1_en_log.log testmysql2_en_log.log testmysql3_en_log.log testmysql4_en_log.log

Nestor75 commented 6 years ago

The solution you proposed sounds great :)

Nestor75 commented 6 years ago

One comment here, even though it might have nothing to do with this: the master branch does not have the quality filter off. File url.py, line 49:

url += "reset_error_state=false&src=typ&max_position={}&q=".format(init)

but you tried it... In the mysql branch it is off.

Which one is the right one in the end?

url += "reset_error_state=false&src=typd&qf=off&max_position={}&q=".format(init)

or

url += "reset_error_state=false&src=typd&max_position={}&q=".format(init)

haccer commented 6 years ago

Yeah, the master was reverted; in the current version I'm testing I have qf=off.

Nestor75 commented 6 years ago

I am doing some tests, increasing the timeout and using this code in the search.py file:

async def Feed(self):
    response = await get.RequestUrl(self.config, self.init)
    print(response, file=open(self.config.search_name + ".log", "w", encoding="utf-8"))  # debug
    if "Twitter / Error" in response:
        time.sleep(30)  # note: requires `import time` at the top of search.py
        response = await get.RequestUrl(self.config, self.init)
        print("1", file=open(self.config.search_name + "error.log", "a", encoding="utf-8"))  # debug
        print(response, file=open(self.config.search_name + ".log", "w", encoding="utf-8"))  # debug
    self.feed = []
    try:
        self.feed, self.init = feed.Json(response)
    except:
        pass

To be honest I am not sure what I am doing or whether it will improve anything... For the moment it has gathered more than 100K tweets, but there is something very weird: the dates jump from 26 Oct to 27 Sep:

count(*), date, Search
'49264', '2017-09-26', 'testmysql1_en_log_1'
'16916', '2017-09-27', 'testmysql1_en_log_1'
'99842', '2017-10-26', 'testmysql1_en_log_1'
'15874', '2017-10-27', 'testmysql1_en_log_1'

I will share the outcome once it finishes.

Nestor75 commented 6 years ago

Well, I have the outcome: almost 700K (694986) tweets gathered, but only from 7 days, when the time frame was 2017-09-01 to 2017-10-27: 2 days from October and 5 days from September.

count(*), date, Search
'34536', '2017-09-23', 'testmysql1_en_log_1'
'154594', '2017-09-24', 'testmysql1_en_log_1'
'183891', '2017-09-25', 'testmysql1_en_log_1'
'189733', '2017-09-26', 'testmysql1_en_log_1'
'16959', '2017-09-27', 'testmysql1_en_log_1'
'100021', '2017-10-26', 'testmysql1_en_log_1'
'15911', '2017-10-27', 'testmysql1_en_log_1'

It registered 7 "Twitter / Error" responses... and that is the same as the number of days... It could be a coincidence, but...

The last log entry registered is:

{"min_position":"TWEET-911651435296825344-923700770066776065","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}

haccer commented 6 years ago

yeah,

{"min_position":"TWEET-911651435296825344-923700770066776065","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}

is the response when there are no more tweets for the query, and it should be the last response.
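For reference, a minimal way to check for that terminal response when inspecting a dump like the one above (plain-Python sketch, separate from twint's own parsing in feed.Json):

import json

def is_last_page(raw_response: str) -> bool:
    # The search-timeline JSON sets has_more_items to false once the
    # cursor (min_position) has reached the end of the result set.
    data = json.loads(raw_response)
    return not data.get("has_more_items", True)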

Nestor75 commented 6 years ago

It doesn't make sense, because the time frame was from 2017-09-01 until 2017-10-27 and there are only results from 7 days: 2 from October (27 and 26) and 5 from September (27, 26, 25, 24 and 23).

What about the rest of the days? There are only results for 7 of the almost 60 days I asked for, and I checked that there are tweets on, for example, 23 and 24 of October, yet they don't appear in that query...

The code was:

import twint
# common info
c = twint.Config()
c.Count = True
# output database
c.search_name = "testmysql1_en_log_1"
c.Search = "elonmusk OR testla OR electri OR car OR apple OR volvo OR uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)

And if I run it again, most probably there will be a different number of tweets and on different days, so there must be a problem somewhere when the query returns very many results and the time frame is bigger than two days, because even if you reduce it to a week you still don't get tweets from all the days. A workaround would be to query only two days at a time, e.g. 2017-09-01 until 2017-09-03, then 2017-09-03 until 2017-09-05, and so on, but...
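A rough sketch of that two-day-window workaround (it reuses the Config fields from the scripts earlier in this thread; the search string and output file are just the ones from this example):

import twint
from datetime import date, timedelta

start, end = date(2017, 9, 1), date(2017, 10, 27)
window = timedelta(days=2)

since = start
while since < end:
    until = min(since + window, end)
    c = twint.Config()
    c.Search = "elonmusk OR testla OR electri OR car OR apple OR volvo OR uber"
    c.Since = str(since)   # date objects print as "YYYY-MM-DD"
    c.Until = str(until)
    c.Lang = "en"
    c.Store_csv = True
    c.Output = "file.csv"
    twint.run.Search(c)    # scrape one two-day slice at a time
    since = until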

Nestor75 commented 6 years ago

I think I found out where the problem is with very large queries. If you don't specify a Timedelta, the default is 30 days. Therefore the first batch in the time frame I specified is 2017-09-27 until 2017-10-27. The outcome I got was, for example, something like this: there are only results from 7 days, 2 from October (27 and 26) and 5 from September (27, 26, 25, 24 and 23).

I tried to bypass the Timedelta logic to go over the whole time frame at once, but then I only got 3 days from October... The last thing I am trying is specifying a timedelta of 2 days, and it seems to be working fine; I think it will get the tweets from all the days in the time frame.

The code I am referring to is in search.py; I added some prints for debugging:

async def main(self):
    if self.config.User_id is not None:
        self.config.Username = await get.Username(self.config.User_id)

    if self.config.Since and self.config.Until:
        _days = timedelta(days=int(self.config.Timedelta))
        print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")#debug
        print(_days)#debug
        while self.d._since < self.d._until:
            print( self.d._since)#debug
            print(self.d._until)#debug
            self.config.Since = str(self.d._until - _days)
            print(self.config.Since)
            self.config.Until = str(self.d._until)
            print(self.config.Until)
            if len(self.feed) > 0:
                await self.tweets()
            else:
                self.d._until = self.d._until - _days
                self.feed = [-1]

            if get.Limit(self.config.Limit, self.count):
                self.d._until = self.d._until - _days
                self.feed = [-1]
    else:
        while True:
            if len(self.feed) > 0:
                await self.tweets()
            else:
                break

            if get.Limit(self.config.Limit, self.count):
                break

    verbose.Count(self.config, self.count)

I will share the results when the query finishes and will do some more tests... As the queries are very large, it will take some hours to get them.

DavidPerea commented 6 years ago

Do you know if anything has been solved related to this problem and to #106?

Nestor75 commented 6 years ago

I don't know.

What I can say is that using a two-day timedelta you can get tweets from the whole time frame. In my case I got almost 1,700,000 tweets from almost 60 days without any problem. It's a really huge number.

DavidPerea commented 6 years ago

What do you mean by timedelta? Could you give me an example, please?

Nestor75 commented 6 years ago

There is a parameter in twint called timedelta:

-t, --timedelta | Time interval for every request

It defaults to 30 days, and when you ask Twitter for tweets over a specific period, twint requests them in 30-day batches. It seems that when the query is too large, with the default 30-day value you don't get all the data properly. So... if you use -t 2 you request data in 2-day batches, and in my case that worked very well.

You could try it and see what happens.
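In the Python interface the equivalent setting, judging by the config.Timedelta field read in the search.py snippet above (treat the attribute name as an assumption for your twint version), would look roughly like:

import twint

c = twint.Config()
c.Search = "malaga"        # example query from the comments below
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Timedelta = 2            # request in 2-day batches instead of the default 30
c.Count = True
c.Store_csv = True
c.Output = "file.csv"
twint.run.Search(c)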

DavidPerea commented 6 years ago

I tried to do it as you said, but nothing changed. I keep getting 3110 tweets; it does not let me get older tweets.

Nestor75 commented 6 years ago

I tried the command you posted in #106, but with the count and timedelta parameters:

python twint.py -u malaga -o file.csv --csv --count -t 2

and I stopped it manually when it had collected almost 7K tweets.

Are you sure you are using the latest version?

screenshot_1

DavidPerea commented 6 years ago

Sorry, I explained myself badly. You are absolutely right: doing it the way you described previously, it works perfectly.

However, it does not work when the --profile-full option is added. With that option added, to get retweets too, it does not exceed 3000 tweets.

Nestor75 commented 6 years ago

I will check it and share the outcome... Let's see how it works for me.

haccer commented 6 years ago

Closing this, since the new --resume option mitigates it.

It's on the dev branch right now; I'll merge it into master once the dev branch seems stable.
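For anyone landing here later, a hypothetical invocation (the thread doesn't spell out what argument --resume expects, so the resume_file.txt name below is purely an assumption used to illustrate resuming a large interrupted query):

python twint.py -s "elonmusk OR testla" --since 2017-09-01 --until 2017-10-27 -o file.csv --csv --resume resume_file.txt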