Nestor75 closed this issue 6 years ago.
I'll look into this while I look at https://github.com/haccer/twint/issues/106
I just noticed Twitter has added a new "Quality Filter"... maybe that has something to do with it; I just explicitly turned it off.
I read #106 and it could be related, but even though the problem looks the same, the way of doing the search and the parameters were different, which is why I opened a new issue. Please feel free to close it, as I am also following #106.
After applying your last commit, I launched the first query again and the result was not much better. This is the code:
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql1_en"
c.Search = "elonmusk%20OR%20testla%20OR%20electric%20OR%20car%20OR%20apple%20OR%20volvo%20OR%20uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)
Before the change, this was the result: it returned only 25,831 tweets, all from these non-consecutive days: 2017-09-27, 2017-10-26, 2017-10-27.
After the change, it returned 181,687 tweets, a much bigger number than before, but still only for four days: 2017-09-26, 2017-09-27, 2017-10-26, 2017-10-27.
And as I checked previously, if I use c.Since = "2017-10-23" and c.Until = "2017-10-24" with the same query, I get 81,557 tweets, so..... it seems there is something else :(
Since I suspect this might be related to #106, I'm running the first script above with both the master branch from GitHub and the current stable release from PyPI.
The PyPI script stopped at 19,234 tweets.
The master branch script is still going.
It might take me a while to debug this.... Large queries weren't necessarily a problem in the past; I collected around 650k Tweets from the #metoo search...
In the past I also got more than 500K tweets without any problem, and I always download the code from GitHub; I don't use pip to get twint.
Right now the master run has passed the point where the pip version ended, which means significantly more tweets... I suspect the quality filter option I recently added has something to do with that.
I've currently collected 263,932 tweets using the version of Twint on the master branch. The script I'm using is the same as your first one, which initially had the problem:
import twint
c = twint.Config()
c.Count = True
# output csv
c.Output = "file.csv"
c.Store_csv = True
c.search_name = "testmysql1_en"
c.Search = "elonmusk%20OR%20testla%20OR%20electric%20OR%20car%20OR%20apple%20OR%20volvo%20OR%20uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)
I used the version with the quality-filter update and got more tweets, but only from 4 days, when the time range was almost 2 months :(
I have performed another test with the quality filter off.
The query was exactly the same for the four tests I did; I only changed the time frames.
1st test (testmysql1_en): c.Since = "2017-09-01", c.Until = "2017-10-27"
2nd test (testmysql2_en): c.Since = "2017-10-20", c.Until = "2017-10-27"
3rd test (testmysql3_en): c.Since = "2017-10-10", c.Until = "2017-10-19"
4th test (testmysql4_en): c.Since = "2017-10-01", c.Until = "2017-10-09"
the query was:
import twint
# common info
c = twint.Config()
c.Count = True
# output database
c.search_name = "testmysql1_en"
c.Search = "elonmusk OR testla OR electri OR car OR apple OR volvo OR uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)
The number of tweets obtained in each test is:
COUNT(`tweets`.`id`), Search
'26552', 'testmysql1_en'
'105321', 'testmysql2_en'
'25661', 'testmysql3_en'
'25852', 'testmysql4_en'
the tweets obtained per day and per test are:
count(*), date, Search
'279', '2017-09-27', 'testmysql1_en'
'10404', '2017-10-26', 'testmysql1_en'
'15869', '2017-10-27', 'testmysql1_en'
'89451', '2017-10-26', 'testmysql2_en'
'15870', '2017-10-27', 'testmysql2_en'
'11519', '2017-10-18', 'testmysql3_en'
'14142', '2017-10-19', 'testmysql3_en'
'12944', '2017-10-08', 'testmysql4_en'
'12908', '2017-10-09', 'testmysql4_en'
I got tweets for only two or three days per test, and the shortest time frame was a week. For tests one and two the until parameter is the same, yet test 1 obtained 10,404 tweets on 2017-10-26 while test 2 obtained 89,451 for that same day. It is very weird :( these numbers don't make sense.
The script had finished with:
912305785987837952 2017-09-25 13:20:23 UTC <bellestarr48> @JoyAnnReid @nycsouthpaw This is from May....has anyone noticed that American citizens are dying....no food, water or electric !!! It's an island underwater!!!
[+] Finished: Successfully collected 346581 Tweets.
I think it's time to implement some logging to better debug the last request of each run, because this stopped early: Since was set to "2017-09-01".
There is a debug parameter, isn't there?
I closed this by mistake; reopening it.
I removed the debug param a while ago because it needed to be redone
There is a debug parameter, isn't there? How does it work? What kind of log does it generate?
Any guide about how can I debug it?
Yeah, if you want to debug on your own: after line 23 in search.py, before self.feed = [], put:
print(response, file=open("twint.log", "w", encoding="utf-8"))
Then when the script stops, we can look at twint.log which will contain the last response so we can try to figure out why it stopped early.
I will do it, many many thanks. The point is that all the tests finished successfully, with no error... Let's see what the log registers.
<div class="errorpage-canvas">
<img class= "errorpage-illustration errorpage-robot" src= "https://abs.twimg.com/errors/robot.png" >
</div>
<h1 id="title">Something is technically wrong.</h1>
<p id="desc">Thanks for noticing—we're going to fix it up and have things back to normal soon.</p>
This showed up in twint.log when the run stopped.... It's likely due to the number of requests being sent, lol. So what I can do is write something that detects this error, waits a little bit (in 10-second increments), then tries again to see if it works. Sound good?
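A minimal sketch of that wait-and-retry idea. Everything here is an assumption for illustration: `fetch_page` stands in for whatever function performs the HTTP GET (not twint's actual API), and `step`/`max_wait` are made-up knobs for the 10-second increments.

```python
import time

# Text that appears on Twitter's "Something is technically wrong" error page
ERROR_MARKER = "Something is technically wrong"

def request_with_retry(fetch_page, url, step=10, max_wait=60):
    """Re-request in growing increments (10 s, 20 s, ...) while the
    response is Twitter's error page instead of search results.
    `fetch_page` is a placeholder for the real request function."""
    wait = step
    response = fetch_page(url)
    while ERROR_MARKER in response and wait <= max_wait:
        time.sleep(wait)
        wait += step
        response = fetch_page(url)
    return response
```

Capping the total wait (`max_wait`) avoids spinning forever if Twitter keeps refusing the request.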
this is likely the cause of the issue for #106 too
I ran the queries again last night; attached are the final logs.
these are the results of number of tweets per test:
COUNT(`tweets`.`id`), Search
'26552', 'testmysql1_en'
'22992', 'testmysql1_en_log'
'105321', 'testmysql2_en'
'20312', 'testmysql2_en_log'
'25661', 'testmysql3_en'
'19991', 'testmysql3_en_log'
'25852', 'testmysql4_en'
'20357', 'testmysql4_en_log'
and per day:
count(*), date, Search
'279', '2017-09-27', 'testmysql1_en'
'10404', '2017-10-26', 'testmysql1_en'
'15869', '2017-10-27', 'testmysql1_en'
'7159', '2017-10-26', 'testmysql1_en_log'
'15833', '2017-10-27', 'testmysql1_en_log'
'89451', '2017-10-26', 'testmysql2_en'
'15870', '2017-10-27', 'testmysql2_en'
'4479', '2017-10-26', 'testmysql2_en_log'
'15833', '2017-10-27', 'testmysql2_en_log'
'11519', '2017-10-18', 'testmysql3_en'
'14142', '2017-10-19', 'testmysql3_en'
'5856', '2017-10-18', 'testmysql3_en_log'
'14135', '2017-10-19', 'testmysql3_en_log'
'12944', '2017-10-08', 'testmysql4_en'
'12908', '2017-10-09', 'testmysql4_en'
'7472', '2017-10-08', 'testmysql4_en_log'
'12885', '2017-10-09', 'testmysql4_en_log'
It finished with an error, and it doesn't always fail at the same point, even though the query is exactly the same.
testmysql1_en_log.log testmysql2_en_log.log testmysql3_en_log.log testmysql4_en_log.log
the solution you proposed sounds great :)
One comment here, even though it might have nothing to do with this: the master branch does not have the quality filter off. In url.py, line 49:
url += "reset_error_state=false&src=typ&max_position={}&q=".format(init)
But you tried it... in the mysql branch it is off.
Which one is the right one in the end?
url += "reset_error_state=false&src=typd&qf=off&max_position={}&q=".format(init)
or
url += "reset_error_state=false&src=typd&max_position={}&q=".format(init)
Yeah, master was reverted; in the current version I'm testing I have qf=off.
I am doing some tests with an increased timeout and this code in search.py:
async def Feed(self):
    response = await get.RequestUrl(self.config, self.init)
    print(response, file=open(self.config.search_name + ".log", "w", encoding="utf-8"))  # debug
    if "Twitter / Error" in response:
        time.sleep(30)
        response = await get.RequestUrl(self.config, self.init)
        print("1", file=open(self.config.search_name + "error.log", "a", encoding="utf-8"))  # debug
        print(response, file=open(self.config.search_name + ".log", "w", encoding="utf-8"))  # debug
    self.feed = []
    try:
        self.feed, self.init = feed.Json(response)
    except:
        pass
To be honest, I am not sure what I am doing or whether it will improve anything..... For the moment it has gathered more than 100K tweets, but there is something very weird: the dates jump from 26 Oct to 27 Sep:
'49264', '2017-09-26', 'testmysql1_en_log_1'
'16916', '2017-09-27', 'testmysql1_en_log_1'
'99842', '2017-10-26', 'testmysql1_en_log_1'
'15874', '2017-10-27', 'testmysql1_en_log_1'
I will share the outcome once it finishes.
Well, I have the outcome: almost 700K (694,986) tweets gathered, but only from 7 days, when the time frame was 2017-09-01 to 2017-10-27: 2 days from October and 5 days from September.
count(*), date, Search
'34536', '2017-09-23', 'testmysql1_en_log_1'
'154594', '2017-09-24', 'testmysql1_en_log_1'
'183891', '2017-09-25', 'testmysql1_en_log_1'
'189733', '2017-09-26', 'testmysql1_en_log_1'
'16959', '2017-09-27', 'testmysql1_en_log_1'
'100021', '2017-10-26', 'testmysql1_en_log_1'
'15911', '2017-10-27', 'testmysql1_en_log_1'
It registered 7 "Twitter / Error" responses.... and that is the same as the number of days with results... it could be a coincidence, but....
The last log entry registered is:
yeah,
{"min_position":"TWEET-911651435296825344-923700770066776065","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}
is the response when there are no more tweets for the query, and it should be the last response.
It doesn't make sense, because the time frame was from 2017-09-01 until 2017-10-27 and there are only results from 7 days: 2 from October (27 and 26) and 5 from September (27, 26, 25, 24 and 23).
What about the rest of the days???? There are results for only 7 of the almost 60 days I asked for, and I checked that there are tweets on, for example, 23 and 24 of October, yet they don't appear in that query.....
The code was:
import twint

c = twint.Config()
c.Count = True
c.search_name = "testmysql1_en_log_1"
c.Search = "elonmusk OR testla OR electri OR car OR apple OR volvo OR uber"
c.Since = "2017-09-01"
c.Until = "2017-10-27"
c.Lang = "en"
twint.run.Search(c)
And if I run it again, most probably there will be a different number of tweets on different days. So there must be a problem somewhere when the query returns very many results and the time frame is bigger than two days, because even reducing it to a week does not get you the tweets from all the days. A workaround would be to query two days at a time, e.g. 2017-09-01 to 2017-09-03, then 2017-09-03 to 2017-09-05, and so on, but.....
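That two-days-at-a-time workaround can be scripted with a small window generator. A rough sketch; `windows` is a name I invented, and the pairs it yields would each be assigned to a fresh Config's Since/Until before a separate twint run:

```python
from datetime import date, timedelta

def windows(since, until, step_days=2):
    """Yield (since, until) string pairs that cover the whole range
    in short chunks, e.g. two days at a time."""
    start = date.fromisoformat(since)
    end = date.fromisoformat(until)
    while start < end:
        stop = min(start + timedelta(days=step_days), end)
        yield str(start), str(stop)
        start = stop

# each pair would then become c.Since / c.Until for one twint.run.Search(c)
```

The last chunk is clamped to `until`, so an odd-length range still ends exactly on the requested date.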
I think I found where the problem is with very large queries. If you don't specify a Timedelta, the default is 30 days. Therefore the first batch in the time frame I specified is 2017-09-27 to 2017-10-27. The outcome I got was, for example, results from only 7 days: 2 from October (27 and 26) and 5 from September (27, 26, 25, 24 and 23).
I tried to bypass the timedelta logic and go over the whole time frame at once, but then I only got 3 days from October.... The last thing I am trying is a timedelta of 2 days, and it seems to be working fine; I think it will get the tweets from all the days in the time frame.
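The batching described above can be sketched standalone. This mirrors the splitting behaviour (walking backwards from Until in Timedelta-sized steps), not twint's exact code; `batches` is a made-up name:

```python
from datetime import date, timedelta

def batches(since, until, timedelta_days=30):
    """Walk backwards from `until` in fixed-size steps, the way a
    Since/Until range gets split into per-request windows."""
    since_d = date.fromisoformat(since)
    until_d = date.fromisoformat(until)
    step = timedelta(days=timedelta_days)
    out = []
    while since_d < until_d:
        out.append((str(until_d - step), str(until_d)))
        until_d -= step
    return out
```

With the 30-day default, the first window for 2017-09-01 to 2017-10-27 is exactly 2017-09-27 to 2017-10-27, matching the batch described above, so if a window's request dies early, all the earlier days in that window are simply never fetched.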
The code I am referring to is in search.py; I added some prints for debugging:
async def main(self):
    if self.config.User_id is not None:
        self.config.Username = await get.Username(self.config.User_id)

    if self.config.Since and self.config.Until:
        _days = timedelta(days=int(self.config.Timedelta))
        print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")  # debug
        print(_days)  # debug
        while self.d._since < self.d._until:
            print(self.d._since)  # debug
            print(self.d._until)  # debug
            self.config.Since = str(self.d._until - _days)
            print(self.config.Since)
            self.config.Until = str(self.d._until)
            print(self.config.Until)
            if len(self.feed) > 0:
                await self.tweets()
            else:
                self.d._until = self.d._until - _days
                self.feed = [-1]
            if get.Limit(self.config.Limit, self.count):
                self.d._until = self.d._until - _days
                self.feed = [-1]
    else:
        while True:
            if len(self.feed) > 0:
                await self.tweets()
            else:
                break
            if get.Limit(self.config.Limit, self.count):
                break
    verbose.Count(self.config, self.count)
I will share the results when the query finishes, and I will do some more tests.... As the queries are very large, it will take some hours to run them.
Do you know whether this has solved anything related to this problem and to #106?
I don't know.
What I can say is that using a two-day timedelta you can get tweets from the whole time frame. In my case I got almost 1,700,000 tweets from almost 60 days without any problem. It's a really huge number.
What do you mean by deltatime? Could you give me an example, please?
There is a parameter in twint called timedelta:
-t, --timedelta | Time interval for every request
By default it is 30 days: when you ask Twitter for tweets over a specific period, twint requests them in 30-day batches. It seems that when the query is too large, with the default 30-day value you don't get all the data properly. So if you use -t 2, you request data in 2-day batches; in my case that worked very well.
you could try and let's see what happens.
I tried to do it as you said, but nothing changed. I keep getting 3110 tweets; it does not let me get older tweets.
I tried the command you posted in #106, but with the count and timedelta parameters:
python twint.py -u malaga -o file.csv --csv --count -t 2
and I stopped it manually when it got almost 7K tweets.
Are you sure you are using the latest version?
Sorry, I explained myself badly. You are absolutely right; the way you described it, it works perfectly.
However, it does not work when the --profile-full flag is added. With that flag (to get retweets too), it does not exceed 3000 tweets.
I will check it and I will share the outcome....let's see how it works for me
Closing this since the new --resume option mitigates this.
It's on the dev branch right now, I'll merge with the master once the dev branch seems stable
Issue Template
Initial Check
Command Ran
It returns only 25,831 tweets, from only these non-consecutive days: 2017-10-27, 2017-10-26 and 2017-09-27,
which makes me think there is a problem with how the system handles dates.
If I use the same parameters but change the since and until values to c.Since = "2017-10-23" and c.Until = "2017-10-24",
then I get 81,557 tweets from 2017-10-23 and 2017-10-24. Why did these not appear in the previous search? This time frame is included in the previous one.
And to make it even weirder... if I use the first time frame but search for something uncommon,
I get 670 tweets spread across every single day from 2017-08-28 through 2017-10-27.
Description of Issue
Environment Details