Closed arranjdavis closed 5 years ago
I tried, and config.Since seems to be broken. Thank you for reporting.
@pielco11 I would like to help on this issue. Let me know if I can start working on it.
@varunu28 you can start whenever you want, thank you for your help!
@pielco11 Yes, of course, thank you for twint! It is a great tool.
Any idea what's happening with this? Has twitter changed something? Thanks for all your work.
@cbjrobertson I did not check what's going on, yet.
Thanks:) This project quickly eclipsed my ability to help much with it, but would be happy to contribute if I could.
@cbjrobertson you could print the url of a simple request, then try the same query via Twitter Advanced Search and look for differences... it's like debugging, but no need for extra tools
@pielco11 -- sorry, but I'm a little confused. I printed a dummy search url by running with twint.Config.Debug = True, but when entered into Twitter it just downloads a json file. The file appears to contain a bunch of html of tweet objects, seemingly. Is this the expected behaviour?
@cbjrobertson that's correct https://github.com/twintproject/twint/blob/e3c28aae6e7a7947d0aabb31854875aa942ba780/twint/feed.py#L40-L46
Hi all (@pielco11, @cbjrobertson),
Thanks for chasing this up. I've found a workaround for this, which is to search the date span (e.g., July 1 to July 31), then get missing dates from the outputted tweets (e.g., July 1 to July 13), then search the missing dates as since and until dates, and then continue this process until the entire since - until date span is covered.
Obviously, that is not a long term solution. @pielco11, I'll have a go doing what you suggested to @cbjrobertson this week, and I'll get back to you if I make any progress. But, like him, this might be past my abilities!
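The workaround described above could be sketched roughly as follows. This is only an illustration: `next_window` is a hypothetical helper (not part of twint), and it assumes that tweet dates are available as 'YYYY-MM-DD' strings and that `until` is exclusive, as with Twitter's search operators.

```python
from datetime import datetime

FMT = "%Y-%m-%d"

def next_window(since, until, returned_dates):
    """Return the (since, until) pair still to be searched, or None if covered.

    Hypothetical helper, not part of twint. All dates are 'YYYY-MM-DD' strings.
    """
    if not returned_dates:
        # Nothing came back, so the whole span is still missing.
        return (since, until)
    earliest = min(datetime.strptime(d, FMT).date() for d in returned_dates)
    if earliest <= datetime.strptime(since, FMT).date():
        # The run reached the start of the span.
        return None
    # Search the missing days again, up to (but excluding) the earliest day seen.
    return (since, earliest.strftime(FMT))

# Searching July 1 to July 31 only returned tweets back to July 14:
print(next_window("2019-07-01", "2019-08-01", ["2019-07-31", "2019-07-14"]))
```

The caller would repeat the search with the returned pair until `None` comes back. The earliest day seen may itself be only partly covered, so a stricter version could re-search that day as well and de-duplicate by tweet id.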
@arranjdavis I tried testing some different search terms and couldn't replicate your issue. Can you run one of the search terms that's causing your issue with c.Debug = True, then use the following function to extract the tweets & ids, and check against the Twitter advanced search response for the same query?
# dependencies
import pandas as pd
import json
from bs4 import BeautifulSoup
import re

def json_process(path):
    # Load the saved response; the tweet markup is in its 'items_html' field
    with open(path, 'r') as handle:
        req = json.load(handle)
    soup = BeautifulSoup(req['items_html'], features='lxml')
    tweets = soup.find_all("div", {'data-component-context': 'tweet'})
    texts = soup.find_all('p', class_=re.compile('^TweetTextSize'))
    d = {'text': [],
         'id': []}
    # Collect tweet ids and tweet texts into parallel columns
    for tweet in tweets:
        d['id'] += [tweet.get('data-tweet-id')]
    for text in texts:
        d['text'] += [text.get_text()]
    df = pd.DataFrame(d)
    return df
@pielco11 am aware this replicates your Json function. I'd forgotten you'd provided it and wrote a roughly similar thing 🤷♂️
This function takes a path either to twint-last-request.log or to the json that downloads when you enter any of the urls in twint-request_urls.log to twitter.
Okay so a few things to note:
I am able to replicate the bug, as in, if I search 'Facebook' with since='2016-12-06' and until='2016-12-07', the script stops at the same tweet as when I search 'Facebook' with since='2016-12-02' and until='2016-12-07'.
The script dies without any error messages. The twint-last-request.log file contains one line: {"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}
The information for the last tweet (the last outputted tweet) is:
Tweet id: 806270321531621376 Date: 2016-12-06 Time: 22:53:19,GMT User ID: 594256947 Username: cherylaskswhy Tweet: "I posted a new photo to Facebook http://fb.me/8l2l68YER "
Okay, that is all I can do today, but I will pick this up tomorrow, when I will follow @cbjrobertson's suggestion and check my results against Twitter's advanced search output.
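A small sketch of how the "empty last page" described above could be detected automatically. The field names come from the logged response; `is_empty_page` is a hypothetical helper, not part of twint, and the `items_html` value in the sample is abbreviated.

```python
import json

def is_empty_page(raw):
    """True if a timeline response has no tweet markup and reports no more items."""
    page = json.loads(raw)
    # An empty page contains only whitespace in 'items_html'.
    no_html = page.get("items_html", "").strip() == ""
    return no_html and not page.get("has_more_items", False)

# Abbreviated copy of the logged response (fewer newlines in items_html).
last = (
    '{"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=",'
    '"has_more_items":false,"items_html":"\\n\\n\\n \\n",'
    '"new_latent_count":0,"focused_refresh_interval":30000}'
)
print(is_empty_page(last))
```

A check like this only says that Twitter stopped returning results; it cannot tell whether the span was actually exhausted.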
I have tried to replicate with narrower search criteria, by limiting the user to the one from the tweet mentioned above (script below), and it collects all the tweets on the days specified (more tweets, actually, than using the same criteria in the Twitter advanced search function). Maybe it is something to do with really high volumes of tweets.
import twint
c = twint.Config()
c.Until = '2016-12-07'
c.Since = '2016-12-06'
c.Search = 'Facebook'
c.Username = 'cherylaskswhy'
c.Debug = True
twint.run.Search(c)
@cbjrobertson yeah, high tweet volumes could be a possibility. I also wonder if it is the next tweet that is causing the script to stop? Problem is that there is no way of telling what that tweet is (that I can think of, at least).
Another thing to note is that when I do the same search on the Twitter Advanced Search website (https://twitter.com/search?l=en&q=Facebook%20since%3A2016-12-06%20until%3A2016-12-07&src=typd&lang=en-gb) I can't even find the tweet I mentioned above (tweet id: 806270321531621376), and I am pretty sure I am not missing it.
Next step is to get a json file to pass to your function, @cbjrobertson, but the problem is that all of the urls in twint-request_urls.log are the same for me. They are all: https://twitter.com/i/search/timeline, and link to a 'File not found' error.
My Python version is 3.6.2, and I updated twint today with pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
@pielco11
w.r.t. -- "Next step is to get a json file to pass to your function, @cbjrobertson, but the problem is that all of the urls in twint-request_urls.log are the same for me. They are all: https://twitter.com/i/search/timeline, and link to 'File not found' error."
The same thing is happening to me. It makes it pretty hard to debug.
Reproducible script:
pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
import twint
c = twint.Config()
c.Until = '2016-12-07'
c.Since = '2016-12-06'
c.Search = 'Facebook'
c.Username = 'cherylaskswhy'
c.Debug = True
twint.run.Search(c)
twint-request_urls.log looks like:
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
twint-request_urls.log should consist of entries which look (something like) this:
https://twitter.com/i/search/timeline?f=tweets&vertical=default&l=en&lang=en&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&qf=off&max_position=thGAVUV0VFVBaAgKn1l-806128531029696512&q=from%3Acherylaskswhy%20Facebook%20since%3A2016-12-04%20until%3A2016-12-07
FWIW, this does not appear to be a bug in twint 1.1.4.3. When I run in an env with that installed, I get full request urls. (Though that version suffers from #249).
This seems like a separate bug. Would you like me to submit another issue?
Not so good and quite unexpected. I'm going to fix this later
Thanks for reporting!
When we changed the way Twint creates the requests for Twitter, we changed the query structure as well. Now it's "sanitized" and "human readable", so it now looks as expected and wanted @cbjrobertson
@pielco11 ok, but how do you use it to debug then? It doesn’t download json files as before, and all the requests look identical. What is the rationale behind this change?
Previously it was returning only the baseurl (http://twitter.com/i/search/timeline) and not the composed query. Now, it creates the full query and stores it in twint-request_urls.log.
Requests might seem the same, but they are not:
Plus you are still able to download jsons:
http://twitter.com/i/search/timeline?f=tweets&vertical=default&src=unkn&include_available_features=1&include_entities=1&max_position=thGAVUV0VFVBaAwLblh9Ld4h4WgoC1uceioOUeEjUAFQAlAAA=&reset_error_state=false&q=%20from%3Anoneprivacy
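To see at a glance how two logged requests differ, the query strings can be parsed and compared. A minimal sketch using only the standard library; `query_diff` is a hypothetical helper, and the `max_position` values in the example are made up.

```python
from urllib.parse import urlsplit, parse_qs

def query_diff(url_a, url_b):
    """Return {param: (values_in_a, values_in_b)} for parameters that differ."""
    qa = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    qb = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return {
        key: (qa.get(key), qb.get(key))
        for key in sorted(set(qa) | set(qb))
        if qa.get(key) != qb.get(key)
    }

base = "http://twitter.com/i/search/timeline?f=tweets&vertical=default"
a = base + "&max_position=AAA=&q=%20from%3Anoneprivacy"  # made-up cursor
b = base + "&max_position=BBB=&q=%20from%3Anoneprivacy"  # made-up cursor
print(query_diff(a, b))
```

Consecutive requests for the same search would normally be expected to differ only in `max_position`, the paging cursor.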
I'm confused. I am telling you that it is presently only returning the base url on my and @arranjdavis's systems...
I guess I'm confused too. I pushed a change to save the full url, as shown in the previous comment. So you have to run pip+git or pull the repo to get the patch.
I ran pip3 install twint --upgrade and I am now getting the full urls.
Thanks @pielco11 for the update, and to both for your help with this. I will now use the function that @cbjrobertson wrote to try to find the tweets that are causing the script to stop before the correct since date is reached.
Okay, so I downloaded the json files from the last and second-to-last urls in twint-request_urls.log.
Using the function json_process() from @cbjrobertson, I get the following output for the json from the second-to-last url in twint-request_urls.log:
    text                                                id
 0  I posted a new video to Facebook http://fb.me/...   806270367459373056
 1  My parents have facebook instead of me?..... l...   806270366134009856
 2  Would three of my Facebook friends please copy...   806270363982241793
 3  The hunt is on in Apex Legends Season 1. New L...   1107790127907004417
 4  I posted a new photo to Facebook http://fb.me/...   806270363478814720
 5  I don't go on Facebook exposing y'all for the ...   806270362094813184
 6  I posted a new photo to Facebook http://fb.me/...   806270358957473792
 7  I posted a new photo to Facebook http://fb.me/...   806270356755533824
 8  Thanks for posting those puppies on Facebook n...   806270349415481344
 9  I posted a new video to Facebook http://fb.me/...   806270346798268416
10  Y'all have 6 hours to wish me a happy birthday...   806270344805908480
11  I wouldn't have said anything but these commen...   806270342045913088
12  Tuesday Empowerment Night will be held on Face...   806270339516923909
13  Is @instagram Facebook now? Or is it Youtube? ...   806270337621102594
14  and after looking at Facebook, I was right.         806270335947509760
15  I posted 41 photos on Facebook http://fb.me/15...   806270334114656256
16  I posted 2 photos on Facebook in the album "[H...   806270328636928000
17  facebook memories are embarrassing rofl             806270324878622720
18  I posted a new photo to Facebook http://fb.me/...   806270321531621376
As you will see, the last tweet (id: 806270321531621376) is the last tweet outputted by twint when searching 'Facebook' with since='2016-12-06' and until='2016-12-07'.
The json from the last url in twint-request_urls.log contains only the following text (no 'tweet objects', as @cbjrobertson referred to them):
{"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}
This is the same text that is in the twint-last-request.log, and of course leads to nothing for the output of json_process.
Seems to me, then, that the error is on Twitter's end? But maybe I am missing something; I will let you two diagnose...
@arranjdavis does it fail after the same number of tweets, on each of these high volume requests you're making?
@cbjrobertson, that doesn't appear to be the case.
since='2016-12-02' and until='2016-12-07': total tweets is 6929
since='2016-12-02' and until='2016-12-06': total tweets is 27316
since='2016-12-02' and until='2016-12-04': total tweets is 1148
since='2016-12-06' and until='2016-12-07': total tweets is 55371
NVM. I misunderstood. Didn't realize you had pushed a change. Cheers!
Cole
And when you run Twitter advanced search with the parameters since='2016-12-02' and until='2016-12-04' (just because these have the lowest volume), does it return the same results and stop in the same place?
So, I've checked all four instances reported above and the Twitter advanced search seems to return the entire date span. Lots of scrolling, but all definitely go past where twint stopped, and the since='2016-12-02' and until='2016-12-04' search definitely goes to the end (i.e., 12:00am on 2016-12-02).
Other interesting observations:
1. For each span, I can't find the last outputted tweet from twint in the Twitter advanced search output (assuming both are returning UTC times; twint definitely is).
2. All of the final tweets outputted by twint (that I've checked) contained links either to Facebook or YouTube, but this could just be chance. I'd say more than half of returned tweets contained links (most to Facebook and YouTube). I don't think this is the issue, but thought I would mention it.
So, point 1 seems to be more relevant.
This is how I was doing the Twitter advanced search:
@pielco11 -- given @arranjdavis's above comment, do you have any idea what might be causing twitter to stop paging when twint is accessing it, and not on advanced search queries? I fear I have taken this issue as far as I can...
I'll take a look as soon as I'm back at my desk.
Thank you both for all your efforts! Really appreciated!
My pleasure! You've built a great tool that I use a lot, so it is nice to be able to contribute. Just let me know if I can do anything else!
So I searched with the query
import twint
c = twint.Config()
c.Search = "Facebook"
c.Lang = "en"
c.Store_csv = True
c.Output = "tweets.csv"
c.Since = "2016-12-02"
c.Until = "2016-12-04"
c.Debug = True
twint.run.Search(c)
But got only 1138 tweets
The last request url is http://twitter.com/i/search/timeline?f=tweets&vertical=default&src=unkn&include_available_features=1&include_entities=1&max_position=thGAVUV0VFVBaAwKih34rRrBYWgMCj2djE0qwWEjUAFQAlAAA=&reset_error_state=false&l=en&lang=en&q=%20Facebook%20since%3A2016-12-02%20until%3A2016-12-04
And stopped here
I tried adding timeouts to see if it was Twitter "blocking" requests, and found that the result does not change. So I decided to scroll manually, and found that Twitter does not return more tweets.
And the last tweet shown by the search is the same as the last scraped one. I kept trying to scroll down (arrows, page down); nothing worked. So I thought "maybe Twitter is blocking me, let me do a new search with other words", and new tweets did come.
So I do not think that Twitter is blocking our IP, nor the query, since I can scrape the same tweets again.
I tried config.Resume with the last tweet id and got nothing. Is the Resume feature broken? No, it's not, because when I passed the id of one of the later tweets I got the remaining tweets.
So I guess that Twitter is not playing well @arranjdavis @cbjrobertson
@pielco11 thanks for looking into that more. Strange that Twitter is only returning such a small number of tweets (at seemingly sporadic dates). Any idea as to why? Do you think this is intentional?
Let me know if there is anything further I can do to help!
I do not have the Twitter code, so I do not know why it's behaving this way.
Maybe their scoring function returns tweets that do not fully respect the request.
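If the search does return tweets that fall outside the requested window, one option is to filter them afterwards. A minimal sketch, assuming each row is a dict with a 'date' field in 'YYYY-MM-DD' form (for example, rows read from the CSV with csv.DictReader) and that `until` is exclusive; `in_window` is a hypothetical helper, not part of twint.

```python
from datetime import datetime

def in_window(rows, since, until, date_key="date"):
    """Keep rows whose date falls in [since, until). Dates are 'YYYY-MM-DD'."""
    fmt = "%Y-%m-%d"
    lo = datetime.strptime(since, fmt).date()
    hi = datetime.strptime(until, fmt).date()
    return [
        row for row in rows
        if lo <= datetime.strptime(row[date_key], fmt).date() < hi
    ]

rows = [
    {"id": "1", "date": "2016-12-31"},
    {"id": "2", "date": "2016-11-13"},  # outside the requested window
]
print(in_window(rows, "2016-12-01", "2017-01-01"))
```

This only cleans up the output; it does nothing about tweets that Twitter never returned.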
hi @pielco11 first, I wanted to thank you for this awesome module. Second, I wanted to use since in my code, something like c.since = '218-10-22 12:00:00', and I keep getting this error:
return datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S").strftime('%s')
ValueError: Invalid format string
What should I do?
c.since = '218-10-22 12:00:00'
@Ali-khavanin did you mean 2018?
Oh yeah. Sorry, bad typing 😬 c.Since = '2018-10-22 12:00:00' It gives that error.
Your issue seems to be related to #597 , is it correct?
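For context, `%s` is not among the format codes Python documents for `strftime`. It is a platform extension (available with glibc, for example), which may be why the line in the traceback raises `ValueError: Invalid format string` on some systems, such as Windows. A portable sketch of the same conversion follows. It assumes the string should be read as UTC, whereas the original line uses local time, and `to_epoch` is a hypothetical helper rather than twint's code.

```python
from datetime import datetime, timezone

def to_epoch(date_str, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a date string to seconds since the Unix epoch, read as UTC."""
    parsed = datetime.strptime(date_str, fmt)
    # Attach UTC explicitly so the result does not depend on the local timezone.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

print(to_epoch("2018-10-22 12:00:00"))  # 1540209600
```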
Issue Template
This is my first time doing this, so apologies for any mistakes in the way I am reporting this and thanks in advance for the help!
Initial Check
pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
Command Ran
scrape(terms = i,since=months[x][0],until=months[x][1],output=SAVE+str(i)+'_month_'+str(x+1)+'.csv')
with:
def scrape(terms, since, until, output):
    c = twint.Config()
    c.Search = terms
    c.Since = since
    c.Until = until
    c.Output = output
    c.Print = True
    c.Store_csv = True
    c.Limit = None
    c.Lang = 'en'
    twint.run.Search(c)
months[x][0] = '2016-12-01'
months[x][1] = '2017-01-01'
Description of Issue
I am trying to scrape tweets for a search term ('Pepsi') since '2016-12-01' until '2017-01-01'. However, the output files contain tweets from every day back to '2016-11-13' (at least; I stopped the script there), with the tweets beginning at '2016-12-31' in the output file. Then it moves on to the next period ('2017-01-01', '2017-02-01'), as it is supposed to. Not sure why! I've read in other resolved issues (https://github.com/twintproject/twint/issues/66) that Twitter search will sometimes return dates outside of the since/until window, but this seemed to suggest they would only be a day or so outside the window. Either way, it's not a huge issue, as I can clean the data up in the .csv, but I thought I would let you know! Thanks.
Environment Details