twintproject / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.
MIT License

Since and until issue (Tweets from before since date) #320

Closed arranjdavis closed 5 years ago

arranjdavis commented 5 years ago

Issue Template

This is my first time doing this, so apologies for any mistakes in the way I am reporting this and thanks in advance for the help!

Initial Check

Command Ran

scrape(terms = i,since=months[x][0],until=months[x][1],output=SAVE+str(i)+'_month_'+str(x+1)+'.csv')

with:

def scrape(terms, since, until, output):
    c = twint.Config()
    c.Search = terms
    c.Since = since
    c.Until = until
    c.Output = output
    c.Print = True
    c.Store_csv = True
    c.Limit = None
    c.Lang = 'en'
    twint.run.Search(c)

months[x][0] = '2016-12-01'
months[x][1] = '2017-01-01'
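For readers reproducing the report, the `months` list is not shown in the thread. The following is a minimal sketch of how such a list of (since, until) pairs could be built; the helper name `month_windows` is hypothetical and is not part of twint.

```python
from datetime import date


def month_windows(start_year, start_month, count):
    """Return `count` consecutive (since, until) pairs, one per calendar month.

    Each pair runs from the first day of a month to the first day of the
    next month, formatted as 'YYYY-MM-DD' strings.
    """
    windows = []
    year, month = start_year, start_month
    for _ in range(count):
        # Roll over to January of the next year after December.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        windows.append((date(year, month, 1).isoformat(),
                        date(next_year, next_month, 1).isoformat()))
        year, month = next_year, next_month
    return windows


months = month_windows(2016, 12, 2)
print(months)
```

With these arguments the first pair matches the window quoted in the report ('2016-12-01', '2017-01-01') and the second matches the next period mentioned below.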

Description of Issue

I am trying to scrape tweets for a search term ('Pepsi') since '2016-12-01' until '2017-01-01'. However, the output files contain tweets from every day back to (at least; I stopped the script there) '2016-11-13' (the tweets begin at '2016-12-31' in the output file). Then it moves on to the next period ('2017-01-01', '2017-02-01'), as it is supposed to. Not sure why! I've read in other resolved issues (https://github.com/twintproject/twint/issues/66) that Twitter search will sometimes return dates outside of the since/until window, but this seemed to suggest they would only be a day or so outside the window. Either way, it's not a huge issue, I can clean the data up in the .csv, but I thought I would let you know! Thanks.

Environment Details

Running in Anaconda on Mac OS High Sierra (version 10.13.6)

pielco11 commented 5 years ago

I tried and config.Since seems to be broken, thank you for reporting

varunu28 commented 5 years ago

@pielco11 I would like to help on this issue. Let me know if I can start working on it.

pielco11 commented 5 years ago

@varunu28 you can start whenever you want, thank you for your help!

arranjdavis commented 5 years ago

@pielco11 Yes, of course, thank you for twint! It is a great tool.

cbjrobertson commented 5 years ago

Any idea what's happening with this? Has twitter changed something? Thanks for all your work.

pielco11 commented 5 years ago

@cbjrobertson I did not check what's going on, yet.

cbjrobertson commented 5 years ago

Thanks:) This project quickly eclipsed my ability to help much with it, but would be happy to contribute if I could.

pielco11 commented 5 years ago

@cbjrobertson you could print the url of a simple request, then try the same query via Twitter Advanced Search and look for differences... it's like debugging, but no need for extra tools

cbjrobertson commented 5 years ago

@pielco11 -- sorry, but I'm a little confused. I printed a dummy search url by running with twint.Config.Debug = True, but when I entered it into twitter it just downloaded a json file... The file appears to contain a bunch of html for tweet objects. Is this the expected behaviour?

pielco11 commented 5 years ago

@cbjrobertson that's correct https://github.com/twintproject/twint/blob/e3c28aae6e7a7947d0aabb31854875aa942ba780/twint/feed.py#L40-L46

arranjdavis commented 5 years ago

Hi all (@pielco11, @cbjrobertson),

Thanks for chasing this up. I've found a workaround for this, which is to search the date span (e.g., July 1 to July 31), then get missing dates from the outputted tweets (e.g., July 1 to July 13), then search the missing dates as since and until dates, and then continue this process until the entire since - until date span is covered.

Obviously, that is not a long term solution. @pielco11, I'll have a go doing what you suggested to @cbjrobertson this week, and I'll get back to you if I make any progress. But, like him, this might be past my abilities!
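The gap-filling workaround described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `missing_ranges` is hypothetical, `until` is treated as exclusive, and a day counts as covered as soon as any tweet from it appears in the output, which does not guarantee that day is complete.

```python
from datetime import datetime, timedelta


def _parse(day):
    return datetime.strptime(day, "%Y-%m-%d").date()


def missing_ranges(since, until, seen_dates):
    """Return (since, until) pairs covering the days with no tweets in the output.

    `since` is inclusive and `until` exclusive; `seen_dates` holds the
    'YYYY-MM-DD' dates found in the scraped output.
    """
    start, end = _parse(since), _parse(until)
    seen = {_parse(d) for d in seen_dates}
    ranges, run_start, day = [], None, start
    while day < end:
        if day not in seen:
            if run_start is None:
                run_start = day  # a new gap begins here
        elif run_start is not None:
            ranges.append((run_start.isoformat(), day.isoformat()))
            run_start = None
        day += timedelta(days=1)
    if run_start is not None:
        ranges.append((run_start.isoformat(), end.isoformat()))
    return ranges


# Output covered July 14 to July 31, so July 1 to July 13 must be searched again.
seen = [f"2016-07-{d:02d}" for d in range(14, 32)]
print(missing_ranges("2016-07-01", "2016-08-01", seen))
```

Each returned pair can be passed back in as the next since and until values, repeating until no gaps remain.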

pielco11 commented 5 years ago

#368

cbjrobertson commented 5 years ago

@arranjdavis I tried testing some different search terms and couldn't replicate your issue. Can you run one of the search terms that's causing your issue with c.Debug = True then use the following function to extract the tweets & ids, and check against twitter advanced search response for the same query?

# dependencies
import json
import re

import pandas as pd
from bs4 import BeautifulSoup

def json_process(path):
    # Load a saved response (twint-last-request.log or a downloaded json file)
    with open(path, 'r') as handle:
        req = json.load(handle)

    # The tweets are delivered as an HTML fragment under 'items_html'
    soup = BeautifulSoup(req['items_html'], features='lxml')
    tweets = soup.find_all('div', {'data-component-context': 'tweet'})
    texts = soup.find_all('p', class_=re.compile('^TweetTextSize'))

    # Collect texts and ids in document order
    d = {'text': [text.get_text() for text in texts],
         'id': [tweet.get('data-tweet-id') for tweet in tweets]}

    # Note: pandas raises a ValueError if the two lists differ in length
    return pd.DataFrame(d)

@pielco11 am aware this replicates your Json function. I'd forgotten you'd provided it and wrote a roughly similar thing 🤷‍♂️

This function takes a path either to twint-last-request.log or to the json that downloads when you enter any of the urls in twint-request_urls.log to twitter.

arranjdavis commented 5 years ago

Okay so a few things to note:

  1. I am able to replicate the bug, as in, if I search 'Facebook' with since='2016-12-06' and until='2016-12-07' the script stops at the same tweet as when I search 'Facebook' with since='2016-12-02' and until='2016-12-07'.

  2. The script dies without any error messages. The twint-last-request.log file contains one line: {"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}

  3. The information for the last tweet (the last outputted tweet) is:

Tweet id: 806270321531621376
Date: 2016-12-06
Time: 22:53:19 GMT
User ID: 594256947
Username: cherylaskswhy
Tweet: "I posted a new photo to Facebook http://fb.me/8l2l68YER "

Okay, that is all I can do today, but I will pick this up tomorrow, when I will follow @cbjrobertson's suggestion and check my results against Twitter's advanced search output.
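The single line in twint-last-request.log quoted above can be recognised mechanically as a "no more results" page. A minimal sketch, assuming the response keeps the `has_more_items` and `items_html` keys shown in that line; the function name `is_empty_last_page` is hypothetical and not part of twint.

```python
import json

# The response body quoted in the comment above, kept as a raw string so the
# JSON escapes (\n) are decoded by json.loads rather than by Python.
SAMPLE = r'{"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}'


def is_empty_last_page(raw):
    """True if the response says there is nothing more and carries no tweet HTML."""
    page = json.loads(raw)
    no_more = not page.get("has_more_items", False)
    blank_html = page.get("items_html", "").strip() == ""
    return no_more and blank_html


print(is_empty_last_page(SAMPLE))
```

A page like this only shows that the server stopped paging; it does not say why.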

cbjrobertson commented 5 years ago

I have tried to replicate with a smaller search criteria (by limiting the user to the one from the tweet mentioned above; script below), and it collects all the tweets on the days specified (more tweets, actually, than using the same criteria in the twitter advanced search function). Maybe it is something to do with really high volumes of tweets.

import twint
c = twint.Config()
c.Until = '2016-12-07'
c.Since = '2016-12-06'
c.Search = 'Facebook'
c.Username = 'cherylaskswhy'
c.Debug = True
twint.run.Search(c)

arranjdavis commented 5 years ago

@cbjrobertson yeah, high tweet volumes could be a possibility. I also wonder if it is the next tweet that is causing the script to stop? Problem is that there is no way of telling what that tweet is (that I can think of, at least).

Another thing to note is that when I do the same search on the Twitter Advanced Search website (https://twitter.com/search?l=en&q=Facebook%20since%3A2016-12-06%20until%3A2016-12-07&src=typd&lang=en-gb) I can't even find the tweet I mentioned above (tweet id: 806270321531621376), and I am pretty sure I am not missing it.

Next step is to get a json file to pass to your function, @cbjrobertson, but the problem is that all of the urls in twint-request_urls.log are the same for me. They are all: https://twitter.com/i/search/timeline, and link to 'File not found' error.

My Python version is 3.6.2, and I updated twint today with pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

cbjrobertson commented 5 years ago

@pielco11

w.r.t. -- "Next step is to get a json file to pass to your function, @cbjrobertson, but the problem is that all of the urls in twint-request_urls.log are the same for me. They are all: https://twitter.com/i/search/timeline, and link to 'File not found' error."

The same thing is happening to me. It makes it pretty hard to debug.

Reproducible script:

pip3 install --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint
import twint
c = twint.Config()
c.Until = '2016-12-07'
c.Since = '2016-12-06'
c.Search = 'Facebook'
c.Username = 'cherylaskswhy'
c.Debug = True
twint.run.Search(c)

outcome

twint-request_urls.log looks like:

http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline
http://twitter.com/i/search/timeline

desired outcome

twint-request_urls.log should be comprised of entries which look (something like) this:

https://twitter.com/i/search/timeline?f=tweets&vertical=default&l=en&lang=en&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&qf=off&max_position=thGAVUV0VFVBaAgKn1l-806128531029696512&q=from%3Acherylaskswhy%20Facebook%20since%3A2016-12-04%20until%3A2016-12-07
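To make the structure of such an entry easier to see, here is a minimal sketch that composes a similar URL with the standard library. It is not twint's own code: the function name `build_search_url` is hypothetical, and only a few of the parameters visible in the example URL are included.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://twitter.com/i/search/timeline"


def build_search_url(term, since, until, username=None, max_position=None):
    """Compose a search-timeline URL whose `q` parameter carries the query."""
    parts = []
    if username:
        parts.append(f"from:{username}")
    parts += [term, f"since:{since}", f"until:{until}"]
    params = {"f": "tweets", "vertical": "default"}
    if max_position:
        # Paging cursor returned by the previous response.
        params["max_position"] = max_position
    params["q"] = " ".join(parts)
    # quote_via=quote encodes spaces as %20 and ':' as %3A, as in the log entry.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("Facebook", "2016-12-04", "2016-12-07",
                       username="cherylaskswhy"))
```

The printed URL ends with the same `q=from%3Acherylaskswhy%20Facebook%20since%3A2016-12-04%20until%3A2016-12-07` segment as the entry above.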

FWIW, this does not appear to be a bug in twint 1.1.4.3. When I run in an env with that installed, I get full request urls. (Though that version suffers from #249).

This seems like a separate bug. Would you like me to submit another issue?

pielco11 commented 5 years ago

Not so good and quite unexpected. I'm going to fix this later

Thanks for reporting!

pielco11 commented 5 years ago

When we changed the way Twint creates the requests for Twitter, we changed the query structure as well. Now it's "sanitized" and "human readable", so it now looks as expected and wanted @cbjrobertson

cbjrobertson commented 5 years ago

@pielco11 ok, but how do you use it to debug then? It doesn’t download json files as before, and all the requests look identical. What is the rationale behind this change?

pielco11 commented 5 years ago

Previously it was returning only the baseurl (http://twitter.com/i/search/timeline) and not the composed query. Now, it creates the full query and stores it in twint-request_urls.log.

Requests might seem the same, but they are not:

image

Plus you are still able to download jsons: http://twitter.com/i/search/timeline?f=tweets&vertical=default&src=unkn&include_available_features=1&include_entities=1&max_position=thGAVUV0VFVBaAwLblh9Ld4h4WgoC1uceioOUeEjUAFQAlAAA=&reset_error_state=false&q=%20from%3Anoneprivacy
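Since the screenshots are not preserved here, one way to see how two logged requests differ is to decode their query strings. A minimal sketch using only the standard library; the helper name `describe_request` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def describe_request(url):
    """Return the decoded search query and paging cursor of a request URL."""
    query = parse_qs(urlsplit(url).query)
    return {
        "q": query.get("q", [""])[0],
        "max_position": query.get("max_position", [None])[0],
    }


# A bare base URL carries neither a query nor a cursor.
print(describe_request("http://twitter.com/i/search/timeline"))

# The URL from the comment above carries both.
print(describe_request(
    "http://twitter.com/i/search/timeline?f=tweets&vertical=default&src=unkn"
    "&include_available_features=1&include_entities=1"
    "&max_position=thGAVUV0VFVBaAwLblh9Ld4h4WgoC1uceioOUeEjUAFQAlAAA="
    "&reset_error_state=false&q=%20from%3Anoneprivacy"
))
```

Successive entries in twint-request_urls.log for one search would be expected to share the same `q` and differ in `max_position`.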

image

cbjrobertson commented 5 years ago

I'm confused. I am telling you that it is presently only returning the base url on my system and on @arranjdavis's...

pielco11 commented 5 years ago

I guess I'm confused too. I pushed a change to save the full url, as shown in the previous comment. So you have to run pip+git or pull the repo to get the patch.

arranjdavis commented 5 years ago

I ran pip3 install twint --upgrade and I am now getting the full urls.

Thanks @pielco11 for the update, and to both for your help with this. I will now use the function that @cbjrobertson wrote to try to find the tweets that are causing the script to stop before the correct since date is reached.

arranjdavis commented 5 years ago

Okay, so I downloaded the json files from the last and second to last urls in twint-request_urls.log

Using the function json_process() from @cbjrobertson I get the following output for the json from the second to last url in twint-request_urls.log:

    text                                                id
0   I posted a new video to Facebook http://fb.me/...   806270367459373056
1   My parents have facebook instead of me?..... l...   806270366134009856
2   Would three of my Facebook friends please copy...   806270363982241793
3   The hunt is on in Apex Legends Season 1. New L...   1107790127907004417
4   I posted a new photo to Facebook http://fb.me/...   806270363478814720
5   I don't go on Facebook exposing y'all for the ...   806270362094813184
6   I posted a new photo to Facebook http://fb.me/...   806270358957473792
7   I posted a new photo to Facebook http://fb.me/...   806270356755533824
8   Thanks for posting those puppies on Facebook n...   806270349415481344
9   I posted a new video to Facebook http://fb.me/...   806270346798268416
10  Y'all have 6 hours to wish me a happy birthday...   806270344805908480
11  I wouldn't have said anything but these commen...   806270342045913088
12  Tuesday Empowerment Night will be held on Face...   806270339516923909
13  Is @instagram Facebook now? Or is it Youtube? ...   806270337621102594
14  and after looking at Facebook, I was right.         806270335947509760
15  I posted 41 photos on Facebook http://fb.me/15...   806270334114656256
16  I posted 2 photos on Facebook in the album "[H...   806270328636928000
17  facebook memories are embarrassing rofl             806270324878622720
18  I posted a new photo to Facebook http://fb.me/...   806270321531621376

As you will see, the last tweet (id: 806270321531621376) is the last tweet output by twint when searching 'Facebook' with since='2016-12-06' and until='2016-12-07'.

The json from the last url in twint-request_urls.log contains only the following text (no 'tweet objects', as @cbjrobertson referred to them):

{"min_position":"thGAVUV0VFVBaCgLHxg6W5sBYWgICx0fP1wLAWEjUAFQAlAAA=","has_more_items":false,"items_html":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n","new_latent_count":0,"focused_refresh_interval":30000}

This is the same text that is in the twint-last-request.log, and of course leads to nothing for the output of json_process.

Seems to me, then, that the error is on Twitter's end? But maybe I am missing something; I will let you two diagnose...

cbjrobertson commented 5 years ago

@arranjdavis does it fail after the same number of tweets on each of these high-volume requests you're making?

arranjdavis commented 5 years ago

@cbjrobertson, that doesn't appear to be the case.

since='2016-12-02' and until='2016-12-07': total tweets is 6929
since='2016-12-02' and until='2016-12-06': total tweets is 27316
since='2016-12-02' and until='2016-12-04': total tweets is 1148
since='2016-12-06' and until='2016-12-07': total tweets is 55371

cbjrobertson commented 5 years ago

NVM. I misunderstood. Didn't realize you had pushed a change. Cheers!

Cole

cbjrobertson commented 5 years ago

And when you run twitter advanced search with the parameters since='2016-12-02' and until='2016-12-04' (just because these have the lowest volume), does it return the same results and stop in the same place?

arranjdavis commented 5 years ago

So, I've checked all four instances reported above and the Twitter advanced search seems to return the entire date span. Lots of scrolling, but all definitely go past where twint stopped, and the since='2016-12-02' and until='2016-12-04' search definitely goes to the end (i.e., 12:00am on 2016-12-02).

Other interesting observations:

  1. For each span, I can't find the last outputted tweet from twint in the Twitter advanced search output (assuming both are returning UTC times - twint definitely is).

  2. All of the final tweets outputted by twint (that I've checked) contained links either to Facebook or YouTube, but this could just be chance - I'd say more than half of returned tweets contained links (most to Facebook and YouTube). I don't think this is the issue, but thought I would mention.

So, point 1 seems to be more relevant.
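The UTC assumption in point 1 can be checked independently of both tools, because tweet ids of this era embed their creation time. A minimal sketch, assuming Twitter's publicly described Snowflake layout (millisecond timestamp in the bits above the lowest 22, counted from the epoch constant below); the function name `tweet_id_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Snowflake epoch in milliseconds (assumption: standard Twitter Snowflake ids).
TWITTER_EPOCH_MS = 1288834974657


def tweet_id_to_utc(tweet_id):
    """Decode the creation time (UTC) embedded in a Snowflake tweet id."""
    ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return (datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
            + timedelta(milliseconds=ms % 1000))


# The last tweet twint returned for the 'Facebook' search discussed earlier.
print(tweet_id_to_utc(806270321531621376).strftime("%Y-%m-%d %H:%M:%S"))
```

For that id the decoded time is 2016-12-06 22:53:19, which agrees with the date and GMT time twint reported for the same tweet.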

This is how I was doing the Twitter advanced search:

Screen Shot 2019-03-25 at 1 45 30 PM

cbjrobertson commented 5 years ago

@pielco11 -- given @arranjdavis's above comment, do you have any idea what might be causing twitter to stop paging when twint is accessing it, and not on advanced search queries? I fear I have taken this issue as far as I can...

pielco11 commented 5 years ago

I'll take a look as soon as I'm back at my desk

Thank you both for all your efforts! Really appreciated!

arranjdavis commented 5 years ago

My pleasure! You've built a great tool that I use a lot, so it is nice to be able to contribute. Just let me know if I can do anything else!

pielco11 commented 5 years ago

So I searched with the query

import twint

c = twint.Config()
c.Search = "Facebook"
c.Lang = "en"
c.Store_csv = True
c.Output = "tweets.csv"
c.Since = "2016-12-02"
c.Until = "2016-12-04"
c.Debug = True

twint.run.Search(c)

But I got only 1138 tweets:

image

The last request url is http://twitter.com/i/search/timeline?f=tweets&vertical=default&src=unkn&include_available_features=1&include_entities=1&max_position=thGAVUV0VFVBaAwKih34rRrBYWgMCj2djE0qwWEjUAFQAlAAA=&reset_error_state=false&l=en&lang=en&q=%20Facebook%20since%3A2016-12-02%20until%3A2016-12-04

And it stopped here:

image

I tried adding timeouts to see whether Twitter was "blocking" requests, but the result did not change. So I decided to scroll manually, and found that Twitter does not return more tweets:

image

And the last tweet shown in the search is the same as the last scraped one. I kept trying to scroll down (arrows, page down), but nothing worked. So I thought "maybe Twitter is blocking me, let me do a new search with other words", and new tweets did come.

So I do not think that Twitter is blocking our IP, nor the query, since I can scrape the same tweets again.

I tried config.Resume with the last tweet id and got nothing. Is the Resume feature broken? No, it's not, because when I passed the id of one of the latest tweets I got the other tweets:

image

So I guess that Twitter is not playing well @arranjdavis @cbjrobertson

arranjdavis commented 5 years ago

@pielco11 thanks for looking into that more. Strange that Twitter is only returning such a small number of tweets (at seemingly sporadic dates). Any idea as to why? Do you think this is intentional?

Let me know if there is anything further I can do to help!

pielco11 commented 5 years ago

I do not have the Twitter code, so I do not know why it's behaving this way.

Maybe their scoring function returns tweets that do not fully respect the request.

Ali-khavanin commented 4 years ago

Hi @pielco11, first, I wanted to thank you for this awesome module. Second, I wanted to use since in my code, something like c.since = '218-10-22 12:00:00', and I keep getting this error:

return datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S").strftime('%s')
ValueError: Invalid format string

what should I do?
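A possible explanation, offered as an assumption rather than a diagnosis: `%s` is not among the format codes Python documents for `strftime`; it is a platform extension, and on platforms that lack it the call can fail with exactly this `ValueError`. The sketch below shows a portable way to get epoch seconds from the same kind of string. It is not twint's code, the name `to_epoch_seconds` is hypothetical, and it treats the input as UTC, whereas `strftime('%s')` (where available) uses local time.

```python
from datetime import datetime, timezone


def to_epoch_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return seconds since the Unix epoch.

    Assumption: the string is interpreted as UTC.
    """
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(to_epoch_seconds("2018-10-22 12:00:00"))
```

`datetime.timestamp()` is part of the standard library on every platform, so this form avoids depending on the C library's `strftime` extensions.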

pielco11 commented 4 years ago

c.since = '218-10-22 12:00:00'

@Ali-khavanin did you mean 2018?

Ali-khavanin commented 4 years ago

Oh yeah. Sorry, bad typing 😬

c.Since = '2018-10-22 12:00:00'

It gives that error.

pielco11 commented 4 years ago

Your issue seems to be related to #597, is that correct?