taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License
2.41k stars 579 forks

Want to retrieve the tweets account/handle wise not keyword wise #120

Closed NileshJorwar closed 6 years ago

NileshJorwar commented 6 years ago

I want to retrieve all tweets by account rather than by keyword. Say @amazon has tweeted 27.8K tweets to date: how do I retrieve all the tweets made by amazon, rather than the tweets that merely contain the keyword "amazon"? I tried using the advanced search query in INIT_URL as "https://twitter.com/search?l=&q=from%3A{q}&src=typd&lang=en", but could not find a matching reload URL. This option also does not give me the whole tweets, so I would need to modify tweet.py to extract the tweet data with BeautifulSoup tags in the from_soup method.

lapp0 commented 6 years ago

make the query from:amazon, no need to mess with the INIT_URL.
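
For example, with the stock library something like this should work (a minimal sketch; the limit value and printed attributes are only illustrative, assuming the Tweet objects expose text and timestamp as in this version of the library):

    from twitterscraper.query import query_tweets

    # 'from:amazon' restricts the search to tweets posted by the @amazon handle
    tweets = query_tweets("from:amazon", limit=1000)
    for tweet in tweets[:5]:
        print(tweet.timestamp, tweet.text)

Or, from the command line: twitterscraper "from:amazon" --output=amazon.json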

NileshJorwar commented 6 years ago

The URL itself contains "from", or I can change the query to include "from"; that isn't the issue, I think. The problem is the reload URL, which fails because I changed INIT_URL from "https://twitter.com/search?f=tweets&vertical=default&q={q}&l={lang}" to "https://twitter.com/search?l=&q=from%3A{q}&src=typd&lang=en". The new URL is Twitter's advanced query option, which gives me tweets from a particular handle, but not all of them. Hence I need a matching reload URL for when the first one does not work.

lapp0 commented 6 years ago

Sorry, I'm not really understanding what you're saying. Could you describe your goal and why you aren't just passing from:amazon as the query?

Regarding your comment here https://github.com/taspinar/twitterscraper/issues/118#issuecomment-398834780 are you sure 200,000 tweets exist? Are you sure you aren't being cut off prematurely because there aren't 200,000 tweets that match that search in existence?

NileshJorwar commented 6 years ago

@lapp0 I tried, as suggested, including "from" in the query as from:q. It worked, but it did not return all the tweets from the given user handle: I tried to retrieve all the tweets made by the amazon Twitter handle but received only 780 tweets out of ~50K. You can see this in the screenshot below. I want to get all the tweets made by amazon.

(screenshot attached)

lapp0 commented 6 years ago

Indeed, that is strange. Just to confirm, you are running this command?

twitterscraper from:amazon --output=amazon.json

NileshJorwar commented 6 years ago

Actually, I modified main.py to set the parameters inside the program instead of passing command-line arguments, as shown below. The rest of the Python scripts are unchanged, except for a few changes in query.py as suggested in the latest fixes: I removed fake_useragent, and, as you suggested @lapp0, modified the URLs to include "from:" in the query. The changed parts are called out in "# Changed" comments below. Please look them over and advise.

    # -*- coding: utf-8 -*-
    """
    Created on Thu May 24 11:53:26 2018

    @author: Nilesh Jorwar
    """

    import sys
    import json
    import logging
    import collections
    import datetime as dt
    import csv
    from os.path import isfile

    from twitterscraper.query import query_tweets


    class JSONEncoder(json.JSONEncoder):
        def default(self, obj):
            if hasattr(obj, '__json__'):
                return obj.__json__()
            elif isinstance(obj, collections.Iterable):
                return list(obj)
            elif isinstance(obj, dt.datetime):
                return obj.isoformat()
            elif hasattr(obj, '__getitem__') and hasattr(obj, 'keys'):
                return dict(obj)
            elif hasattr(obj, '__dict__'):
                return {member: getattr(obj, member)
                        for member in dir(obj)
                        if not member.startswith('_')
                        and not hasattr(getattr(obj, member), '__call__')}
            return json.JSONEncoder.default(self, obj)


    def valid_date(s):
        try:
            return dt.datetime.strptime(s, "%Y-%m-%d").date()
        except ValueError:
            print("Not a valid date: '{0}'.".format(s))


    def main():
        logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
        try:
            # Changed: read the Twitter handles from a CSV file and loop over them,
            # instead of taking a single query from the command line.
            with open('coname_twitter_account_users.csv', 'r', newline='') as myFile:
                readInput = csv.reader(myFile)
                for i, company in enumerate(readInput):
                    if i > 0 and company[1]:
                        print(company[1])
                        output = company[1] + '.json'
                        if isfile(output):
                            logging.error("Output file already exists! Aborting.")
                            continue

                        query1 = company[1]
                        # query1 = 'amazon'
                        limit1 = 100000
                        begindateString = '2006-03-21'
                        begindate1 = valid_date(begindateString)
                        enddate1 = dt.date.today()
                        poolsize1 = 20
                        lang1 = 'en'

                        tweets = query_tweets(query=query1, limit=limit1,
                                              begindate=begindate1, enddate=enddate1,
                                              poolsize=poolsize1, lang=lang1)

                        if tweets:
                            with open(output, "w") as output:
                                json.dump(tweets, output, cls=JSONEncoder)
        except KeyboardInterrupt:
            logging.info("Program interrupted by user. Quitting...")


    if __name__ == "__main__":
        main()

Query.py

    from __future__ import division
    import random
    import requests
    import datetime as dt
    import json
    from functools import partial
    from multiprocessing.pool import Pool

    from twitterscraper.tweet import Tweet
    from twitterscraper.logging import logger

    # Changed: fixed list of user agents (fake_useragent removed)
    ua1 = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
    ua2 = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
    ua3 = "Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13"
    ua4 = "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
    ua5 = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"
    ua6 = "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16"
    ua7 = "Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre"
    HEADERS_LIST = [ua1, ua2, ua3, ua4, ua5, ua6, ua7]

    # Changed: 'from:' added to the query parameter of both URLs
    INIT_URL = "https://twitter.com/search?f=tweets&vertical=default&q=from:{q}&l={lang}"
    RELOAD_URL = "https://twitter.com/i/search/timeline?f=tweets&vertical=" \
                 "default&include_available_features=1&include_entities=1&" \
                 "reset_error_state=false&src=typd&max_position={pos}&q=from:{q}&l={lang}"

    # Added: linspace() helper to split the date range into equal parts
    def linspace(start, stop, n):
        if n == 1:
            yield stop
            return
        h = (stop - start) / (n - 1)
        for i in range(n):
            yield start + h * i

    def query_single_page(url, html_response=True, retry=10):
        """
        Returns tweets from the given URL.

        :param url: The URL to get the tweets from
        :param html_response: False, if the HTML is embedded in a JSON
        :param retry: Number of retries if something goes wrong.
        :return: The list of tweets, the pos argument for getting the next page.
        """
        headers = {'User-Agent': random.choice(HEADERS_LIST)}

        try:
            response = requests.get(url, headers=headers)
            if html_response:
                html = response.text or ''
            else:
                html = ''
                try:
                    json_resp = json.loads(response.text)
                    html = json_resp['items_html'] or ''
                except ValueError as e:
                    logger.exception('Failed to parse JSON "{}" while requesting "{}"'.format(e, url))

            tweets = list(Tweet.from_html(html))

            if not tweets:
                return [], None

            if not html_response:
                return tweets, json_resp['min_position']

            return tweets, "TWEET-{}-{}".format(tweets[-1].id, tweets[0].id)
        except requests.exceptions.HTTPError as e:
            logger.exception('HTTPError {} while requesting "{}"'.format(e, url))
        except requests.exceptions.ConnectionError as e:
            logger.exception('ConnectionError {} while requesting "{}"'.format(e, url))
        except requests.exceptions.Timeout as e:
            logger.exception('TimeOut {} while requesting "{}"'.format(e, url))
        except json.decoder.JSONDecodeError as e:
            logger.exception('Failed to parse JSON "{}" while requesting "{}".'.format(e, url))

        if retry > 0:
            logger.info("Retrying... (Attempts left: {})".format(retry))
            return query_single_page(url, html_response, retry - 1)

        logger.error("Giving up.")
        return [], None

    def query_tweets_once(query, limit=None, lang=''):
        """
        Queries twitter for all the tweets you want! It will load all pages
        it gets from twitter. However, twitter might out of a sudden stop
        serving new pages; in that case, use the query_tweets method.

        Note that this function catches the KeyboardInterrupt so it can
        return tweets on incomplete queries if the user decides to abort.

        :param query: Any advanced query you want to do! Compile it at
                      https://twitter.com/search-advanced and just copy the query!
        :param limit: Scraping will be stopped when at least limit number of
                      items are fetched.
        :param num_tweets: Number of tweets fetched outside this function.
        :return: A list of twitterscraper.Tweet objects. You will get at least
                 limit number of items.
        """
        logger.info("Querying {}".format(query))
        query = query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")
        pos = None
        tweets = []
        try:
            while True:
                new_tweets, pos = query_single_page(
                    INIT_URL.format(q=query, lang=lang) if pos is None
                    else RELOAD_URL.format(q=query, pos=pos, lang=lang),
                    pos is None
                )
                if len(new_tweets) == 0:
                    logger.info("Got {} tweets for {}.".format(len(tweets), query))
                    return tweets

                tweets += new_tweets

                if limit and len(tweets) >= limit:
                    logger.info("Got {} tweets for {}.".format(len(tweets), query))
                    return tweets
        except KeyboardInterrupt:
            logger.info("Program interrupted by user. Returning tweets gathered "
                        "so far...")
        except BaseException:
            logger.exception("An unknown error occurred! Returning tweets "
                             "gathered so far.")
        logger.info("Got {} tweets for {}.".format(len(tweets), query))
        return tweets

    def eliminate_duplicates(iterable):
        """
        Yields all unique elements of an iterable sorted. Elements are
        considered non unique if the equality comparison to another element
        is true. (In those cases, the set conversion isn't sufficient as it
        uses identity comparison.)
        """
        class NoElement: pass

        prev_elem = NoElement
        for elem in sorted(iterable):
            if prev_elem is NoElement:
                prev_elem = elem
                yield elem
                continue

            if prev_elem != elem:
                prev_elem = elem
                yield elem

    def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21),
                     enddate=dt.date.today(), poolsize=20, lang=''):
        no_days = (enddate - begindate).days
        if poolsize > no_days:
            # Since we are assigning each pool a range of dates to query,
            # the number of pools should not exceed the number of dates.
            poolsize = no_days
        dateranges = [begindate + dt.timedelta(days=elem)
                      for elem in linspace(0, no_days, poolsize + 1)]

        if limit:
            limit_per_pool = (limit // poolsize) + 1
        else:
            limit_per_pool = None

        queries = ['{} since:{} until:{}'.format(query, since, until)
                   for since, until in zip(dateranges[:-1], dateranges[1:])]

        all_tweets = []
        try:
            pool = Pool(poolsize)

            try:
                i = 0
                for new_tweets in pool.imap_unordered(
                        partial(query_tweets_once, limit=limit_per_pool, lang=lang),
                        queries):
                    all_tweets.extend(new_tweets)
                    i = i + 1
                    logger.info("{}: Got {} tweets ({} new).".format(
                        i, len(all_tweets), len(new_tweets)))
            except KeyboardInterrupt:
                logger.info("Program interrupted by user. Returning all tweets "
                            "gathered so far.")
        finally:
            pool.close()
            pool.join()

        return all_tweets
NileshJorwar commented 6 years ago

main.py changes are in main(); the function is exactly as posted above.

Query.py Changes

  1. Removed fake_useragent.
  2. Added the fixed user-agent list, as provided in @taspinar's fix.
  3. Added the linspace() method.
  4. Changed INIT_URL and RELOAD_URL to include "from:" in the query.

The full query.py with these changes applied is identical to the version posted above.
lapp0 commented 6 years ago

@NileshJorwar thanks for your work, but this is really difficult to read. Could you make a pull request instead, or at least upload the files so I can diff it?

NileshJorwar commented 6 years ago

@lapp0 I don't have access to upload the files or make a pull request. Is there a way I can email you the files to review, or upload them to a GitHub account?

lapp0 commented 6 years ago

You can attach files to github comments. You also have access to make a pull request, or you could just push your branch.

NileshJorwar commented 6 years ago

main.txt query.txt

taspinar commented 6 years ago

@NileshJorwar, user 'amazon' currently has 27.7K tweets. The command twitterscraper "from:amazon" -o amazon.json results in a JSON file with 25,465 tweets; see the attachment. I can't tell why you don't get all of the 27K tweets, especially since I don't know the contents of coname_twitter_account_users.csv. It could be due to many things.
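
For reference, a quick way to count what ended up in the output file (a small sketch; assumes the file is the single JSON list that the CLI, or the JSONEncoder above, writes):

    import json

    # load the scraped tweets and count them
    with open('amazon.json') as f:
        tweets = json.load(f)
    print(len(tweets))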

PS: The proper way to make changes is to fork a project and change your local version. See here

amazon.json.zip

NileshJorwar commented 6 years ago

coname_twitter_account_users.csv is just an input file with a list of Twitter handles (see screenshot).

I am using the CSV file to iterate through the Twitter handles and get the tweets for all the accounts.
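
For illustration, the loop above expects something like the following (hypothetical contents, since the real file is only shown in a screenshot); the script reads column index 1, so the handle has to be in the second column, and the first row is skipped as a header:

    company_name,twitter_handle
    Amazon,amazon
    Microsoft,Microsoft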

taspinar commented 6 years ago

Can you try it without the '@' ?

NileshJorwar commented 6 years ago

I tried without the '@' only. Here is a snapshot of the queries formed:

(screenshot attached)

NileshJorwar commented 6 years ago

I tried downloading the current twitterscraper project to my system and ran the following query; I got 11K tweets made by amazon, where it is supposed to be 27K.

(screenshots attached)

lapp0 commented 6 years ago

Again, it looks like you're not using the latest version. Please do pip freeze and share the version of twitterscraper you are using.

NileshJorwar commented 6 years ago

twitterscraper==0.6.1 is the version listed.

NileshJorwar commented 6 years ago

However, I upgraded twitterscraper to the latest version (twitterscraper==0.7.1) and downloaded the tweets again. With the following JSON error, I got a count somewhat closer to the actual number of tweets, say 20K out of 27K.

(screenshot attached)

lapp0 commented 6 years ago

Anaconda has a bug where it may result in multiple versions of a package being installed. I also notice that you aren't getting the "retry" log line after a JSONDecodeError. This indicates to me that you are still running an old version.

Could you please

1) run pip uninstall twitterscraper until pip freeze and pip3 freeze both don't show twitterscraper

2) pip install twitterscraper

3) try again
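
Afterwards, one way to confirm which copy of the package a given interpreter actually imports (a small diagnostic sketch, useful when Anaconda and a standalone Python are both installed):

    import sys
    import pkg_resources
    import twitterscraper

    print(sys.executable)            # which Python interpreter is running
    print(twitterscraper.__file__)   # which copy of the package got imported
    print(pkg_resources.get_distribution('twitterscraper').version)  # installed version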

NileshJorwar commented 6 years ago

@lapp0 I did as you said, but whenever I run pip3 install twitterscraper, the packages inside the Anaconda3 folder get updated/installed while the packages under Python python36-32 do not. Why is that? The snapshot below is from after I ran the steps above on my system (screenshot attached).

However, when I tried the same steps on another system (one with a higher configuration), I got the following errors:

(screenshot attached)

taspinar commented 6 years ago

Hi @NileshJorwar @lapp0, I believe this issue and issues #124 and #125 are related to and caused by issue #92.

That is, the search page does not include retweets, while the profile page does. That is why there is a discrepancy between these two numbers.

If you look at the profile page of JunckerEU you will see there are (supposed to be) 1689 tweets. The search page, however, only has 1135 tweets. All of the 1135 tweets are scraped by twitterscraper, but of course this number is not equal to 1689. I did not do a full analysis of which tweets are not included in the search page, but the first few retweets do not appear, so I'm assuming it is the retweets.

My suggestion is to either fix issue #92 by finding a keyword that includes retweets in the search page, or to implement a command-line argument -f / --from that makes twitterscraper scrape someone's profile page.

lapp0 commented 6 years ago

@taspinar twitterscraper "from:amazon" yielded 25,477 tweets for me three times in a row. So while some tweets are missing because they are retweets, he is still short ~5k tweets that he should have been getting.

lapp0 commented 6 years ago

I did some research on retweets. It seems that none of the keywords that allowed searching for retweets work anymore as of July 2017 :(

There may be some search logic that allows it, but a few minutes of research yielded nothing that worked.

lapp0 commented 6 years ago

This library gets tweets from profiles rather than search: https://github.com/kennethreitz/twitter-scraper

taspinar commented 6 years ago

@lapp0 @NileshJorwar I did some more digging and found out that the tweet count shown on someone's profile page does not have to match the actual number of tweets, especially if there was some kind of mass deletion. See here.

This is also visible on the page of @JunckerEU: his profile mentions 1689 tweets, but if you scroll all the way down on his page (so that all of his tweets have been loaded) and then search the source code for js-stream-item, you can only count ~800 tweets.
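
For anyone who wants to reproduce that count, a rough sketch (assumes the fully scrolled profile page was saved locally, e.g. as juncker.html, and that BeautifulSoup is installed):

    from bs4 import BeautifulSoup

    with open('juncker.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # every rendered tweet sits in an element with the 'js-stream-item' class
    print(len(soup.select('.js-stream-item')))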

But there is also good news: I will release a new version that can scrape all tweets from a profile page, including retweets of other people.

NileshJorwar commented 6 years ago

I incorporated the changes made above into my originally downloaded files, and I also tried to download/update the twitterscraper package using pip install twitterscraper, but it still doesn't give me all the tweets.