Closed — NileshJorwar closed this issue 6 years ago.
Make the query `from:amazon`; no need to mess with the INIT_URL.
The URL itself contains `from`, or else I can also change the query to include "from"; that isn't really the issue, I think. The problem is the reload URL, which fails because I changed INIT_URL from "https://twitter.com/search?f=tweets&vertical=default&q={q}&l={lang}" to "https://twitter.com/search?l=&q=from%3A{q}&src=typd&lang=en". The new URL is Twitter's advanced-search option, which gives me the tweets of a particular Twitter handle, but not all of the tweets. Hence I need a reload URL in case the first one does not work.
Sorry, I'm not really understanding what you're saying. Could you describe your goal, and why you aren't just passing `from:amazon` as the query?
Regarding your comment here https://github.com/taspinar/twitterscraper/issues/118#issuecomment-398834780 are you sure 200,000 tweets exist? Are you sure you aren't being cut off prematurely because there aren't 200,000 tweets that match that search in existence?
@lapp0 I tried as suggested and included "from" in the query (as from:{q}). It worked, but it did not return all the tweets from the given user handle: I tried to retrieve all the tweets made by the amazon Twitter handle, but I received only 780 tweets out of 50K. You can also see this in the attachment below. I want to get all the tweets made by amazon.
Indeed, that is strange. Just to confirm, you are running this command?
twitterscraper from:amazon --output=amazon.json
Actually, I modified the script (main.py) to pass the parameters from within the program instead of as command-line arguments, as shown below. The rest of the Python scripts are the same, except that I made a few changes as suggested in the latest fixes to query.py to remove fake_useragent, and modified the URL as suggested by you, @lapp0, to include "from:" in the query. The changed parts are marked with comments. Please look into them and suggest.
""" Created on Thu May 24 11:53:26 2018
@author: Nilesh Jorwar """
import sys import json import logging import collections import datetime as dt from os.path import isfile from twitterscraper.query import query_tweets import csv
class JSONEncoder(json.JSONEncoder): def default(self, obj): if hasattr(obj, 'json'): return obj.json() elif isinstance(obj, collections.Iterable): return list(obj) elif isinstance(obj, dt.datetime): return obj.isoformat() elif hasattr(obj, 'getitem') and hasattr(obj, 'keys'): return dict(obj) elif hasattr(obj, 'dict'): return {member: getattr(obj, member) for member in dir(obj) if not member.startswith('_') and not hasattr(getattr(obj, member), 'call')}
return json.JSONEncoder.default(self, obj)
def valid_date(s): try: return dt.datetime.strptime(s, "%Y-%m-%d").date() except ValueError: msg = "Not a valid date: '{0}'.".format(s) print(msg)
def main():
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging.INFO)
try:
**with open('coname_twitter_account_users.csv', 'r',newline='') as myFile:
readInput=csv.reader(myFile)
for i,company in enumerate(readInput):
if i > 0 and company[1]:
print(company[1])
output=company[1]+'.json'
if isfile(output):
logging.error("Output file already exists! Aborting.")
sys.exit(-1)
query1=company[1]
#query1='amazon'
limit1=100000
begindateString = '2006-03-21'
begindate1=valid_date(begindateString)
enddate1=dt.date.today()
poolsize1=20
lang1='en'
tweets = query_tweets(query = query1, limit = limit1,
begindate = begindate1, enddate = enddate1,
poolsize = poolsize1, lang = lang1)
if tweets:
with open(output, "w") as output:
json.dump(tweets, output, cls=JSONEncoder)**
except KeyboardInterrupt:
logging.info("Program interrupted by user. Quitting...")
if name == "main":
main()
Query.py
```python
from __future__ import division

import random
import requests
import datetime as dt
import json

from functools import partial
from multiprocessing.pool import Pool

from twitterscraper.tweet import Tweet
from twitterscraper.logging import logger


# Changed: fake_useragent removed; a fixed list of user agents is used instead.
ua1 = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1"
ua2 = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
ua3 = "Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13"
ua4 = "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
ua5 = "Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201"
ua6 = "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16"
ua7 = "Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre"
HEADERS_LIST = [ua1, ua2, ua3, ua4, ua5, ua6, ua7]

# Changed: "from:" prepended to the query so that only tweets *by* the handle match.
INIT_URL = "https://twitter.com/search?f=tweets&vertical=default&q=from:{q}&l={lang}"
RELOAD_URL = "https://twitter.com/i/search/timeline?f=tweets&vertical=" \
             "default&include_available_features=1&include_entities=1&" \
             "reset_error_state=false&src=typd&max_position={pos}&q=from:{q}&l={lang}"


def linspace(start, stop, n):
    if n == 1:
        yield stop
        return
    h = (stop - start) / (n - 1)
    for i in range(n):
        yield start + h * i


def query_single_page(url, html_response=True, retry=10):
    """
    Returns tweets from the given URL.

    :param url: The URL to get the tweets from
    :param html_response: False, if the HTML is embedded in a JSON
    :param retry: Number of retries if something goes wrong.
    :return: The list of tweets, the pos argument for getting the next page.
    """
    headers = {'User-Agent': random.choice(HEADERS_LIST)}

    try:
        response = requests.get(url, headers=headers)
        if html_response:
            html = response.text or ''
        else:
            html = ''
            try:
                json_resp = json.loads(response.text)
                html = json_resp['items_html'] or ''
            except ValueError as e:
                logger.exception('Failed to parse JSON "{}" while requesting "{}"'.format(e, url))

        tweets = list(Tweet.from_html(html))

        if not tweets:
            return [], None

        if not html_response:
            return tweets, json_resp['min_position']

        return tweets, "TWEET-{}-{}".format(tweets[-1].id, tweets[0].id)
    except requests.exceptions.HTTPError as e:
        logger.exception('HTTPError {} while requesting "{}"'.format(e, url))
    except requests.exceptions.ConnectionError as e:
        logger.exception('ConnectionError {} while requesting "{}"'.format(e, url))
    except requests.exceptions.Timeout as e:
        logger.exception('TimeOut {} while requesting "{}"'.format(e, url))
    except json.decoder.JSONDecodeError as e:
        logger.exception('Failed to parse JSON "{}" while requesting "{}".'.format(e, url))

    if retry > 0:
        logger.info("Retrying... (Attempts left: {})".format(retry))
        return query_single_page(url, html_response, retry - 1)

    logger.error("Giving up.")
    return [], None


def query_tweets_once(query, limit=None, lang=''):
    """
    Queries twitter for all the tweets you want! It will load all pages it gets
    from twitter. However, twitter might out of a sudden stop serving new pages;
    in that case, use the query_tweets method.

    Note that this function catches the KeyboardInterrupt so it can return
    tweets on incomplete queries if the user decides to abort.

    :param query: Any advanced query you want to do! Compile it at
                  https://twitter.com/search-advanced and just copy the query!
    :param limit: Scraping will be stopped when at least `limit` number of
                  items are fetched.
    :param num_tweets: Number of tweets fetched outside this function.
    :return: A list of twitterscraper.Tweet objects. You will get at least
             `limit` number of items.
    """
    logger.info("Querying {}".format(query))
    query = query.replace(' ', '%20').replace("#", "%23").replace(":", "%3A")
    pos = None
    tweets = []
    try:
        while True:
            new_tweets, pos = query_single_page(
                INIT_URL.format(q=query, lang=lang) if pos is None
                else RELOAD_URL.format(q=query, pos=pos, lang=lang),
                pos is None
            )
            if len(new_tweets) == 0:
                logger.info("Got {} tweets for {}.".format(len(tweets), query))
                return tweets

            tweets += new_tweets

            if limit and len(tweets) >= limit:
                logger.info("Got {} tweets for {}.".format(len(tweets), query))
                return tweets
    except KeyboardInterrupt:
        logger.info("Program interrupted by user. Returning tweets gathered "
                    "so far...")
    except BaseException:
        logger.exception("An unknown error occurred! Returning tweets "
                         "gathered so far.")

    logger.info("Got {} tweets for {}.".format(len(tweets), query))
    return tweets


def eliminate_duplicates(iterable):
    """
    Yields all unique elements of an iterable sorted. Elements are considered
    non unique if the equality comparison to another element is true. (In those
    cases, the set conversion isn't sufficient as it uses identity comparison.)
    """
    class NoElement:
        pass

    prev_elem = NoElement
    for elem in sorted(iterable):
        if prev_elem is NoElement:
            prev_elem = elem
            yield elem
            continue

        if prev_elem != elem:
            prev_elem = elem
            yield elem


def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21),
                 enddate=dt.date.today(), poolsize=20, lang=''):
    no_days = (enddate - begindate).days
    if poolsize > no_days:
        # the number of pools should not exceed the number of dates.
        poolsize = no_days
    dateranges = [begindate + dt.timedelta(days=elem)
                  for elem in linspace(0, no_days, poolsize + 1)]

    if limit:
        limit_per_pool = (limit // poolsize) + 1
    else:
        limit_per_pool = None

    queries = ['{} since:{} until:{}'.format(query, since, until)
               for since, until in zip(dateranges[:-1], dateranges[1:])]

    all_tweets = []
    try:
        pool = Pool(poolsize)
        try:
            i = 0
            for new_tweets in pool.imap_unordered(
                    partial(query_tweets_once, limit=limit_per_pool, lang=lang), queries):
                all_tweets.extend(new_tweets)
                i = i + 1
                logger.info("{}: Got {} tweets ({} new).".format(i, len(all_tweets), len(new_tweets)))
        except KeyboardInterrupt:
            logger.info("Program interrupted by user. Returning all tweets "
                        "gathered so far.")
    finally:
        pool.close()
        pool.join()

    return all_tweets
```
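As an aside, the way query_tweets spreads the work over the pool is easiest to see in isolation. This is just an illustrative sketch of the same linspace/daterange logic with made-up dates, not part of the patch:

```python
import datetime as dt

def linspace(start, stop, n):
    # Same helper as in query.py above.
    if n == 1:
        yield stop
        return
    h = (stop - start) / (n - 1)
    for i in range(n):
        yield start + h * i

# Made-up example: a 20-day range split over 4 worker processes.
begindate = dt.date(2018, 1, 1)
enddate = dt.date(2018, 1, 21)
poolsize = 4

no_days = (enddate - begindate).days
dateranges = [begindate + dt.timedelta(days=d) for d in linspace(0, no_days, poolsize + 1)]

# Each worker gets one "since:... until:..." slice of the full date range.
for since, until in zip(dateranges[:-1], dateranges[1:]):
    print('from:amazon since:{} until:{}'.format(since, until))
# from:amazon since:2018-01-01 until:2018-01-06
# from:amazon since:2018-01-06 until:2018-01-11
# from:amazon since:2018-01-11 until:2018-01-16
# from:amazon since:2018-01-16 until:2018-01-21
```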
@NileshJorwar thanks for your work, but this is really difficult to read. Could you make a pull request instead, or at least upload the files so I can diff it?
@lapp0 I don't have access to upload files or make a pull request. Is there any way I can email you the files to review, or a way to upload the files to a GitHub account?
You can attach files to github comments. You also have access to make a pull request, or you could just push your branch.
@NileshJorwar, user 'amazon' has at the moment 27.7K tweets.
The command twitterscraper "from:amazon" -o amazon.json results in a JSON file with 25,465 tweets. See attachment.
I can't tell why you don't get all of the 27K tweets, especially since I don't know the contents of coname_twitter_account_users.csv. It can be due to many things.
PS: The proper way to make changes is to fork the project and change your local version. See here.
coname_twitter_account_users.csv is just an input file that contains a list of Twitter handles.
I am using the CSV file to iterate through the Twitter handles and get the tweets for all of the accounts, as sketched below.
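For illustration only (the real file's contents weren't shared), the script above assumes a layout along these lines: a header row that gets skipped, and the handle in the second column (company[1]):

```
company_name,twitter_handle
Amazon,amazon
Microsoft,microsoft
```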
Can you try it without the '@' ?
I only tried it without the '@'. Here is a snapshot of the queries that were formed:
I tried downloading the current twitterscraper project onto my system and ran the following query, and I got 11K tweets made by amazon for the given handle, when it is supposed to be 27K.
Again, it looks like you're not using the latest version. Please do `pip freeze` and share the version of twitterscraper you are using.
twitterscraper==0.6.1 is the version listed.
However, I upgraded twitterscraper to the latest version (twitterscraper==0.7.1) and downloaded the tweets with the following JSON error; I now get somewhat closer to the actual number, say 20K out of 27K.
Anaconda has a bug where it may result in multiple versions of a package being installed. I also notice that you aren't getting the "retry" log line after a JSONDecodeError. This indicates to me that you are still running an old version.
Could you please:
1) run `pip uninstall twitterscraper` until `pip freeze` and `pip3 freeze` both no longer show twitterscraper
2) `pip install twitterscraper`
3) try again
@lapp0 I did as you said, but whenever I run `pip3 install twitterscraper`, the packages inside the Anaconda3 folder get updated/installed, while the packages under Python36-32 do not. Why is that? The snapshot below is from after I ran the steps mentioned above on my system.
However, when I tried the same steps on another system (one with a higher configuration), I got the following errors:
Hi @NileshJorwar @lapp0, I believe this issue, as well as issues #124 and #125, is related to and caused by issue #92.
That is, the search page does not include retweets, while the profile page does. That is why there is a discrepancy between the two numbers.
If you look at the profile page of JunckerEU you will see there are (supposed to be) 1689 tweets. The search page, however, only has 1135 tweets. All of the 1135 tweets are scraped by twitterscraper, but of course this number is not equal to 1689. I did not do a full analysis of which tweets are missing from the search page, but the first few retweets do not appear, so I'm assuming it is the retweets.
My suggestion is to either fix issue #92 by finding a keyword which includes retweets in the search page, or implement a command-line argument -f / --from which makes twitterscraper scrape someone's profile page.
@taspinar `twitterscraper "from:amazon"` yielded 25,477 tweets for me three times in a row. So while some tweets are missing because they are retweets, he is still short about 5K tweets that he should have been getting.
I did some research on retweets. It seems that none of the keywords that allowed searching for retweets work anymore, as of July 2017 :(
There may be some search logic that allows it, but a few minutes of research yielded nothing that worked.
This library gets tweets from profiles rather than search: https://github.com/kennethreitz/twitter-scraper
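If it helps, usage looks roughly like this. This is only a sketch based on that project's README; I haven't verified it against the current release:

```python
# Sketch only: pip install twitter-scraper (the kennethreitz project linked above).
# It walks the profile timeline instead of the search page, so retweets show up.
from twitter_scraper import get_tweets

tweets = list(get_tweets('amazon', pages=25))  # 'pages' controls how far back it paginates
print(len(tweets))
print(tweets[0]['text'])  # each tweet is returned as a dict
```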
@lapp0 @NileshJorwar I did some more digging and found out that the number of tweets on someone's profile page does not have to match the actual number of tweets, especially if there was some kind of mass-deletion. See here.
This is also visible on the page of @JunckerEU: his profile mentions 1689 tweets, but if you scroll all the way down on his page (so that all his tweets have been loaded) and then search the source code for js-stream-item, you can only count ~800 tweets.
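If you want to reproduce that count without eyeballing the source, something like this rough snippet works on a copy of the fully scrolled page saved from the browser (plain requests won't trigger the infinite scroll, so the page has to be saved manually; the filename is just an example):

```python
# Count tweet containers in a locally saved, fully scrolled profile page.
with open('junckereu_profile.html', encoding='utf-8') as f:
    html = f.read()

print(html.count('js-stream-item'))
```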
But there is also good news: I will release a new version which can scrape all tweets from a profile page, including retweets of other people.
I incorporated the changes made above into my originally downloaded files, and I also tried to download/update the twitterscraper package using `pip install twitterscraper`, but it still doesn't give me all the tweets.
I want to retrieve all the tweets per account. Say @amazon has tweeted 27.8K tweets to date: how do I retrieve all the tweets made by amazon, rather than the tweets in which the keyword amazon appears? I did try using the latest advanced-search query option in INIT_URL, "https://twitter.com/search?l=&q=from%3A{q}&src=typd&lang=en", but could not find a reload URL for it. And this option still does not give me all the tweets, as I would need to modify tweet.py to retrieve the tweet data from the profile-page tags with BeautifulSoup in the from_soup method.
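To make that from_soup idea concrete, here is a very rough, hypothetical sketch of pulling tweets straight off a profile page. This is not the actual tweet.py code; the CSS classes are the ones the old Twitter web profile used, and without emulating the timeline AJAX calls only the first batch of tweets is returned:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: fetch the profile page and read the visible tweets out of
# the li.js-stream-item containers mentioned earlier in this thread.
response = requests.get('https://twitter.com/amazon',
                        headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

for item in soup.find_all('li', class_='js-stream-item'):
    text = item.find('p', class_='tweet-text')
    if text is not None:
        print(text.get_text())
```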