taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

Scraper does not collect tweets when querying specifically for 'covid' or 'Covid' #320

Open erb13020 opened 4 years ago

erb13020 commented 4 years ago

I'm trying to gather a dataset of tweets containing the word 'covid' using this library. I've used it for a while without any issues, but when I search specifically for 'covid', the scraper returns no tweets. Queries for 'coronavirus', 'bitcoin', 'mcdonalds', etc. all work; only 'covid' fails. This is what my output looks like:

https://gyazo.com/6fb6bd3a9dc85ff912b087a456b371c0

I had already added the following to my code before this issue appeared:

HEADERS_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13',
    'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
    'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16',
    'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre',
]

so I'm sure that my issue isn't related to https://github.com/taspinar/twitterscraper/issues/316 or https://github.com/taspinar/twitterscraper/issues/296
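For context, a list like this is usually there so that a User-Agent string can be picked at random for the outgoing requests. The snippet below is only a sketch of that idea using the standard library; `build_headers` is a hypothetical helper I made up for illustration, not a twitterscraper function, and the list is shortened to two of the entries above.

```python
import random

# Shortened copy of the list above (two of the five entries).
HEADERS_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
    'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16',
]


def build_headers(user_agents):
    """Hypothetical helper: return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(user_agents)}


headers = build_headers(HEADERS_LIST)
print(headers['User-Agent'])
```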

Here is one of the query URLs from the console output when I run my program:

https://twitter.com/search?f=tweets&vertical=default&q=covid%20since%3A2020-02-26%20until%3A2020-02-27&l=
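To double-check what that URL is actually asking for, the percent-encoded query string can be decoded with the standard library. This is just a sketch using `urllib.parse`; the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://twitter.com/search?f=tweets&vertical=default"
       "&q=covid%20since%3A2020-02-26%20until%3A2020-02-27&l=")

# parse_qs percent-decodes the values; keep_blank_values=True keeps the
# empty `l` parameter instead of silently dropping it.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

search_query = params["q"][0]
print(search_query)  # covid since:2020-02-26 until:2020-02-27
print(params["l"])   # ['']
```

So the search term and the one-day window are encoded as expected, and the `l` parameter is empty.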

My guess is that the 'Know the Facts' popup is preventing the scraper from querying 'covid' tweets properly, because my program does work with any other search term.

I'm not sure how helpful this is, since my file is 140 lines long, but here is the function that gets called when I need to scrape:

def scrape(y, m, query):
    '''
    Returns a dataframe containing all tweets and metadata for a query
    in a given month and filters for only English tweets.

    Parameters:
        y (int): A 4 digit integer representing the year.
        m (int): A 2 digit integer representing the month.
        query (str): The twitter query.

    Returns:
        df (DataFrame): DataFrame containing all tweets and metadata for a query.
    '''
    d = __calculate_days(m, y)
    begin_date = dt.date(y, m, 1)
    end_date = dt.date(y, m, d)

    tweets = query_tweets(query, begindate=begin_date, enddate=end_date, poolsize=d)

    df = pd.DataFrame(t.__dict__ for t in tweets)

    df['lang'] = df['text'].apply(lambda x: detector(x))
    df = df[df['lang'] == 'en']

    return df
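One way to narrow this down would be to split the month into one-day windows and run the same query for each window, to see whether every day comes back empty. The snippet below is a rough sketch of the splitting step only, using the standard library. `daily_windows` is a hypothetical helper, and it assumes that `__calculate_days` (not shown here) returns the number of days in the month, which `calendar.monthrange` provides in this sketch.

```python
import calendar
import datetime as dt


def daily_windows(y, m):
    """Hypothetical helper: yield (begin, end) date pairs, one per day of month m in year y."""
    # monthrange returns (weekday of the first day, number of days in the month).
    days_in_month = calendar.monthrange(y, m)[1]
    for day in range(1, days_in_month + 1):
        begin = dt.date(y, m, day)
        yield begin, begin + dt.timedelta(days=1)


windows = list(daily_windows(2020, 2))
print(len(windows))  # 29
print(windows[0])
print(windows[-1])
```

Each pair could then be passed as `begindate` and `enddate` to `query_tweets` with the same query, logging how many tweets come back per day.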

Any thoughts or hints?