taspinar / twitterscraper

Scrape Twitter for Tweets

KeyError: 'items_html' when scraping #330

erb13020 opened this issue 4 years ago (status: open)

erb13020 commented 4 years ago

When I am scraping for tweets for a given day, twitterscraper stops scraping tweets for that day and returns the following error.

ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 173, in query_tweets_once_generator
    new_tweets, new_pos = query_single_page(query, lang, pos)
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 100, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'

Sometimes it will gather up to 20,000 tweets for a certain query on a certain day. Sometimes it will stop at around 20 tweets. Here is my full output for scraping all tweets about 'tesla' on March 1, 2020.

INFO: {'User-Agent': 'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko', 'X-Requested-With': 'XMLHttpRequest'}
Scraping tweets for 1/3/2020
INFO: queries: ['tesla since:2020-03-01 until:2020-03-02']
INFO: Querying tesla since:2020-03-01 until:2020-03-02
INFO: Scraping tweets from https://twitter.com/search?f=tweets&vertical=default&q=tesla%20since%3A2020-03-01%20until%3A2020-03-02&l=
INFO: Using proxy 119.2.54.204:31322
INFO: Scraping tweets from https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-1234265965061386240-1234267078539898880&q=tesla%20since%3A2020-03-01%20until%3A2020-03-02&l=
INFO: Using proxy 128.199.214.87:3128
ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 173, in query_tweets_once_generator
    new_tweets, new_pos = query_single_page(query, lang, pos)
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 100, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'
INFO: Got 18 tweets for tesla%20since%3A2020-03-01%20until%3A2020-03-02.
INFO: Got 18 tweets (18 new).
Scraped 13 tweets for 1/3/2020

Here is my code:

import datetime as dt
import random

import pandas as pd
import twitterscraper.query
from twitterscraper import query_tweets

HEADERS_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; x64; fr; rv:1.9.2.13) Gecko/20101203 Firebird/3.6.13',
    'Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
    'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16',
    'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre'
]

twitterscraper.query.HEADER = {'User-Agent': random.choice(HEADERS_LIST), 'X-Requested-With': 'XMLHttpRequest'}

def scrape(d, m, y, query):

    begin_date = dt.date(y, m, d)
    end_date = begin_date + dt.timedelta(days=1)

    tweets = query_tweets(query, begindate=begin_date, enddate=end_date)

    df = pd.DataFrame(t.__dict__ for t in tweets)

    print('Scraped ' + str(len(df)) + ' tweets for ' + str(d) + '/' + str(m) + '/' + str(y))

    return df
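
For reference, the call that produces the output above would be:

df = scrape(1, 3, 2020, 'tesla')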

I have tried looking around and played with different poolsizes, but I'm not really sure what the issue is or where to start with fixing it. I am currently using version 1.5.0. Thank you!

Shagun-25 commented 4 years ago

I am also facing the same issue. Any updates on how to fix this?

ashgreat commented 4 years ago

I am getting this error for anything more than 10k tweets.

Edit: it now shows up even for a lower number of tweets.

taspinar commented 4 years ago

There is a try/except around that block of code, but it is specific to ValueErrors. I'll add KeyError as well so it does not break like this. You can expect a new version on pypi later today. Thanks for bringing this up.
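
For anyone hitting this before the release, a minimal sketch of the guard (the json_resp and items_html names come from the traceback above; the helper name and logging setup are illustrative assumptions):

import json
import logging

logger = logging.getLogger('twitterscraper')

def extract_items_html(response_text):
    # Hypothetical helper: parse the search response and pull out the
    # 'items_html' payload, tolerating malformed JSON (ValueError) as well
    # as a response without that key (KeyError), e.g. a rate-limit message.
    try:
        json_resp = json.loads(response_text)
        return json_resp['items_html'] or ''
    except (ValueError, KeyError):
        logger.exception('Response did not contain items_html')
        return ''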

erb13020 commented 4 years ago

@taspinar Thank you so much! I'll let you know if it works.

taspinar commented 4 years ago

It should be fixed in version 1.6.1 available on pypi.
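
To pick up the fix, upgrade via pip:

pip install --upgrade twitterscraper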

erb13020 commented 4 years ago

Thank you @taspinar, it works and scrapes many more tweets than before.

ashgreat commented 4 years ago

> It should be fixed in version 1.6.1 available on pypi.

I uninstalled the previous version and installed the new one. I am using CLI and it still throws up the same error.

erb13020 commented 4 years ago

After testing, I'm getting a similar but different error. It's scraping a lot more tweets than before, but I wanted to share what I've been getting with version 1.6.1:

Traceback (most recent call last):
  File "/home/erb13020/PycharmProjects/untitled/venv/lib/python3.8/site-packages/twitterscraper/query.py", line 107, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'

It doesn't cut off scraping, but this repeats a lot when I run the scraper.

ashgreat commented 4 years ago

In my case, I searched for an airline, @SouthwestAir, between 2020-01-01 and 2020-03-31 and requested 50,000 tweets. It kept running even while displaying this error. However, the number of tweets returned was less than 200.

erb13020 commented 4 years ago

@ashgreat Same for me, except I was able to get a lot more tweets about Tesla - somewhere in the tens of thousands. It's much better than version 1.5.0.

160Bobo commented 4 years ago

I am having the same problem; this is the error that comes up:

Failed to parse JSON while requesting "https://twitter.com/i/search/timeline?f=tweets&vertical=default&include_available_features=1&include_entities=1&reset_error_state=false&src=typd&max_position=TWEET-1188605683345903616-1188606295571619846&q=bush%20since%3A2019-10-27%20until%3A2019-10-28&l=english"
Traceback (most recent call last):
  File "c:\users\chiar\appdata\local\programs\python\python38-32\lib\site-packages\twitterscraper\query.py", line 107, in query_single_page
    html = json_resp['items_html'] or ''
KeyError: 'items_html'
INFO:twitterscraper:Got 205 tweets (19 new).

I was trying to get 10,000 tweets but I get this error every time. I can scrape 100 tweets but no more than that. I am currently using version 1.6.1 and still getting this issue.

ThomasADuffy commented 4 years ago

I am getting the same error. I printed the json_resp and it turns out Twitter is returning a rate-limit message instead of search results: {'message': 'Sorry, you are rate limited.'}. Maybe they caught wind of this trick and now it doesn't work?
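
A quick check for that case, given the json_resp shown above (the helper name is hypothetical):

def is_rate_limited(json_resp):
    # A rate-limited response carries a 'message' field instead of 'items_html'.
    return 'items_html' not in json_resp and 'rate limited' in json_resp.get('message', '')

print(is_rate_limited({'message': 'Sorry, you are rate limited.'}))  # True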

Also, if anyone could explain poolsize, and how to pull the user information, that would be great! As of now I am trying to build the JSON by doing the following:


import datetime as dt
import json

from twitterscraper import query_tweets

json_lst = []
for username in ['@Nike', '@UnderArmour', '@Adidas']:
    for tweet in query_tweets(username, begindate=dt.date(2017, 1, 1),
                              enddate=dt.date.today(), poolsize=20,
                              lang='', use_proxy=True):
        # timestamps are datetime objects and not JSON serializable
        tweet.__dict__['timestamp'] = str(tweet.__dict__['timestamp'])
        json_lst.append(tweet.__dict__)  # dump the dict, not the Tweet object
    with open(f'{username}.json', 'w') as f:
        json.dump(json_lst, f, indent=4)
    json_lst = []

JagritiJ commented 4 years ago

> When I am scraping for tweets for a given day, twitterscraper stops scraping tweets for that day and returns the following error. KeyError: 'items_html' [full original report quoted]

Same here

clbarrell commented 4 years ago

Looks like @ThomasADuffy identified the issue in #333:

{'message': 'Sorry, you are rate limited.'} - this is the json_resp.

HeroadZ commented 4 years ago

Is there any way to fix this?

ThomasADuffy commented 4 years ago

As of now, not sure. Adding a sleep timer of a couple of seconds might help. Twitter is probably flagging the IP address because it's going through pages too fast.
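
A minimal sketch of that idea, assuming we pause and retry whenever the rate-limit message comes back (the helper and its parameters are illustrative, not part of twitterscraper):

import time

def fetch_with_backoff(fetch_page, pos, retries=3, wait_seconds=30):
    # fetch_page(pos) should return the parsed JSON response; retry with a
    # pause whenever Twitter answers with the rate-limit message instead of
    # search results.
    json_resp = {}
    for _ in range(retries):
        json_resp = fetch_page(pos)
        if 'items_html' in json_resp:
            break
        time.sleep(wait_seconds)
    return json_resp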

HeroadZ commented 4 years ago

@ThomasADuffy Thank you for your advice. But with a sleep timer we fetch the data in more than one pass. Won't the data be redundant, i.e. might we get the same tweets more than once?
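
One way to guard against that is to deduplicate on the tweet id after collecting; a sketch, assuming the Tweet objects expose a tweet_id attribute (adjust the key to whatever your version provides):

def dedupe_tweets(tweets, key=lambda t: t.tweet_id):
    # Keep the first occurrence of each id across repeated scraping passes.
    seen = set()
    unique = []
    for tweet in tweets:
        tid = key(tweet)
        if tid not in seen:
            seen.add(tid)
            unique.append(tweet)
    return unique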

HeroadZ commented 4 years ago

A sleep timer doesn't help if I want to scrape more than 10k tweets in one query. If I split the limit into smaller chunks like 1k and use a sleep timer, the 1k of data is always the same. (And it seems like the limit parameter does not work here; the length of the data returned is variable.)

So does anyone have an idea involving query_single_page? We could split the limit across a number of pages and scrape in a for-loop, but I have no idea how the "pos" parameter works.
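
For what it's worth, a manual paging loop based on the call visible in the tracebacks above, query_single_page(query, lang, pos) returning (new_tweets, new_pos), could look like this; the initial pos of None and the stop condition are assumptions:

from twitterscraper.query import query_single_page

def scrape_pages(query, lang='', max_pages=50):
    # 'pos' is the pagination cursor; each call returns the next batch of
    # tweets plus the cursor for the following page.
    all_tweets, pos = [], None
    for _ in range(max_pages):
        new_tweets, pos = query_single_page(query, lang, pos)
        all_tweets.extend(new_tweets)
        if not pos:  # assumed: no cursor means no further pages
            break
    return all_tweets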

abhilashpanda04 commented 4 years ago

After multiple tries and updating query.py, I am also stuck with the items_html error. Maybe this is because of the new API from Twitter. Waiting for a fix.