iSarabjitDhiman / TweeterPy

TweeterPy is a Python library to extract data from Twitter. The TweeterPy API lets you scrape data from a user's profile, such as the username, user ID, bio, followers/following lists, profile media, tweets, etc.

Add Rate limit stats to response data - Exception handling #20

Closed by codilau 10 months ago

codilau commented 11 months ago

Any chance raised exceptions could surface instead of being printed in the background? I need to handle rate_limit_exhausted and use the exception to choose another .pkl session.

iSarabjitDhiman commented 11 months ago

Umm, you can apply this temporary fix (I might add a permanent fix to the code soon; it sounds like a great idea to me). You can handle it in a couple of ways:

If you want to handle it before the data is returned, you have to make changes to the code to shuffle the session before it hits the rate limit. However, if you are planning to change the session after it returns the response data, you can add an extra key-value pair with the rate limit data and use the cursor_endpoint to continue after shuffling the session.

Question: How do I keep track of rate limits? Answer: Navigate to the following line in the request_util module:

api_limit_stats = util.check_api_rate_limits(response)

You can assign this api_limit_stats to some temporary variable that is available everywhere. I am going to use the config module for this:

api_limit_stats = util.check_api_rate_limits(response)
config.RATE_LIMIT = {'rate_limit':api_limit_stats}

Now the API stats are available in the config module. Navigate to the tweeterpy.py module and look for _handle_pagination. If you want to handle the session within the code, just check whether config.RATE_LIMIT.get('rate_limit', {}).get('rate_limit_exhausted') is True and shuffle the session. Alternatively, add this rate_limit data to data_container in the _handle_pagination method (if you want to shuffle the sessions on your end); when it returns the data, check for the rate_limit key in the returned data and use a try/except block with a loop to handle it.

# you can add the api_stats to data container in _handle_pagination method
data_container['data'].extend(filter_data(data))
data_container['rate_limit'] = config.RATE_LIMIT
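The "check and shuffle" decision described above can be sketched as a small standalone helper. This is a minimal sketch, not TweeterPy's own code: should_shuffle is a hypothetical name, and the dictionary shape mirrors the config.RATE_LIMIT payload set earlier.

```python
# Hypothetical helper: decide whether to rotate sessions based on the
# rate-limit payload stored in config.RATE_LIMIT (shape assumed from above).
def should_shuffle(rate_limit_stats):
    """Return True when the last response reported an exhausted rate limit."""
    return bool(rate_limit_stats.get('rate_limit', {}).get('rate_limit_exhausted'))

print(should_shuffle({'rate_limit': {'rate_limit_exhausted': True}}))   # True
print(should_shuffle({}))                                               # False
```

You would call this right after the response data comes back and, when it returns True, load a different .pkl session before continuing from cursor_endpoint.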

I hope this answers your question; let me know if you need any help.

codilau commented 11 months ago

Will try shortly. I tried initially to raise a custom error from make_request through _handle_pagination through search that contains the original error + the data but that didn't work out quite as planned.

iSarabjitDhiman commented 11 months ago

> Will try shortly. I tried initially to raise a custom error from make_request through _handle_pagination through search that contains the original error + the data but that didn't work out quite as planned.

Yes, it won't work. I mean, yes, you can raise the error from the request_util module, but when _handle_pagination handles that exception, it just prints the error and then returns the data_container. I chose to do it that way so that, no matter what sort of error occurs, the user always gets the data. But if you want to go the other way, instead of printing the error you can raise the exception in the except block of the _handle_pagination method; you then have to store that data in real time in some global variable or database so that you don't lose your data when the error is raised.
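The "store the data somewhere global before raising" pattern described above can be sketched like this. It is a simplified standalone sketch, not TweeterPy's actual pagination code: PARTIAL_DATA and paginate are hypothetical names, and the 'RATE_LIMITED' sentinel simulates a mid-pagination failure.

```python
# Sketch: keep partial results in a module-level container so the data
# survives even when the pagination loop raises an exception.
PARTIAL_DATA = {'data': []}

def paginate(pages):
    """Append each page to PARTIAL_DATA; raise when the rate limit 'hits'."""
    for page in pages:
        if page == 'RATE_LIMITED':  # simulated rate-limit failure
            raise RuntimeError('rate limit exhausted')
        PARTIAL_DATA['data'].append(page)

try:
    paginate(['tweet1', 'tweet2', 'RATE_LIMITED', 'tweet3'])
except RuntimeError:
    pass  # the exception surfaced, but the data collected so far is safe

print(PARTIAL_DATA['data'])  # ['tweet1', 'tweet2']
```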

iSarabjitDhiman commented 10 months ago

Hey @codilau, @fpmirabile

Hey, I just added this rate limit feature. I am going to commit the changes soon.

API rate limit stats/data will always be available in config._RATE_LIMIT_STATS, just in case you want to use it in some other way.

If you want to manually get access to the rate limits :

from tweeterpy import config

print(config._RATE_LIMIT_STATS)

NOTE: It will only show the rate limit data for the most recent request you made. For instance, if you recently scraped a user's friends, the rate limit data will be related to that API endpoint; the same goes for the other endpoints. You can check the file logs (tweeterpy.log) for thorough API stats.

Here is how you can implement it into your code:

# If you are fetching a lot of data, avoid using a while loop like this;
# it is used only for demonstration purposes.
from tweeterpy import TweeterPy
from tweeterpy.util import RateLimitError  # custom exception you can use to throw an error when the rate limit gets exhausted

twitter = TweeterPy()
user_tweets = {"data": [], "cursor_endpoint": None, "has_next_page": True, "api_rate_limit": {}}
has_next_page = True
while has_next_page:
    try:
        tweets = twitter.get_user_tweets('elonmusk', total=50)  # setting 50 on purpose, just for demonstration
        cursor_endpoint = tweets.get('cursor_endpoint', None)
        api_rate_limit = tweets.get('api_rate_limit', {})
        has_next_page = tweets.get('has_next_page', False)
        limit_exhausted = api_rate_limit.get("rate_limit_exhausted")
        user_tweets['data'].extend(tweets['data'])
        user_tweets['cursor_endpoint'] = cursor_endpoint
        user_tweets['has_next_page'] = has_next_page
        user_tweets['api_rate_limit'] = api_rate_limit
        if not has_next_page:
            break
        if has_next_page and limit_exhausted:
            raise RateLimitError
    except RateLimitError:
        # Handle the exception here: shuffle the account, or wait (time.sleep)
        # until the rate limit expires. The reset datetime object is in
        # tweets.get('reset_after_datetime_object').
        pass
    except Exception:
        pass
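One way to fill in the except RateLimitError branch above is round-robin rotation over several saved .pkl sessions, as the original question asked. This is a hedged sketch: SESSION_FILES and next_session are assumptions, and loading the returned file into TweeterPy is left to whatever session-loading call your setup uses.

```python
# Hypothetical sketch: rotate among saved session files when the rate
# limit is exhausted. The file names here are placeholders.
import itertools

SESSION_FILES = ['session_a.pkl', 'session_b.pkl']
_sessions = itertools.cycle(SESSION_FILES)

def next_session():
    """Return the next session file in round-robin order."""
    return next(_sessions)

print(next_session())  # session_a.pkl
print(next_session())  # session_b.pkl
print(next_session())  # session_a.pkl (wraps around)
```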

If you have any questions, let me know. Edit: already implemented.