ropensci / rtweet

🐦 R client for interacting with Twitter's [stream and REST] APIs
https://docs.ropensci.org/rtweet

Add retryonratelimit for premium search_fullarchive searches #368

Closed: kevintaylor closed this issue 3 years ago

kevintaylor commented 4 years ago

rtweet has been great up until this point, but now I am dead in the water with it. I tried to access Twitter's premium full historical search with rtweet. For my project, I have to download the entire set of statuses over a one-year period for several hundred Twitter accounts. This is thousands of tweets per account. I wrote the script using rtweet's search_fullarchive after purchasing premium access to Twitter.

The problem is that I quickly get rate limited by Twitter. I tried to use the retryonratelimit = TRUE argument, but that is not supported for search_fullarchive.

Can we get support for retryonratelimit in search_fullarchive? Alternatively, is there some sample code you could point me to for monitoring the rate limit and retrying in a loop?
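In the meantime, a minimal retry wrapper along these lines might work (a sketch only: the function name, the 15-minute wait, and the matched warning text are assumptions, not features of rtweet):

library(rtweet)

# Sketch: call search_fullarchive() and back off when a rate-limit condition
# comes back. search_with_retry, wait_secs, and the "rate limit" pattern are
# illustrative choices, not part of rtweet.
search_with_retry <- function(query, ..., max_attempts = 5, wait_secs = 15 * 60) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      search_fullarchive(query, ...),
      warning = function(w) w,
      error = function(e) e
    )
    if (!inherits(result, "condition")) {
      return(result)                      # got data back, stop retrying
    }
    if (grepl("rate limit", conditionMessage(result), ignore.case = TRUE)) {
      message("Rate limited; sleeping ", wait_secs, " seconds (attempt ", attempt, ")")
      Sys.sleep(wait_secs)
    } else {
      stop(result)                        # some other failure, re-raise it
    }
  }
  stop("Still rate limited after ", max_attempts, " attempts")
}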

nicolocavalli commented 4 years ago

Hi Kevin, did you find a fix for this? I have the same issue with search_fullarchive and am trying to build a loop around it.

kevintaylor commented 4 years ago

Hi Nicolo,

There were two issues I ran into with rtweet and search_fullarchive. First, it erroneously consumes too many requests against the Twitter API. To fix this, I had to clone the rtweet repo locally and change the line below in search_tweets.R. Then I had to install the modified package into my R environment.

This change drastically reduced the number of requests being consumed, from an average of about 5 requests per Twitter account history I was downloading to about 2.25 per account.

if (grepl("fullarchive|30day", query)) {
    params[["premium"]] <- NULL
    params$result_type <- NULL
    if (grepl("full", query)) {
      params$maxResults <- 500 # changed this from 100
    } else {
      params$maxResults <- 100
    }
    # ... rest of the function unchanged
}

Second, the documentation for search_fullarchive is not very clear: retryonratelimit just doesn't apply to this function. The function handles the looping itself and will return as many tweets as match your search criteria; you don't have to manage the rate limit yourself. Here is the ugly code I used. I have some sleep cycles in there, but I'm not sure they are actually necessary. I collected over 300k tweets over a 9-year period for about 900 accounts using versions of this function.

##
# Retrieve the tweets for each founder
##
get_user_tweets <- function(users) {

  api_token <- get_token()
  file <- "founder_tweet_messages.log"

  # Collection list
  user_tweets <- vector(mode = "list", length = nrow(users)) # one result slot per user

  for (x in 1:nrow(users)) {
    tryCatch({
      th <- users[x, "twitter_handle"]$twitter_handle
      print(paste("th:", th))
      ad <- users[x, "announced_on"]$announced_on
      print(paste("ad:", ad))

      sd <- ad - months(6) # months() from lubridate
      print(paste("sd:", sd))
      ed <- ad + months(6)
      print(paste("ed:", ed))

      # If the sd or ed is missing, report issue and skip this record
      if(is.na(sd) || is.na(ed)) {
        write(
          paste(
            "SKIPPING for:",
            th,
            "announced date",
            ad,
            "start date",
            sd,
            "end date",
            ed
          ),
          file,
          append = TRUE
        )
        # Skip to next user in list
        next
      }

      user_tweets[[x]] <- search_fullarchive(
        paste0("from:", th, " -is:retweet"),
        n = 500, # fixed in my rtweet fork; previously hard-coded as 100
        fromDate = sd,
        toDate = ed,
        env_name = "production",
        token = api_token # is this needed? https://github.com/ropensci/rtweet/issues/359
      )

      print(paste("tweets:", nrow(user_tweets[[x]])))

    }, warning = function(war) {
      write(
        paste(
          "WARNING for:",
          th,
          "announced date",
          ad,
          "start date",
          sd,
          "end date",
          ed
        ),
        file,
        append = TRUE
      )
      write(toString(war),
            file,
            append = TRUE)

      message <- war[1]

      if(grepl(message, pattern = "Exceeded rate limit")) {
        # Sleep 16 minutes when rate exceeded
       # This error never came up once I fixed the rtweet package
        print("sleeping 16 minutes")
        Sys.sleep(16*60)
      } else if(grepl(message, pattern = "current package request limits")) {
        print("Exiting script--Twitter account limits reached")
        # Exit if account limits reached for the month; break cannot be used
        # inside a handler function, so signal an error to abort instead
        stop("Monthly package request limit reached")
      }
    }, error = function(err) {
      write(
        paste(
          "ERROR for:",
          th,
          "announced date",
          ad,
          "start date",
          sd,
          "end date",
          ed
        ),
        file,
        append = TRUE
      )
      write(toString(err),
            file,
            append = TRUE)
    }) #END TryCatch

    # Sleep 5 second between user requests, limiting api hits to no more than 30 per minute
    # API says up to 60 per minute are allowed so this is a conservative strategy
    # https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search#CountsEndpoint
    print("sleeping 5 secounds")
    Sys.sleep(5)
  }
  return(user_tweets)
}
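A minimal usage sketch for the function above (the handles and dates are made up; the column names follow the code, and months() assumes lubridate is loaded):

library(rtweet)
library(lubridate)  # for months() used inside get_user_tweets()
library(tibble)

# Hypothetical input: one row per founder, with the columns the function expects
users <- tibble(
  twitter_handle = c("founder_one", "founder_two"),
  announced_on   = as.Date(c("2018-03-15", "2019-07-01"))
)

founder_tweets <- get_user_tweets(users)

# Combine the per-user results, dropping users that were skipped or failed
all_tweets <- do.call(rbind, Filter(Negate(is.null), founder_tweets))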

kevintaylor commented 4 years ago

The fork with the fix is here, in case someone wants to use it without cloning and editing the search_tweets.R code themselves:

https://github.com/kevintaylor/rtweet
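Installing a fork like this from GitHub typically looks something like the following (assuming the remotes package; restart R and reload rtweet afterwards):

# Install the patched fork instead of the CRAN release
install.packages("remotes")
remotes::install_github("kevintaylor/rtweet")
library(rtweet)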

vestedinterests commented 4 years ago

Hi Kevin, thank you for contributing and sharing your source code! I am running into a similar issue and wondered whether you could share a bit about your get_user_tweets function. Are you only ever sending one request for 500 tweets to search_fullarchive, and is that sufficient for each founder? I am also looking to access a larger volume of tweets, but I couldn't neatly divide them up into chunks of 500 each and wouldn't know where to set the next fromDate and toDate. I am trying to understand whether you solved that issue in your code (did you set up a kind of pagination for large results?) but am too novice to follow it, or whether you simply didn't have that problem.

simoncarrignon commented 4 years ago

To solve your "pagination" problem @vestedinterests, you should use the "next" token: even with the premium API you won't be able to retrieve more than 500 tweets per request, hence the change made by @kevintaylor. One problem with this change, though, is that you can access the full archive search API even without a premium account, and in that case the maxResults = 100 limit still applies. There are other problems with search_fullarchive, as it should also allow access to the "counts" endpoint, as raised in issue #369. I am just starting to dig into this, but in any case your pagination problem should be handled via the "next" token the API returns, and I haven't yet found a way to do that with rtweet.
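For what it's worth, driving the raw premium endpoint directly and following the "next" cursor could look roughly like the sketch below (using httr rather than rtweet; the environment label, bearer-token handling, and page limit are assumptions):

library(httr)

# Sketch: page through premium full-archive search by passing the "next" token
# back to the endpoint until it stops being returned.
fetch_fullarchive_pages <- function(query, from_date, to_date,
                                    env_label = "production",
                                    bearer_token = Sys.getenv("TWITTER_BEARER"),
                                    max_pages = 10) {
  url <- sprintf("https://api.twitter.com/1.1/tweets/search/fullarchive/%s.json",
                 env_label)
  pages <- list()
  next_token <- NULL

  for (i in seq_len(max_pages)) {
    params <- list(query = query, fromDate = from_date, toDate = to_date,
                   maxResults = 500)          # 100 on sandbox, 500 on paid
    if (!is.null(next_token)) params[["next"]] <- next_token

    resp <- GET(url,
                add_headers(Authorization = paste("Bearer", bearer_token)),
                query = params)
    stop_for_status(resp)

    body <- content(resp, as = "parsed")
    pages[[i]] <- body$results
    next_token <- body[["next"]]              # absent on the last page
    if (is.null(next_token)) break
  }
  pages
}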

kevintaylor commented 4 years ago

Are you only ever sending one request for 500 tweets to search_fullarchive, and is that sufficient for each founder? I am also looking to access a larger volume of tweets, but I couldn't neatly divide them up into chunks of 500 each and wouldn't know where to set the next fromDate and toDate. I am trying to understand whether you solved that issue in your code (did you set up a kind of pagination for large results?) but am too novice to follow it, or whether you simply didn't have that problem.

In my case, I was pulling all tweets for a Twitter user that were posted between a start and end date--a 12-month time span. For one user alone I retrieved over 270k tweets with this code. I didn't have to use the "next" token or otherwise paginate the download of tweets, as you can see in the code I posted above.

If I had used my code above for the entire data collection, it would have cost me about $2k to retrieve all my data. But, because of the bug in search_fullarchive that I didn't detect originally, I ended up having to pay nearly $4k for all the data because it was limiting requests to 100 at a time instead of the 500 that the Twitter API allows.

This function returns as many tweets as match your request--you only have to call it once for each user timeline you're retrieving:

user_tweets[[x]] <- search_fullarchive(
  paste0("from:", th, " -is:retweet"),
  n = 500, # fixed in my rtweet fork; previously hard-coded as 100
  fromDate = sd,
  toDate = ed,
  env_name = "production",
  token = api_token # is this needed? https://github.com/ropensci/rtweet/issues/359
)

Good luck!

chris18254 commented 4 years ago

Hi Kevin, Thanks for kindly sharing your code, which works perfectly well in terms of mitigating the risk of wasting requests! However, I'm not quite sure whether you've included sleep time in the fork to prevent exceeding the rate limit, or whether you're manually implementing it each time by pasting the respective code into your queries in R? Thanks and best regards, Chris

kevintaylor commented 4 years ago

Chris,

You just need to add your users to a "users" collection and define the appropriate arguments for the search_fullarchive function. The sleep logic was there as a safety valve but ended up never being triggered once I fixed the code in search_tweets.R. I think you can use my code above as is, as long as you have an appropriate collection of users and change the code that calls search_fullarchive so it meets your query needs.

Good luck!

chris18254 commented 4 years ago

Thank you, Kevin! We ran search_fullarchive() with your fork and downloaded approx. 500k Tweets (in chunks of max. 50k tweets, just to play it safe) and didn't have any issues with exceeding the rate limit.

IrenaItova commented 4 years ago

@kevintaylor, thanks a lot for spotting the issue and for the fork! I have been trying out search_fullarchive and am about to use the paid version. Since I am new to both Twitter and R, I have a question (probably a very simple one) regarding the pagination mentioned by @simoncarrignon. I don't have a list of users and would like to randomly retrieve 1 million historical tweets between 2020-01-01 and now. This is my code (it's been tested on a sandbox and works fine):

test5 <- search_fullarchive(
  q = "(park OR parks) lang:en bounding_box:[-0.3502 51.3902 0.1785 51.6299]",
  n = 1000000,
  fromDate = "202001010001",
  toDate = "202005192359",
  env = "research",
  parse = TRUE,
  token = ActiveTravel_token
)

Is there a way I can make only a single call to search_fullarchive using @kevintaylor's fork and set the argument n = 1000000 (is it that straightforward?) so that it automatically continues the search after 15 minutes from max_id? Or how can I manually adjust max_id to make sure that I don't use too many requests (e.g. max. 2,000)? The documentation is written for more experienced R users, and I struggle to understand how to control max_id since it is really a search_tweets argument. I appreciate your time!

Thanks in advance.

best wishes, Irena

kevintaylor commented 4 years ago

Hi Irena,

First, I am not an expert on rtweet or the Twitter API--I just needed to fix this issue for my own research!

The code you posted should not require pagination. That is my understanding and experience downloading several million tweets with search_fullarchive. It handles all the pagination behind the scenes.

What you'll need to be aware of is that tweets are downloaded in chunks of 500. So, you need to have enough credits purchased and available in your Twitter developer account. I think I had 1,500 chunks per month available, so it took three months for me to get all my data.

Your calculation is 1,000,000/500 = 2,000 chunks you'll need if there are 1,000,000 matching tweets. If you run out of chunks in the middle of the operation, it stops downloading any additional chunks and you would have to wait until the next month to finish (changing your starting date).
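That arithmetic can be wrapped in a tiny helper (illustrative only: the 500-tweet page size applies to the paid premium tier, and the 1,500-requests-per-month cap is just the figure from my plan):

# Estimate how many requests (and months) a collection will take.
estimate_collection <- function(n_tweets, per_request = 500, requests_per_month = 1500) {
  requests <- ceiling(n_tweets / per_request)
  months   <- ceiling(requests / requests_per_month)
  list(requests = requests, months = months)
}

estimate_collection(1e6)
#> $requests
#> [1] 2000
#>
#> $months
#> [1] 2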

Good luck.

diegoreinero commented 4 years ago

Hi @kevintaylor! Thanks so much for sharing your code and kicking off this super useful thread! I'm new to Twitter scraping and trying to figure out a few things (sorry if this is basic or similar to @IrenaItova's question!):

  1. In reading about Twitter's API when using Premium Search, it seems each request (which I assume means each call to Twitter's API?) returns a maximum of 500 tweets (although the default is set to 100, which is why you updated the "maxResults" line of code in search_tweets.R to 500). So for example, if I wanted to scrape 20,000 tweets from the past, that would require 40 requests. Is that all true?

  2. If the above is true, and there is indeed a limit on the number of tweets you can scrape per API request (i.e., 500 tweets), is there also a limit on the number of tweets you can scrape in a given time window (e.g., 15 minutes) using search_fullarchive? Is that what rate limiting is about?

  3. Relatedly, is that what retryonratelimit = TRUE would help you get around, by allowing you to pick up where you left off once you hit the max number of tweets allowed to be scraped in that 15-minute window? I was a bit confused because this appears to be an argument you can set in the search_tweets function to use if your search will exceed 18,000 tweets within a 15-minute window, but in your Dec. 11th 2019 comment you said this argument doesn't apply to the search_fullarchive function. And I see that @chris18254 used search_fullarchive to scrape 500k tweets (in batches of 50k at a time) and didn't have any issues with exceeding the rate limit, so I just wasn't sure.

  4. In your response to @IrenaItova you said that pagination is handled behind the scenes by search_fullarchive and that it's possible to scrape millions of tweets with a single search, but that the thing to keep in mind is having enough credits purchased and available in your Twitter developer account. How much does a credit cost? Is 1 credit equal to 1 request? I'm trying to understand how I would estimate the cost of doing a search that might yield, say, 1 million tweets.

Thanks in advance!

Best, Diego

kevintaylor commented 4 years ago

Hi Diego,

First, let me say I am not an expert on rtweet or the Twitter API. I did use them in my dissertation, but as you can tell, search_fullarchive is still a little wonky and the documentation is a bit thin. I've included some answers below, though.

  1. In reading about Twitter's API when using Premium Search, it seems each request (which I assume means each call to Twitter's API?) returns a maximum of 500 tweets (although the default is set to 100, which is why you updated the "maxResults" line of code in search_tweets.R to 500). So for example, if I wanted to scrape 20,000 tweets from the past, that would require 40 requests. Is that all true?

This is correct.

  2. If the above is true, and there is indeed a limit on the number of tweets you can scrape per API request (i.e., 500 tweets), is there also a limit on the number of tweets you can scrape in a given time window (e.g., 15 minutes) using search_fullarchive? Is that what rate limiting is about?

There may be a throttled rate for downloading, but if so, search_fullarchive handles it in the background.

  3. Relatedly, is that what retryonratelimit = TRUE would help you get around, by allowing you to pick up where you left off once you hit the max number of tweets allowed to be scraped in that 15-minute window? I was a bit confused because this appears to be an argument you can set in the search_tweets function to use if your search will exceed 18,000 tweets within a 15-minute window, but in your Dec. 11th 2019 comment you said this argument doesn't apply to the search_fullarchive function. And I see that @chris18254 used search_fullarchive to scrape 500k tweets (in batches of 50k at a time) and didn't have any issues with exceeding the rate limit, so I just wasn't sure.

When using search_fullarchive I did not have to use retryonratelimit = TRUE. I did try it, but it had no effect; I don't believe it applies to this function. I downloaded millions of tweets and never needed it.

  4. In your response to @IrenaItova you said that pagination is handled behind the scenes by search_fullarchive and that it's possible to scrape millions of tweets with a single search, but that the thing to keep in mind is having enough credits purchased and available in your Twitter developer account. How much does a credit cost? Is 1 credit equal to 1 request? I'm trying to understand how I would estimate the cost of doing a search that might yield, say, 1 million tweets.

1 credit = 1 API request, so a maximum of 500 tweets per credit using my fork. You'll have to check Twitter's premium API pricing to see the costs; they have several plans available.

IrenaItova commented 4 years ago

Hi @diegoreinero ,

I am using the Premium API at the moment with Kevin's fork and it works excellently! Max. 500 Tweets per request. I am paying $399 for one month, which allows me 1.25M Tweets or 500 requests. I agree with all the hints @kevintaylor has given you. Here is an example of my code:

quiery0020 <- search_fullarchive(
  q = "(sunbathe OR sunbathing OR sunbather OR sunbathers) lang:en -is:retweet",
  n = 500,
  fromDate = "202005031100",
  toDate = "202005031200",
  env = "research",
  parse = TRUE,
  token = ActiveTravel_token
)

I have three things to add from my current experience if they can help you:

best wishes, Irena

diegoreinero commented 4 years ago

Thanks so much @kevintaylor and @IrenaItova! Makes sense! Irena, a couple of quick follow-ups:

I am using the Premium API at the moment with Kevin's fork and it works excellently! Max. 500 Tweets per request. I am paying $399 for one month, which allows me 1.25M Tweets or 500 requests. I agree with all the hints @kevintaylor has given you.

Perhaps this is just Twitter's plan, but if you have purchased a plan that allows you 500 requests for one month, shouldn't that mean you can scrape a maximum of 250k tweets in that month (500 requests x 500 tweets max per request)? How is it you could scrape 1.25M with just 500 requests?

Be careful with the n=500 argument-- in my case this argument did not apply at all either. No matter how high I set it, it always brought back the maximum available Tweets. In my case, the batches of max. Tweets returned per one run of my code were 2.5k and 5k. To continue the search, I had to adjust the fromDate and toDate to the next period. I did not use pagination at all. I've managed to control the number of returned Tweets by splitting the time period into 1-hour intervals.

I wonder if @kevintaylor's fork overrides the n = argument? In other words, since you set maxResults = 500 in the search_tweets.R code, which is part of the search_fullarchive function, perhaps that takes precedence over the n = argument. No idea, but just a thought.

I had a problem with geotagged Tweets-- I wanted to return Tweets relevant only to London, UK, but because some (or most) users do not share their location with Twitter, the results I got with bounding_box:[] inside the q = " " argument were very scarce. In the end, I had to drop it to get more Tweets.

These are great points about how if tweets are scarce to begin with you might "waste" a request by not using its maximum potential. I suppose there's no way to know how many tweets a certain search will return other than running the search (right?), so if a search returns 1,050 tweets, I suppose you'll have used 2 requests to their max potential (500 each) and then "wasted" 1 request by only returning 50 tweets. Does that seem right?

Also, when you used bounding_box:[] did you have to get a Google Maps API key? When I would run a search using search_tweets2 and set geocode = lookup_coords("40.76, -73.98, 2mi") R would give a message saying:

lookup_users() requires a Google Maps API key (for instructions on how to acquire one, see: https://developers.google.com/maps/documentation/javascript/tutorial), Do you have a Google Maps API key you'd like to use?

I wasn't sure if you needed a Google Maps API key for using geospatial parameters with search_fullarchive.

Thanks!

IrenaItova commented 4 years ago

Hi @diegoreinero ,

Now that I think of it, you are right about the actual number of Tweets I can collect with my plan; I think the 1.25M on my dashboard refers to the maximum possible per month with a Premium plan.

I did some brief investigation of the search_fullarchive code, and I don't think the change Kevin made overrides the n argument. My impression is that he only changed the value of one line of code from 100 to 500. You can see the change via getAnywhere(.search_tweets), but perhaps you can check it yourself:

params$result_type <- NULL
if (grepl("full", query)) {
  params$maxResults <- 500
} else {
  params$maxResults <- 100
}

You are right about wasting Tweets. My experience taught me to first experiment on a small time frame with all the keywords I want to search and see which are more popular and which are less, so I can better budget my big searches by increasing the time span (fromDate to toDate) for the less popular words.
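That probing approach might look roughly like the sketch below (the term, one-hour window, and extrapolation are illustrative, and each probe costs at least one premium request):

library(rtweet)

# Probe a single hour to gauge how common a term is, then extrapolate.
probe_hourly_rate <- function(term, hour_from, hour_to, token) {
  probe <- search_fullarchive(
    q = paste(term, "lang:en -is:retweet"),
    n = 500,
    fromDate = hour_from,   # e.g. "202005031100"
    toDate   = hour_to,     # e.g. "202005031200"
    env_name = "research",  # your premium environment label
    token    = token
  )
  nrow(probe)  # tweets in that hour (capped at what the probe returned)
}

# Rough budget: tweets/hour * hours in the target window, divided by 500 per request
# rate <- probe_hourly_rate("sunbathing", "202005031100", "202005031200", my_token)
# ceiling(rate * 24 * 30 / 500)   # approx. requests needed for a 30-day window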

Regarding the geospatial issue, I did not have a Google key and I am not aware that it would help, but I've learnt that there are three ways Tweets are geotagged, directly related to what Twitter calls "Profile Geo enrichment" (which may work better with a PowerTrack subscription; I don't know). You can read more here:

Tweet geo metadata: https://developer.twitter.com/en/docs/tutorials/tweet-geo-metadata

Profile Geo enrichment: https://developer.twitter.com/en/docs/tweets/enrichments/overview/profile-geo

I hope this helps and good luck!

best wishes, Irena

ardiantovn commented 3 years ago

Where is the search_tweets.R file located? I went to ~/rtweet/R, but I only found these files: rtweet, rtweet.rdb, rtweet.rdx, sysdata.rdb, sysdata.rdx.

Thank you

AltfunsMA commented 3 years ago

The behaviour of search_30day and search_fullarchive with a sandbox token seems to be different from what @kevintaylor and @IrenaItova describe for their paid subscriptions.

n does determine the total number of tweets that these functions will attempt to get. If you put 100, 500, or 2000, that's the exact number of tweets you'll get. This is consistent with the documentation where it says "n Number of tweets to return; it is best to set this number in intervals of 100 for the '30day' API and either 100 (for sandbox) or 500 (for paid) for the 'fullarchive' API. Default is 100." (my emphasis; I believe they meant "in multiples of")

For the sandbox tier, you'll hit an error (and an annoying problem) if you set n to more than 3K tweets. It will quickly query 30*100 tweets and then throw an "Exceeded rate limit" warning, referring to the max number of requests per 15-minute window (which is 30 in sandbox). Twitter allows you to download up to 25K in this tier, which would be just fine for my purposes, but the rtweet function simply stops and you lose the pagination.

There is also no easy way to tryCatch your way around it. Unlike the get_* functions, search_30day does not take a pagination argument. These premium search functions are just wrappers of a wrapper of a wrapper of the main search function .search_tweets (@ardiantovn, use getAnywhere(.search_tweets) to see it, and note the dot before the name); and there seems to be a main pagination function (again not exported) called scroller, but I fail to see why it works properly for the basic search_tweets(..., retryonratelimit = TRUE), triggering a 15-minute wait, yet doesn't for the premium search wrappers.

I currently don't have any solutions; but I've been puzzled by the incongruence between my own experience and the ones reported above. I guess it must be the sandbox token limitations somehow changing the behaviour of search functions through the rate_limit() function... but that's just a guess. Any hints welcome!
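One stop-gap under those sandbox limits is sketched below: ask for at most 3,000 tweets (30 requests) per call and wait out the 15-minute window between calls. The window list, environment label, and sleep time are assumptions, not something rtweet does for you.

library(rtweet)

# Sandbox: 100 tweets per request, 30 requests per 15-minute window.
collect_sandbox <- function(q, date_windows, token, per_call = 3000) {
  out <- vector("list", length(date_windows))
  for (i in seq_along(date_windows)) {
    w <- date_windows[[i]]
    out[[i]] <- search_30day(
      q = q,
      n = per_call,          # stays within one 15-minute window's 30 requests
      fromDate = w$from,     # "YYYYMMDDHHMM"
      toDate   = w$to,
      env_name = "sandbox",  # your dev environment label
      token    = token
    )
    message("Window ", i, ": ", nrow(out[[i]]), " tweets; sleeping 15 minutes")
    Sys.sleep(15 * 60 + 10)  # wait out the rate-limit window before the next call
  }
  out
}

# date_windows would be a list like:
# list(list(from = "202006010000", to = "202006020000"),
#      list(from = "202006020000", to = "202006030000"))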

ardiantovn commented 3 years ago

@AltfunsMA Thank you very much for the explanation...🙏🏻

VINEET-KAUSHIK commented 3 years ago

Hello everyone, I recently got access to Twitter's 'academic product track' API. I am using the 'rtweet' package to search archival data with search_fullarchive(), but every time I get an INVALID TOKEN error. It seems to be some issue with OAuth 2.0. Please help!

Arf9999 commented 3 years ago

The academic product track is based on Twitter's v2 API, if I'm not mistaken. search_fullarchive() is based on v1.

VINEET-KAUSHIK commented 3 years ago

Thanks, @Arf9999

So, is there any way to access Twitter's v2 API from R? Some other package?

Thanks Vineet

llrs commented 3 years ago

@VINEET-KAUSHIK There is an issue, #468, that provides a function for using the academic product track. v2 is quite new (it is still in "Early Access"), so I'm not sure people have had time to develop new functions/packages using it.

llrs commented 3 years ago

Also closing the issue, as it is a duplicate of #317 and #347, which should be fixed by #375.