ropensci / rtweet

🐦 R client for interacting with Twitter's stream and REST APIs
https://docs.ropensci.org/rtweet

Problem with streaming connection #350

Closed: francescarluberti closed this issue 3 years ago

francescarluberti commented 5 years ago

Problem

I'm using rtweet and the streaming API to collect live tweets on a number of different topics. I've set up the code as a loop: tweets are collected for an hour into a .json file named with the current date and time, and when the loop restarts, the next hour's tweets are collected into a new .json file named with the new date and time. I'm doing this to avoid ending up with .json files that are too big to parse and analyze later (ideally, all of my .json files would be under 1 GB).

The problem is that the code runs smoothly for a few hours, but every now and then I hit an error that says: "The stream disconnected prematurely. Reconnecting..." When that happens, tweets seem to keep collecting (the .json file's size keeps increasing), but the loop fails to restart, so tweets keep accumulating in the same .json file. After several hours I sometimes finally get the message "Finished streaming tweets!" and the loop manages to start again (and to collect tweets into separate hourly .json files), but by then the .json file from the premature disconnection is often too big.

Is there a way to edit my code so that the stream and the loop run continuously (and keep saving tweets into a different .json file every hour) without running into the disconnection error? Also, is it possible that this problem comes from something outside of R or rtweet, like my computer needing updates?

I've tried to research solutions for disconnections on Twitter's developer website, but I couldn't find any straightforward solution to this problem.

Expected behavior

Streaming and collecting live tweets continuously and without any interruptions, while saving the tweets to a different .json file every hour.

Reproduce the problem


library(rtweet)

## Stream keywords used to filter tweets
## (built with paste() so the multi-line literal doesn't embed newlines
## and indentation into the query itself)
q <- paste(
  "#dating, dating, #fiance, fiancé, LGBT, #LGBT, LGBTQ, #LGBTQ,",
  "LGBTQIA, #LGBTQIA, #loveislove, wife, #wife, #womeninbusiness, #womaninbusiness,",
  "husband, #husband, marriage, #marriage, wedding, #wedding, bridal, #bridal, bridesmaids,",
  "#bridesmaids, bridesmaid, #bridesmaid, #womenempowerment, #girlpower, #transrightsarehumanrights,",
  "#womeninSTEM, #womenintech, #womaninbiz, #womeninbiz, #shesaidyes,",
  "#engagementring, #gaypride, #pridemonth, #LGBTrights, #LGBTQrights,",
  "#LBTQIArights, #marriageequality, #womenleaders, #femaleleaders,",
  "#womenwholead, #strongwomen, #strongwoman, #nastywoman, #nastywomen,",
  "#bridetobe, #weddingring, #weddingphotogoals, #weddinginspo,",
  "#weddinginspiration, #weddingdress, #weddinggown, #weddingplanning,",
  "#weddingseason, #womenwhocode, #girlswhocode, #womeninscience,",
  "#girlsinSTEM, #Rladies, #isaidyes, #diamondring, #gayrights,",
  "#samesexmarriage, #gaymarriage, #equalmarriage, #womanceo, #womenceos,",
  "#husbandgoals, #transrights, #transgenderrights, #protecttranskids,",
  "#weddingphotos, #weddingshoes")

## Stream time in seconds (for one minute, set timeout = 60)
## Stream for 1 hour
streamtime <- 3600

## Loop
n <- 0
repeat {
  ## Save each hour's tweets to a .json file named with the current date and time
  file <- paste0("tweets_", format(Sys.time(), '%d_%m_%Y__%H_%M_%S'), '.json')

## Collecting tweets
  stream_tweets(q = q, parse = FALSE, timeout = streamtime, file_name = file)

  n <- n + 1
  print(n)
  if (n == 168) break  ## stop after 168 hourly runs (roughly one week)
}

## output error message
The stream disconnected prematurely. Reconnecting... 

rtweet version

packageVersion("rtweet")
‘0.6.9’

Session info

sessionInfo()
R version 3.6.0 on Windows 10
ZacharyST commented 4 years ago

Instead of using a loop, I use crontab to launch a new script every hour. That script streams for one hour at a time. I'm not sure what crontab's equivalent is on Windows.
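
For concreteness, a rough sketch of that setup (the script name stream_hour.R, the path in the crontab line, and the single placeholder keyword are illustrative assumptions, not tested):

## stream_hour.R: stream for one hour into a timestamped .json file, then exit.
## Launched at the top of every hour by a crontab entry such as:
##   0 * * * * Rscript /path/to/stream_hour.R
library(rtweet)

file <- format(Sys.time(), 'tweets-%d_%m_%Y__%H_%M_%S.json')
stream_tweets(q = "dating", parse = FALSE, timeout = 3600, file_name = file)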

mrmvergeer commented 4 years ago

On Windows you can use Task Scheduler, but I believe you can install cron on Windows as well. As for the loop in R, a way to avoid breaking the loop is to wrap the call in try() or tryCatch(). I remember using it some time ago, unrelated to rtweet: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/try
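
For example, a minimal sketch of guarding the stream_tweets() call inside the loop from the original post with try() (untested; it reuses the q, streamtime, and file variables defined above):

  ## try() returns an object of class "try-error" instead of stopping the loop
  res <- try(stream_tweets(q = q, parse = FALSE, timeout = streamtime, file_name = file),
             silent = TRUE)
  if (inherits(res, "try-error")) message("Stream failed; starting the next hourly file")

On error, execution continues, so the repeat loop can move on to a fresh file.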

llrs commented 3 years ago

Not sure what is happening here; could it be that you are hitting rate limits? Is there a consistent amount of time after you start the loop at which the stream stops working? The "Reconnecting..." message, however, is misleading: the code doesn't actually attempt to reconnect.
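
As an aside, rtweet can report your REST API rate limits, which may help rule that out (streaming disconnects are governed separately, so this is only a partial check; a sketch, not tested against this setup):

library(rtweet)

## Remaining calls, caps, and reset times for the REST API endpoints
limits <- rate_limit()
head(limits)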

hadley commented 3 years ago

Once #526 is merged, you'll be able to do something like this:

wait <- 0

repeat {
  path <- format(Sys.time(),'tweets-%d_%m_%Y-%H_%M_%S.json')

  tryCatch({
    stream_tweets(q = q, parse = FALSE, timeout = 3600, file_name = path)

    # Reset wait time after successful request
    wait <<- 0
  }, error = function(e) {
    message("Error: ", e$message)

    wait <- max(wait, 1)
    message("Waiting for ", wait, " seconds")
    Sys.sleep(wait)

    # Exponential back-off: double the wait after each successive failure
    wait <<- min(wait * 2, 360)
  })
}

(I haven't tested this code, but it should be pretty close to what you need)