Instead of using a loop, I use crontab to launch a new script every hour. That script streams for one hour at a time. I'm not sure what crontab's equivalent is on Windows.
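For reference, a rough sketch of that setup (untested; the script path, query, and token handling are placeholders, not my actual values):
# Hypothetical crontab entry that launches the script at the top of every hour:
#   0 * * * * Rscript /path/to/stream_hourly.R
library(rtweet)
# Assumes an authenticated token is already set up for rtweet
q <- "your search terms"
path <- format(Sys.time(), 'tweets-%d_%m_%Y-%H_%M_%S.json')
# Stream for one hour (3600 seconds) into a file named after the current time
stream_tweets(q = q, parse = FALSE, timeout = 3600, file_name = path)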
On Windows you can use Task Scheduler, but I believe you can install cron on Windows as well. As for the loop in R, a way to avoid breaking the loop is to use try() or tryCatch(). I remember using it some time ago, unrelated to rtweet. https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/try
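For example, a minimal sketch with try() (untested; it assumes q already holds your streaming query):
while (TRUE) {
  path <- format(Sys.time(), 'tweets-%d_%m_%Y-%H_%M_%S.json')
  # try() keeps an error inside stream_tweets() from stopping the loop
  res <- try(stream_tweets(q = q, parse = FALSE, timeout = 3600, file_name = path), silent = TRUE)
  if (inherits(res, "try-error")) message("Stream failed, starting a new hourly file")
}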
Not sure what is happening here; could it be that you hit the rate limits? Is there a consistent number of hours after you start the loop at which the stream stops working? The "Reconnecting..." message, however, is misleading: the code doesn't actually attempt to reconnect.
Once #526 is merged, you'll be able to do something like this:
wait <- 0
repeat {
  path <- format(Sys.time(), 'tweets-%d_%m_%Y-%H_%M_%S.json')
  tryCatch({
    stream_tweets(q = q, parse = FALSE, timeout = 3600, file_name = path)
    # Reset the wait time after a successful request
    wait <<- 0
  }, error = function(e) {
    message("Error: ", e$message)
    wait <- max(wait, 1)
    message("Waiting for ", wait, " seconds")
    Sys.sleep(wait)
    # Exponential back-off: wait longer after each consecutive failure
    wait <<- min(wait * 2, 360)
  })
}
(I haven't tested this code, but it should be pretty close to what you need)
Problem
I'm using rtweet and the streaming APIs to collect live tweets on a number of different topics. I've set up the code using a loop, so that tweets collect for an hour into a .json file whose file name is the current date and time, and so that when the loop restarts, tweets from the following hour collect in a new .json file whose file name is the new current date and time. I'm doing so because I want to avoid having .json files that are too big to later parse and analyze (ideally I want all my .json files to be less than 1GB).

The problem is that the code runs smoothly for a few hours, but then every now and then I experience an error that says: "The stream disconnected prematurely. Reconnecting..." When that happens, it seems like tweets actually continue collecting (as the .json file's size keeps increasing), but the loop fails to restart, so tweets keep collecting into the same .json file. After several hours, sometimes I finally get the message "Finished streaming tweets!" and the loop manages to start again (and manages to collect tweets in separate .json files every hour again), but the .json file resulting from the premature disconnection is often too big by then.
Is there a way to edit my code so that the stream and the loop run continuously (and keep saving tweets into different .json files every hour) without running into the disconnection error? Also, is it possible that this problem is due to something outside of R or rtweet, like my computer needing updates?
I've tried to research stream disconnections on Twitter's Developer website, but I couldn't find a straightforward solution to this problem.
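For context, the later parsing step I have in mind for each hourly file looks roughly like this (a simplified sketch; it assumes a rtweet version that provides parse_stream()):
library(rtweet)
# Parse each hourly streaming file separately into a data frame of tweets
hourly_files <- list.files(pattern = "^tweets-.*\\.json$")
tweets_by_hour <- lapply(hourly_files, parse_stream)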
Expected behavior
Streaming and collecting live tweets continuously and without any interruptions, but saving the tweets in a different .json file every hour.
Reproduce the problem
rtweet version
Session info