jrosen48 / comparing-sentiment

recursively adding replies #10

Open jrosen48 opened 3 years ago

jrosen48 commented 3 years ago

I believe these lines work correctly to recursively search and add replies to the original dataset:

  tar_target(file_name_for_sample_of_tweets, here::here("data", "sample-of-tweets.rds"), format = "file"),
  tar_target(sample_of_tweets_for_thread_finding, read_rds(file_name_for_sample_of_tweets)),
  tar_target(extracted_status_ids, extract_status_ids(sample_of_tweets_for_thread_finding)),
  tar_target(replies_that_were_recursively_searched, get_replies_recursive(extracted_status_ids)),
  tar_target(original_tweets_with_replies_added, combine_original_with_reply_tweets(sample_of_tweets_for_thread_finding, replies_that_were_recursively_searched)),
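The two helpers around `get_replies_recursive()` are not shown in this thread. As a rough sketch only, they might look something like the following. The function bodies are hypothetical, and the `status_id` column name is an assumption based on the data frames rtweet returned at the time; base R is used so the example runs without extra packages.

```r
# Hypothetical sketches; the real helper definitions are not shown in this thread.

# Pull the unique, non-missing status IDs out of a tweet data frame.
extract_status_ids <- function(tweets) {
  ids <- unique(tweets$status_id)
  ids[!is.na(ids)]
}

# Append reply tweets to the originals, dropping any tweet that is already present.
combine_original_with_reply_tweets <- function(original, replies) {
  new_replies <- replies[!replies$status_id %in% original$status_id, , drop = FALSE]
  shared_cols <- intersect(names(original), names(new_replies))
  rbind(original[, shared_cols, drop = FALSE],
        new_replies[, shared_cols, drop = FALSE])
}

# Toy data standing in for the sample of tweets and the looked-up replies.
original <- data.frame(status_id = c("1", "2", NA), text = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
replies <- data.frame(status_id = c("2", "3"), text = c("b", "d"),
                      stringsAsFactors = FALSE)

ids <- extract_status_ids(original)
combined <- combine_original_with_reply_tweets(original, replies)
```

In this toy run, tweet "2" is in both inputs and is kept only once, so `combined` gains a single new row for tweet "3".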

For the sample used (sample-of-tweets.rds, which I uploaded to the data directory of our new OneDrive folder), around 200 tweets are added.

Just flagging this here, as I am going to reference it in a related issue about something that is not yet working reliably: returning an ID for the thread a tweet belongs to.

Just tagging you here @conradborchers, nothing to do.

jrosen48 commented 3 years ago

One thing you foreshadowed earlier, @conradborchers: rate limits may indeed be hit. From this sample, we add roughly 1/5 the number of original tweets we have, which would mean adding roughly 120,000 new tweets to our dataset if this were run on all of our data rather than on the sample.
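To make the scale concrete, here is a back-of-the-envelope calculation. The 1/5 ratio and the 120,000 estimate come from the comment above; the size of the full dataset is only implied by those two numbers, and the 90,000 lookups per 15-minute window is an assumption borrowed from the rate-limit version of the function later in this thread. It also assumes the originals are looked up again in the first iteration, as the current function does.

```r
# Figures from the comment above.
estimated_new_tweets <- 120000
originals_per_added_tweet <- 5   # "roughly 1/5 the number of original tweets"

# Implied by those two figures; not stated directly anywhere in the thread.
implied_original_tweets <- estimated_new_tweets * originals_per_added_tweet

# Assumption: 90,000 status lookups per 15-minute window.
lookups_per_window <- 90000
window_minutes <- 15

# Originals are re-fetched in the first iteration, so both groups count.
total_lookups <- implied_original_tweets + estimated_new_tweets
windows_needed <- ceiling(total_lookups / lookups_per_window)

# Waiting is only needed between windows, not after the last one.
minimum_wait_minutes <- (windows_needed - 1) * window_minutes

print(c(windows = windows_needed, wait_minutes = minimum_wait_minutes))
```

Under these assumptions a full run would span several rate-limit windows, so some form of waiting or batching seems unavoidable.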

conradborchers commented 3 years ago

I see... In the worst case, I will add some shell commands that halt the console. I will look into this later!

conradborchers commented 3 years ago

Assuming that this thread is now about discussing the code to access replies:

get_replies_recursive <- function(statuses) {

  statuses <- statuses[!is.na(statuses)]

  new_data <- rtweet::lookup_statuses(statuses)

  print(paste0("In this iteration, accessed ", nrow(new_data), " new Tweets"))

  new_statuses <- new_data$reply_to_status_id[!is.na(new_data$reply_to_status_id)]

  if (length(new_statuses) > 0) { # if some accessed tweets are replies to other tweets
    new_data_recursive <- get_replies_recursive(new_statuses) # get the tweets that were replied to
    out_data <- dplyr::bind_rows(new_data, new_data_recursive) # bind together the replies and the tweets they reply to
    return(out_data) # return visibly instead of relying on the invisible value of the assignment
  } else { # if there are no parent tweets left to get
    return(new_data) # return the tweets accessed in this iteration
  }
}

This works, and it is really cool. The only thing I wanted to add is that, the way we use this function right now, the first iteration looks up the statuses we already have, which could be confusing for users. For example, I just fed in 100 status_ids that we already have, and the function printed "accessed 98 new tweets" in the first iteration. The number is 98 because apparently 2 of these 100 tweets have been deleted recently. Could we change this for later use?
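One possible way to address this, sketched under assumptions: start the recursion from the parent IDs we do not hold yet, and report missing tweets separately from accessed ones. Both helper names below are hypothetical, and the `status_id` and `reply_to_status_id` columns are assumed to be character vectors as in the function above.

```r
# Hypothetical helper: parent IDs referenced by our tweets but absent from our data.
find_missing_parent_ids <- function(tweets) {
  parents <- unique(tweets$reply_to_status_id)
  parents <- parents[!is.na(parents)]
  setdiff(parents, tweets$status_id)
}

# Hypothetical helper: compare requested and returned IDs, so that deleted
# tweets show up as "missing" instead of silently lowering the count.
summarise_lookup <- function(requested_ids, returned_ids) {
  list(returned = sum(requested_ids %in% returned_ids),
       missing = setdiff(requested_ids, returned_ids))
}

# Toy data: tweet 11 replies to 10 (which we have), tweet 12 replies to 99 (which we lack).
tweets <- data.frame(status_id = c("10", "11", "12"),
                     reply_to_status_id = c(NA, "10", "99"),
                     stringsAsFactors = FALSE)

missing_parents <- find_missing_parent_ids(tweets)
lookup_summary <- summarise_lookup(c("10", "11", "12"), c("10", "12"))
```

With helpers like these, the pipeline could call `get_replies_recursive(find_missing_parent_ids(sample_of_tweets_for_thread_finding))`, and the printed count would then refer only to tweets that are new to the dataset.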

conradborchers commented 3 years ago

This alternative version of the function should work for avoiding rate limits:

get_replies_recursive <- function(statuses, total_n=0, first_timestamp=Sys.time()) { # in first iteration, total sum of looked-up tweet is 0

  if (length(statuses) > 90000){
    print("Please start with a number of statuses smaller than 90,000")
    print("Process aborted")
    return(0)
  }

  statuses <- statuses[!is.na(statuses)]

  reference_timestamp <- Sys.time()

  diff <- difftime(reference_timestamp, first_timestamp, units = "mins") %>% as.numeric()

  total_n <- total_n+length(statuses)

  if (total_n > 90000 && diff < 15) {  # if more than 90k tweets requested in under 15 mins
    make_15_mins_full <- ceiling(15 - diff)
    print(paste0("Rate limit reached. Setting console to sleep for ", make_15_mins_full, " minutes"))
    Sys.sleep(make_15_mins_full * 60)   # sleep for minutes * 60 seconds
    total_n <- length(statuses)         # new window: only the current batch counts
    first_timestamp <- Sys.time()       # update reference timestamp
  }

  new_data <- rtweet::lookup_statuses(statuses)

  print(paste0("In this iteration, accessed ", nrow(new_data), " new Tweets"))
  print(paste0("Current rate is a total of ", total_n, " new Tweets"))
  print(paste0("Current timeframe is ", diff, " minutes"))

  new_statuses <- new_data$reply_to_status_id[!is.na(new_data$reply_to_status_id)]

  if (length(new_statuses) > 0) { # if some accessed tweets are replies to other tweets
    new_data_recursive <- get_replies_recursive(new_statuses, 
                                                total_n=total_n,
                                                first_timestamp = first_timestamp) # get the tweets that were replied to
    out_data <- dplyr::bind_rows(new_data, new_data_recursive) # bind together the replies and the tweets they reply to
    return(out_data) # return visibly instead of relying on the invisible value of the assignment
  } else { # if there are no parent tweets left to get
    return(new_data) # return the tweets accessed in this iteration
  }
}

Let me know if this works for you, and we can come back to this later.
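For later, here is a minimal sketch of an iterative variant that might be easier to test. It is not the function above: it swaps the recursion for a loop, remembers which IDs were already requested so that no ID is looked up twice, and takes the lookup as an argument. The name `get_replies_iterative` and the `lookup_fn` argument are hypothetical; in real use `lookup_fn` would be `rtweet::lookup_statuses`, and the rate-limit wait could sit just before the call to it.

```r
# Hypothetical iterative variant; `lookup_fn` stands in for rtweet::lookup_statuses.
get_replies_iterative <- function(statuses, lookup_fn) {
  pending <- unique(statuses[!is.na(statuses)])
  seen <- character(0)   # IDs already requested, including deleted ones
  results <- list()

  while (length(pending) > 0) {
    batch <- lookup_fn(pending)
    seen <- union(seen, pending)
    results[[length(results) + 1]] <- batch

    # Next round: parents of this batch that have not been requested yet.
    parents <- batch$reply_to_status_id
    parents <- unique(parents[!is.na(parents)])
    pending <- setdiff(parents, seen)
  }

  do.call(rbind, results)   # NULL if nothing was requested
}

# In-memory stand-in for the API: tweet 3 replies to 2, which replies to 1.
fake_tweets <- data.frame(status_id = c("1", "2", "3"),
                          reply_to_status_id = c(NA, "1", "2"),
                          stringsAsFactors = FALSE)
fake_lookup <- function(ids) {
  fake_tweets[fake_tweets$status_id %in% ids, , drop = FALSE]
}

thread <- get_replies_iterative("3", fake_lookup)
```

Starting from tweet "3", the loop walks up the reply chain and stops once the root tweet has no parent left to fetch.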

conradborchers commented 3 years ago

Minor: You mean readRDS here?

tar_target(sample_of_tweets_for_thread_finding, read_rds(file_name_for_sample_of_tweets)),
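As far as I know, `readr::read_rds()` is a thin wrapper around base `readRDS()`, so either call should load the file, provided readr is attached in the pipeline. A small check on a temporary file (the toy data frame is made up for illustration):

```r
# Write a toy data frame to a temporary .rds file.
path <- tempfile(fileext = ".rds")
sample_tweets <- data.frame(status_id = c("1", "2"), stringsAsFactors = FALSE)
saveRDS(sample_tweets, path)

# Base R reader; needs no extra packages.
from_base <- readRDS(path)

# readr reader; only compared if readr happens to be installed.
if (requireNamespace("readr", quietly = TRUE)) {
  from_readr <- readr::read_rds(path)
  stopifnot(identical(from_base, from_readr))
}
```

Switching the target to `readRDS()` would mainly remove the dependency on readr for this one step.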