search_tweets function - GitHub Issues

hussainshehadeh commented 7 years ago

I am trying to use the function search_tweets, but instead of manually inputting the query, I have it in a dataframe. Is there a way for the function to read the text from the dataframe?

Here is my code: Retweeters1 <- search_tweets(Tweet2, n=100), where Tweet2 is a dataframe.

Crackz commented 7 years ago

I guess you want to use every cell in a specific column for searching. If so:

queryString <- paste0(unlist(Tweet2$queryStrings), collapse = " ")
Retweeters1 <- search_tweets(queryString , n=100)

hussainshehadeh commented 7 years ago

The queryString is not giving me anything. It's empty. What's the solution?
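One possible cause of an empty string, sketched below with a made-up `Tweet2`: the column name `queryStrings` in the snippet above was only a placeholder, and `$` returns `NULL` for a column that does not exist, so `paste0()` collapses nothing into `""`. The column names here are assumptions for illustration; substitute the real name of your column.

```r
# Toy data frame; "queryStrings" is an assumed column name.
Tweet2 <- data.frame(
  queryStrings = c("rstats", "data science"),
  stringsAsFactors = FALSE
)

# Existing column: all cells are joined into one query string.
query_string <- paste0(unlist(Tweet2$queryStrings), collapse = " ")
print(query_string)

# Non-existent column: `$` gives NULL, and the result is an empty string.
missing_query <- paste0(unlist(Tweet2$no_such_column), collapse = " ")
print(nzchar(missing_query))
```

Running `names(Tweet2)` on the real data frame shows which column name to use in place of `queryStrings`.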

mrmvergeer commented 7 years ago

try this code. You'll get a data frame for each keyword:

keyword <- readLines("keywords.txt")
df.key <- paste("keyword_", 1:length(keyword), sep = "")
for (i in 1:length(keyword)) {
  d.frame <- search_tweets(keyword[i], n = 100)
  assign(df.key[i], d.frame)
  Sys.sleep(1)
}
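As a variation on the loop above, the results can be kept in one named list instead of creating separate variables with `assign()`. This is only a sketch: `fake_search()` is a hypothetical stand-in for `search_tweets()` so the example runs without Twitter access, and the keywords are made up.

```r
# Hypothetical stand-in for search_tweets(); returns a tiny data frame.
fake_search <- function(q, n = 100) {
  n_rows <- min(n, 2L)
  data.frame(
    query = rep(q, n_rows),
    id = seq_len(n_rows),
    stringsAsFactors = FALSE
  )
}

keyword <- c("rstats", "dplyr")

# One list element per keyword, named keyword_1, keyword_2, ...
results <- lapply(keyword, fake_search, n = 100)
names(results) <- paste0("keyword_", seq_along(keyword))

print(names(results))
print(results[["keyword_2"]])
```

A list keeps the workspace tidy and makes it easy to loop over all results later, for example with `lapply(results, nrow)`.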

mkearney commented 7 years ago

I think both of those answers above will work depending on the context. If you have a column of queries, you could also write a function like this to vectorize search_tweets():

#' search_tweets_queries
#'
#' @param x Vector of search queries.
#' @param n Number of tweets to return per query. Defaults to 100.
#' @param \dots Other arguments passed on to \code{search_tweets}.
#' @return A tbl data frame with additional "query" feature.
search_tweets_queries <- function(x, n = 100, ...) {
  ## check inputs
  stopifnot(is.atomic(x), is.numeric(n))
  if (length(x) == 0L) {
    stop("No query found", call. = FALSE)
  }  
  ## search for each string in column of queries
  rt <- lapply(x, search_tweets, n = n, ...)
  ## add query variable to data frames
  rt <- Map(cbind, rt, query = x, stringsAsFactors = FALSE)
  ## merge users data into one data frame
  rt_users <- do.call("rbind", lapply(rt, users_data))
  ## merge tweets data into one data frame
  rt <- do.call("rbind", rt)
  ## set users attribute
  attr(rt, "users") <- rt_users
  ## return tibble (validate = FALSE makes it a bit faster)
  tibble::as_tibble(rt, validate = FALSE)
}

You could then pass a column of [multiple] queries to the function

## create data frame with query column
Tweet2 <- data.frame(
  query = c("\"rstats\"", "\"data science\""),
  n = rnorm(2),
  stringsAsFactors = FALSE
)

## pass query column on to the new function defined above
rt <- search_tweets_queries(Tweet2$query)
Searching for tweets...
Finished collecting tweets!
Searching for tweets...
Finished collecting tweets!

## preview data
> rt
# A tibble: 200 x 39
            status_id          created_at            user_id    screen_name
                <chr>              <dttm>              <chr>          <chr>
 1 888478382916227077 2017-07-21 19:18:48          384227341       alkadrii
 2 888478270106349568 2017-07-21 19:18:21 821976676842242050  DeborahTannon
 3 888478148425273344 2017-07-21 19:17:52 781019469875412992      wittmaan1
 4 888478118788435968 2017-07-21 19:17:45           65105528 fabianmmueller
 5 888478116179562496 2017-07-21 19:17:44         1036896870       stwtseng
 6 888477761446318080 2017-07-21 19:16:20          248192696       mmmgaber
 7 888477509913911296 2017-07-21 19:15:20         3230388598      dataandme
 8 888477439369904128 2017-07-21 19:15:03          144592995      Rbloggers
 9 888477433912893440 2017-07-21 19:15:02           14993767  ramnarasimhan
10 888477233609834496 2017-07-21 19:14:14          136276078         bekinc
# ... with 190 more rows, and 35 more variables: text <chr>, source <chr>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>, query <chr>
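The core pattern inside `search_tweets_queries()` (run one search per query, tag each result with its query, then stack the results) can be tried offline. In the sketch below, `fake_search()` is a hypothetical stand-in for `search_tweets()` and the returned columns are invented; only the `lapply()`, `Map()` and `do.call("rbind", ...)` steps mirror the function above.

```r
# Hypothetical stand-in for search_tweets(); not part of rtweet.
fake_search <- function(q, n = 3) {
  data.frame(
    status_id = paste0(q, "_", seq_len(n)),
    text = paste("tweet about", q),
    stringsAsFactors = FALSE
  )
}

queries <- c("rstats", "data science")

# One data frame per query.
results <- lapply(queries, fake_search, n = 2)

# Add the query that produced each data frame as a new column.
results <- Map(cbind, results, query = queries, stringsAsFactors = FALSE)

# Stack everything into a single data frame.
combined <- do.call("rbind", results)
print(combined)
```

The extra `query` column is what lets you split or group the combined data by search term afterwards.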

hussainshehadeh commented 7 years ago

All I have is 1 column and 1 row. I want to search for tweets using that text.

mkearney commented 7 years ago

Ahh all you need then is

rt <- search_tweets(Tweet2[1, 1])

Or

rt <- search_tweets(Tweet2[[1]])

hussainshehadeh commented 7 years ago

It searches, but I get empty results and this message:

Searching for tweets...
Finished collecting tweets!
Warning message:
In chars[!ok] <- unlist(lapply(chars[!ok], encode)) :
  number of items to replace is not a multiple of replacement length

mkearney commented 7 years ago

Either the value in your data frame isn't making a good query or, more likely, you've run into a bug in an older version. Try installing the latest version:

detach("package:rtweet")
if (!"devtools" %in% installed.packages()) {
  install.packages("devtools")
}
devtools::install_github("mkearney/rtweet")

You may have to restart your session--sometimes it throws a fit when you install during a session that's already been using rtweet.

If you get the same message, then it'll probably be your data, but I bet the message goes away with an update.
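To confirm which version ended up installed, something like the following could be used. `has_min_version()` is a hypothetical helper written for this sketch, not an rtweet or devtools function; it returns `FALSE` when the package is not installed at all.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Check rtweet against the version mentioned in this thread.
print(has_min_version("rtweet", "0.4.8"))
```

`utils::packageVersion("rtweet")` on its own prints the exact installed version.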

hussainshehadeh commented 7 years ago

I am getting this now: Error: is.atomic(q) is not TRUE

And what's the latest rtweet version? 0.4.8?

mkearney commented 7 years ago

Yes, that's the most recent version.

Can you post your code? It's hard for me to figure out what exactly is going on with just the error message.

One potential problem would be if you have strings as factors:

> x <- data.frame(a = "\"rstats\"")
> search_tweets(x[1, 1])
Error in nchar(q) : 'nchar()' requires a character vector
> x[1, 1]
[1] "rstats"
Levels: "rstats"
> x <- data.frame(a = "\"rstats\"", stringsAsFactors = FALSE)
> x[1, 1]
[1] "\"rstats\""
> search_tweets(x[1, 1])
Searching for tweets...
Finished collecting tweets!
# A tibble: 100 x 38
            status_id          created_at            user_id   screen_name
                <chr>              <dttm>              <chr>         <chr>
 1 888533131707301888 2017-07-21 22:56:21           14639660   FatPlatypus
 2 888533024463343618 2017-07-21 22:55:55          215035672    beeonaposy
 3 888532553950310400 2017-07-21 22:54:03           13074042    juliasilge
 4 888532189868118016 2017-07-21 22:52:36         1096058449      itatiVCS
 5 888531807175491585 2017-07-21 22:51:05         2944647704 andrew_benesh
 6 888530848399503360 2017-07-21 22:47:17 848804341880242176    indra_eko3
 7 888530831819628544 2017-07-21 22:47:13          347261357     DaveRubal
 8 888530779470520325 2017-07-21 22:47:00           69460360     YvesMessy
 9 888530620506394625 2017-07-21 22:46:22          560431626      antuki13
10 888530334643625987 2017-07-21 22:45:14         1544327005   bigboardsio
# ... with 90 more rows, and 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <chr>, reply_to_user_id <chr>,
#   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

As for your specific error message, it says you're not supplying an atomic vector as your query, which means you're probably supplying the data frame.

x is the data frame from above. It only comes back FALSE if it's left as a data frame.

> is.atomic(x[[1]])
[1] TRUE
> is.atomic(x[1, 1])
[1] TRUE
> is.atomic(x[1, ])
[1] TRUE
> is.atomic(x[, ])
[1] TRUE
> is.atomic(x)
[1] FALSE

Probably the safest thing to do would be to specify the column, the first observation, and use as.character().

## variable name. first obs. as character.
rt <- search_tweets(as.character(x[[varname]][1]))

## first column. first row. as.character.
rt <- search_tweets(as.character(x[[1]][1]))
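The same idea can be wrapped in a small checking function. This is a sketch only: `first_query()` is a hypothetical helper, not part of rtweet. Note that since R 4.0.0 `data.frame()` defaults to `stringsAsFactors = FALSE`, so the factor case mainly arises on older R versions or when the option is set explicitly, as it is here.

```r
# Hypothetical helper: return the first cell of a column as a plain string.
first_query <- function(df, column = 1) {
  stopifnot(is.data.frame(df), nrow(df) >= 1L)
  q <- as.character(df[[column]][1])
  # Fail early if the value is missing or empty.
  stopifnot(length(q) == 1L, !is.na(q), nzchar(q))
  q
}

# Force the factor case explicitly for the demonstration.
x_factor <- data.frame(a = "\"rstats\"", stringsAsFactors = TRUE)
print(is.factor(x_factor[1, 1]))

q <- first_query(x_factor)
print(is.character(q))
```

The returned value could then be passed on as `search_tweets(q)`.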

hussainshehadeh commented 7 years ago

Here is the full code:

# Extracting retweeter details
A <- search_tweets("#dayofrage", n=1000)

# Extract tweets that do not include RT in the text
A11 <- A[!grepl("RT @", A$text),]

# Take tweets that have at least 10 and fewer than 100 retweets
AAA <- subset(A11, retweet_count >= 10 & retweet_count < 100)

# Select the tweet with the maximum number of retweets
ABB <- AAA[which.max(AAA$retweet_count),]

# Find the user who tweeted the tweet
User <- as.data.frame(ABB[4], drop=FALSE)

Tweet <- as.data.frame(ABB[5], drop=FALSE)
TweetPrint <- print(Tweet[1,1])
Tweet2 <- data.frame(TweetPrint)

rt <- search_tweets(Tweet2[1])

hussainshehadeh commented 7 years ago

What I am trying to do is look for tweets with the hashtag #Dayofrage, then find the ones that received between 10 and 100 retweets, and see if the tweeter has fewer than 75,000 followers. If so, I want to use the details of that tweet to search for the retweeters.

mkearney commented 7 years ago

Here's what I would do. First, get the data you're looking for (tweets using the #dayofrage hashtag from users with fewer than 75k followers).

library(rtweet)
library(dplyr)

## search for day of rage tweets
dor <- search_tweets("DayofRage", n = 18000)

## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
  users_data() %>%
  unique() %>%
  right_join(dor) %>%
  filter(!is_retweet) %>%
  dplyr::select(status_id, retweet_count, followers_count, text)

## filter using both conditions
dat <- dat %>%
  filter(retweet_count >= 10 & followers_count < 75000)
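If the upper bound of 100 retweets mentioned earlier in the thread should also apply, it can be added as a third condition. The sketch below uses base R and a made-up data frame (all values are invented) so it runs without rtweet or dplyr; with dplyr the same logical expression would go inside `filter()`.

```r
# Toy data standing in for `dat`; the values are invented.
dat <- data.frame(
  status_id = c("1", "2", "3", "4"),
  retweet_count = c(5L, 12L, 250L, 40L),
  followers_count = c(500L, 20000L, 1000L, 90000L),
  stringsAsFactors = FALSE
)

# Keep tweets with 10 to 100 retweets from users with fewer than 75k followers.
keep <- dat$retweet_count >= 10 &
  dat$retweet_count <= 100 &
  dat$followers_count < 75000

filtered <- dat[keep, ]
print(filtered)
```

Only the second row passes all three conditions in this toy data.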

Soon, you'll be able to use statuses_retweets() (or something similarly named) to get data directly on the retweeters. But I have only added the skeleton of that new API call, so it's not quite ready yet.

For now, you can take part of the text of a tweet and use search_tweets() to get the retweet data. If it's a long tweet, or if it includes certain symbols, passing the entire text of a tweet to search_tweets() seems to return only the original tweet. This is mostly because retweets get truncated: adding the "RT: " at the beginning of the string can push retweets over the otherwise observed 140-character limit, so retweets are often truncated in the API calls. But using only part of the tweet seems to work great:

> ## text of first tweet
> dat$text[1]
[1] "A perfect depiction of what happened at the #DayOfRage protests - #Israel didnt want Muslims praying. FULL STOP \nhttps://t.co/TDlfb8mpwA https://t.co/bhKspzufsH"
> 
> ## search for part of tweet
> rts <- search_tweets("A perfect depiction of what happened at the #DayOfRage protests")
Searching for tweets...
Finished collecting tweets!
> rts
# A tibble: 13 x 38
            status_id          created_at            user_id    screen_name
                <chr>              <dttm>              <chr>          <chr>
 1 888759971831136256 2017-07-22 13:57:44          297242407      178kakapo
 2 888753150085263360 2017-07-22 13:30:37 789570670510403584       therkut1
 3 888748221031477248 2017-07-22 13:11:02 781224213684387840       Lougris2
 4 888746029952192512 2017-07-22 13:02:20          448086598 AnonymoonKheir
 5 888745050649350148 2017-07-22 12:58:26          568828814     rico_hands
 6 888738686581583872 2017-07-22 12:33:09           10903242       aliakcay
 7 888734864396124161 2017-07-22 12:17:58         1527652524 ahmed_mekhamer
 8 888734105197662208 2017-07-22 12:14:57          316113260    A7medHakami
 9 888732239026159616 2017-07-22 12:07:32         1605509936   DrMAMMohamed
10 888730370522333184 2017-07-22 12:00:06          873317130        20tree9
11 888730293577826304 2017-07-22 11:59:48          723325134    moonicegang
12 888730262489747456 2017-07-22 11:59:41         3301703894       nazem239
13 888729925724844036 2017-07-22 11:58:20           81136269  MiddleEastMnt
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

Of course, the downside is that this last step, selecting which part of the tweet to use as the query, would be difficult to automate. Assuming the retweets() APIs get up and running soon, though, this shouldn't be a problem for much longer!
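The truncation argument can be illustrated with plain string arithmetic. The sketch below assumes, purely for illustration, that truncation is a hard cut at 140 characters and that the retweet prefix looks like `RT @some_user: ` (a made-up handle); the real API behaviour may differ in detail.

```r
# A made-up original tweet of 134 characters (27 words of 4 letters).
original <- paste(rep("word", 27), collapse = " ")

# Assumed retweet prefix with a hypothetical user name (15 characters).
prefix <- "RT @some_user: "
retweet <- paste0(prefix, original)
print(nchar(original))
print(nchar(retweet))

# Assumption: the retweet text is cut off at 140 characters.
truncated <- substr(retweet, 1, 140)

# The full original text no longer matches the truncated retweet ...
full_match <- grepl(original, truncated, fixed = TRUE)

# ... but a query built from the first 8 words still does.
short_query <- paste(strsplit(original, " ")[[1]][1:8], collapse = " ")
short_match <- grepl(short_query, truncated, fixed = TRUE)

print(c(full = full_match, short = short_match))
```

This is why a shorter leading fragment is a safer query than the complete tweet text.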

hussainshehadeh commented 7 years ago

Okay, the top part looks perfect, and it's running perfectly. Is there a way I can then search for the retweeters? Like extract the first 6 words from the retweet, and add

RT @...: Tweet

If this works then it solves my whole issue.

hussainshehadeh commented 7 years ago

I don't mind doing this for the whole list in dat

mkearney commented 7 years ago

That's a good question. It may be enough to just take the first few words-- you can always inspect later to see if the text of the tweets is the same for all tweets.

The code below will automate the process. Searching for the first 8 or so words works (see data1). And, I wasn't sure it would work, but your idea of adding RT to the beginning of each string did as well (see data2)!

You can see the number of returned observations differs slightly. I'm guessing your RT approach is the way to go. In fact, I'm probably going to steal this idea and use it for other things as well :).


> x <- lapply(strsplit(dat$text, " "), "[", 1:8)
> x <- lapply(x, na.omit)
> x <- vapply(x, paste, collapse = " ", character(1))
> x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
> x <- sapply(x, URLencode, USE.NAMES = FALSE)
> ## first 8(ish) words of each tweet
> data1 <- lapply(x[1:10], search_tweets, verbose = FALSE)
> ## include explicit "RT:" at beginning
> data2 <- lapply(paste("RT: ", x[1:10]), search_tweets, verbose = FALSE)
> ## compare N of obs for each method
> sapply(data1, nrow)
 [1]  10  16   9  10  15  13  13 100  28  25
> sapply(data2, nrow)
 [1]   8  15   7   8  13  11   9 100  27  24
> ## preview RT method data
> data2
[[1]]
# A tibble: 8 x 38
           status_id          created_at            user_id   screen_name
               <chr>              <dttm>              <chr>         <chr>
1 888789521445289984 2017-07-22 15:55:09           49804099    moniqueb54
2 888787439816515587 2017-07-22 15:46:53         3248144099  ArmingAttano
3 888781441785528320 2017-07-22 15:23:03           77715831 LauriersRoses
4 888778665743589376 2017-07-22 15:12:01          111381605     framboazz
5 888778614803705856 2017-07-22 15:11:49         2612041957   Mariableuee
6 888777605910069250 2017-07-22 15:07:48 848689021995896832  Tiberdanie93
7 888777300019474433 2017-07-22 15:06:35         3245309031    AidGarRmnc
8 888776293365559296 2017-07-22 15:02:35         2938109319       OPHIUSE
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[2]]
# A tibble: 15 x 38
            status_id          created_at            user_id     screen_name
                <chr>              <dttm>              <chr>           <chr>
 1 888779036117405696 2017-07-22 15:13:29          103537650       samanello
 2 888770017378095104 2017-07-22 14:37:39         2267411377       SaMioN777
 3 888769517287075840 2017-07-22 14:35:40          952494301     dnoblesolja
 4 888764962889494529 2017-07-22 14:17:34          110065142 captain_amarito
 5 888760219446235136 2017-07-22 13:58:43           62322497     ratcatcher2
 6 888759097088122880 2017-07-22 13:54:15         2897026934       adi_marji
 7 888758110118170625 2017-07-22 13:50:20           41994475          olajam
 8 888758004820004864 2017-07-22 13:49:55         2959660574      hanaandidi
 9 888757702981066753 2017-07-22 13:48:43 871766034214211584      MuslimFrom
10 888757425884418052 2017-07-22 13:47:37          782493620         wnwinjz
11 888757309777879041 2017-07-22 13:47:09 828618053462196224     NADALRAFAT1
12 888757129049460736 2017-07-22 13:46:26          958083241    LJ_Brodigan1
13 888757113094377474 2017-07-22 13:46:22 843492310352642048 KhalidM12094546
14 888756807732166657 2017-07-22 13:45:09         2580813590   ElaineNiddery
15 888756646700351488 2017-07-22 13:44:31          219336060  jairogarciacol
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[3]]
# A tibble: 7 x 38
           status_id          created_at    user_id    screen_name
               <chr>              <dttm>      <chr>          <chr>
1 888769483120283648 2017-07-22 14:35:31   41722795      MoniMatos
2 888760938345713665 2017-07-22 14:01:34  537210019          zrbar
3 888755751069466625 2017-07-22 13:40:57  608510310     Rayisray42
4 888751302729240576 2017-07-22 13:23:17 1202276508     Palosypala
5 888750082304073728 2017-07-22 13:18:26  103681269 PalestineFacts
6 888750079430885376 2017-07-22 13:18:25   88183759           itv5
7 888749234345959425 2017-07-22 13:15:04 1551994398   JJohnexley46
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[4]]
# A tibble: 8 x 38
           status_id          created_at            user_id     screen_name
               <chr>              <dttm>              <chr>           <chr>
1 888779979823210497 2017-07-22 15:17:14         2663970236 TheRedTherapist
2 888751797862629376 2017-07-22 13:25:15 722498830307102720         moqolof
3 888749815462580224 2017-07-22 13:17:22          430055091        M_Beaman
4 888749654824951808 2017-07-22 13:16:44         3020018765 BenjaminKweskin
5 888749504853397504 2017-07-22 13:16:08 883808086657552385     lukestevo91
6 888749244064124928 2017-07-22 13:15:06         1468290938          naz548
7 888746415824015360 2017-07-22 13:03:52          421918538     Chordiegurl
8 888746179370000384 2017-07-22 13:02:55           16568637     marcynewman
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[5]]
# A tibble: 13 x 38
            status_id          created_at            user_id    screen_name
                <chr>              <dttm>              <chr>          <chr>
 1 888786153196838913 2017-07-22 15:41:46           49766803         RLZaki
 2 888778871184785413 2017-07-22 15:12:50          103537650      samanello
 3 888759971831136256 2017-07-22 13:57:44          297242407      178kakapo
 4 888748221031477248 2017-07-22 13:11:02 781224213684387840       Lougris2
 5 888746029952192512 2017-07-22 13:02:20          448086598 AnonymoonKheir
 6 888745050649350148 2017-07-22 12:58:26          568828814     rico_hands
 7 888738686581583872 2017-07-22 12:33:09           10903242       aliakcay
 8 888734864396124161 2017-07-22 12:17:58         1527652524 ahmed_mekhamer
 9 888734105197662208 2017-07-22 12:14:57          316113260    A7medHakami
10 888732239026159616 2017-07-22 12:07:32         1605509936   DrMAMMohamed
11 888730370522333184 2017-07-22 12:00:06          873317130        20tree9
12 888730293577826304 2017-07-22 11:59:48          723325134    moonicegang
13 888730262489747456 2017-07-22 11:59:41         3301703894       nazem239
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[6]]
# A tibble: 11 x 38
            status_id          created_at            user_id     screen_name
                <chr>              <dttm>              <chr>           <chr>
 1 888752844492472321 2017-07-22 13:29:25         2267411377       SaMioN777
 2 888750364597657600 2017-07-22 13:19:33         2738627253  UzairVedachhia
 3 888745652481642496 2017-07-22 13:00:50          169626752 MushyMelbowHead
 4 888745072824659970 2017-07-22 12:58:32          568828814      rico_hands
 5 888730702581084160 2017-07-22 12:01:25          873317130         20tree9
 6 888728692402335744 2017-07-22 11:53:26          473851713   ButcherMartin
 7 888728091916369921 2017-07-22 11:51:03          250610814       MunaAmina
 8 888727617406468097 2017-07-22 11:49:10          723325134     moonicegang
 9 888725176183136256 2017-07-22 11:39:28         2725537397      saifi_inam
10 888724023168860160 2017-07-22 11:34:53 712779203863818241    madeniexdan1
11 888723387517894660 2017-07-22 11:32:21         2717291006      NahimShah1
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[7]]
# A tibble: 9 x 38
           status_id          created_at            user_id     screen_name
               <chr>              <dttm>              <chr>           <chr>
1 888750768802504705 2017-07-22 13:21:10          280391708        1ncamera
2 888746347985285120 2017-07-22 13:03:36 745259762196037632 vincemalumbono2
3 888745085512364032 2017-07-22 12:58:35          568828814      rico_hands
4 888721779988058113 2017-07-22 11:25:58          235893581      Palkomitee
5 888720487903318019 2017-07-22 11:20:50         2964861784   SlimaniKheira
6 888718429661913089 2017-07-22 11:12:39           44657691       Nazeera_L
7 888718277106708481 2017-07-22 11:12:03          219630174        khosseni
8 888717687672713220 2017-07-22 11:09:42          347951637  musulmanfrance
9 888717001333633024 2017-07-22 11:06:59 843492310352642048 KhalidM12094546
# ... with 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[8]]
# A tibble: 100 x 38
            status_id          created_at            user_id screen_name
                <chr>              <dttm>              <chr>       <chr>
 1 888798186411663360 2017-07-22 16:29:35          404004591    biodivan
 2 888787611438985216 2017-07-22 15:47:34           81850406    jibran11
 3 888784441182756864 2017-07-22 15:34:58           49766803      RLZaki
 4 888782769974681601 2017-07-22 15:28:19          364082745     iizajun
 5 888782701284741121 2017-07-22 15:28:03          730931066     Haych_K
 6 888781750159249409 2017-07-22 15:24:16         1599580136   Arwa_A_11
 7 888778809562017793 2017-07-22 15:12:35          409762432  frawlaqmor
 8 888776520298143745 2017-07-22 15:03:29 766394253706686464  matt_ogdie
 9 888776075232333825 2017-07-22 15:01:43         2364544256 AmirbehnamM
10 888776059851866112 2017-07-22 15:01:39         3938910364     nshsfgl
# ... with 90 more rows, and 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[9]]
# A tibble: 27 x 38
            status_id          created_at            user_id     screen_name
                <chr>              <dttm>              <chr>           <chr>
 1 888769581094805505 2017-07-22 14:35:55          876374886          fahume
 2 888768246975000576 2017-07-22 14:30:37         1901643534     IamaCamera2
 3 888768235608317952 2017-07-22 14:30:34 812884772431151104       rrredRaiN
 4 888760019491147776 2017-07-22 13:57:55         1041879914 jamesjmcmenamin
 5 888757861735583744 2017-07-22 13:49:21         3219202662    nazonorekisi
 6 888749527318093824 2017-07-22 13:16:14          545992640   garryowen1888
 7 888748498585346048 2017-07-22 13:12:08          325110024      tosh_allan
 8 888745942693941248 2017-07-22 13:01:59          123771138         khussh_
 9 888739321154383876 2017-07-22 12:35:40          710862930 fabricedelamort
10 888738891217457152 2017-07-22 12:33:58         2677180819     barbarosa69
# ... with 17 more rows, and 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

[[10]]
# A tibble: 24 x 38
            status_id          created_at            user_id     screen_name
                <chr>              <dttm>              <chr>           <chr>
 1 888790484633649152 2017-07-22 15:58:59         1287447402     Cybrarian64
 2 888783833612652544 2017-07-22 15:32:33         1599580136       Arwa_A_11
 3 888779237431197698 2017-07-22 15:14:17          737608020  CaesarZaccaria
 4 888765299075543044 2017-07-22 14:18:54           22644834 Bogsideandproud
 5 888762932540178433 2017-07-22 14:09:30          143593029   OceanadeSilva
 6 888747364562989056 2017-07-22 13:07:38          837484490    susannefoort
 7 888745007783399424 2017-07-22 12:58:16          568828814      rico_hands
 8 888741615086575617 2017-07-22 12:44:47 842888141673394178       AquaDeAce
 9 888730346224664577 2017-07-22 12:00:01          873317130         20tree9
10 888728460616773634 2017-07-22 11:52:31 839903300250185729       Nordin75V
# ... with 14 more rows, and 34 more variables: text <chr>, source <chr>,
#   reply_to_status_id <lgl>, reply_to_user_id <lgl>,
#   reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
#   favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
#   media_url <list>, media_t.co <list>, media_expanded_url <list>,
#   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
#   ext_media_expanded_url <list>, ext_media_type <lgl>,
#   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
#   quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
#   retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
#   country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
#   bbox_coords <list>

mkearney commented 7 years ago

Btw, you may want to set n higher just in case.

df <- lapply(x, search_tweets, verbose = FALSE, n = 1000)

Or whatever number you set it at, just follow up with more searches for anything that hits the cap.
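
A rough way to spot which queries hit the cap, sketched with toy data frames standing in for the search_tweets() results. The names results, n_cap and hit_cap are illustrative only and not part of rtweet:

```r
## toy stand-ins for the list returned by lapply(x, search_tweets, n = n_cap)
n_cap <- 5
results <- list(
  data.frame(status_id = as.character(1:5), stringsAsFactors = FALSE),
  data.frame(status_id = as.character(1:2), stringsAsFactors = FALSE),
  data.frame(status_id = as.character(1:5), stringsAsFactors = FALSE)
)
names(results) <- c("query one", "query two", "query three")

## treat a query as having hit the cap when it returns at least n_cap rows
hit_cap <- vapply(results, nrow, integer(1)) >= n_cap

## queries worth following up with a larger n
names(results)[hit_cap]
```

This assumes a full-length result means more tweets are probably available, which is a heuristic and not a guarantee.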

hussainshehadeh commented 7 years ago

I am a bit confused now. Okay let me explain exactly what I am trying to get:

Search for a hashtag
Find the tweet that received between 10 and 100 retweets
Check for tweeters that have 10,000 to 70,000 followers
If so, find the retweeters of that tweet
Then download the list of followers of that user (who tweeted), and the list of retweeters (of the tweet)
Then determine the percentage of users who follow the user and retweeted the post.
Finally I want the username, number of followers, number of retweets, tweet, and the percentage to be added to a sheet.

hussainshehadeh commented 7 years ago

The current code searches for the hashtag and finds the right tweets and the right users, but how can I use it to find the retweeters and then download the list of followers of the tweeter? The rest is manageable.

hussainshehadeh commented 7 years ago

Let me show you how I used to carry out this process but manually:

library(rtweet)

#Extracting retweeter details
A <- search_tweets("RT @DaddyJew: I'm not battling depression.", n=1000)

#Extract the user_id column from A, which holds the user IDs.
B <- as.data.frame(A["user_id"])

# Drop any duplicate user names:
C <- unique(B)

#Get followers of a specific user
DaddyJew <- get_followers("DaddyJew", n = 75000, page = "-1", parse = TRUE, token = NULL)

I used to get the retweeters like this, then download that list along with the followers list and determine the percentage. (I used to find the percentage by exporting the data to Excel and using Excel functions.)

I want to do this automatically now.

mkearney commented 7 years ago

1. Search for tweets using the hashtag

library(rtweet)
library(dplyr)

## search for day of rage tweets
dor <- search_tweets("DayofRage", n = 18000)

2. Filter data to only include (a) tweets with 10 or more retweets posted by (b) users with fewer than 75k followers

## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, user_id, retweet count, followers count, and text columns
## (user_id is kept because step 4 looks up followers by user_id)
dat <- dor %>%
  users_data() %>%
  unique() %>%
  right_join(dor) %>%
  filter(!is_retweet) %>%
  dplyr::select(status_id, user_id, retweet_count, followers_count, text) %>%
  filter(retweet_count >= 10 & followers_count < 75000)

3. Find users who retweeted the filtered tweets.

## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)

4. Get followers for original tweet users

## remove duplicates for original tweeters
tweeters <- unique(dat$user_id)
## this is rate limited so you'll have to construct a for loop with a sleep call
flws <- lapply(tweeters[1:15], get_followers)

5. Get followers for retweeters

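## `data` is a list of data frames (one per query), so pull user_id from each
## and remove duplicates for retweeters
retweeters <- unique(unlist(lapply(data, "[[", "user_id")))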
## this is rate limited so you'll have to construct a for loop with a sleep call
reflws <- lapply(retweeters[1:15], get_followers)

mkearney commented 7 years ago

@hussainshehadeh my last post was posted before I looked at your most recent ones. After lunch I'll look at this again though!

hussainshehadeh commented 7 years ago

Okay the code may be solving my issue, but I am a beginner and finding it difficult to extract what I want. I don't know how to create the loop with a sleep call. Is it possible to show me how?

To make it a bit clearer, I just want 4 columns at the end showing the number of followers for the user, the number of retweets received, the percentage, and finally the tweet. (I want around 1,000 results, so I expect I will need the loop with a sleep call.)

I know the code above may be doing what I am asking for, but I do not get how to extract the results, and place them in 4 columns...

I am sorry if I am asking for much.

mkearney commented 7 years ago

I actually think you already have everything you need in number 1 and number 2 that I posted above.

You can create a new data frame with those four columns:

df <- data.frame( 
  user = dat$screen_name,
  retweets = dat$retweet_count,
  pct_of_followers = dat$retweet_count / dat$followers_count,
  tweet = dat$text,
  stringsAsFactors = FALSE
)

And save it to CSV / open it in Excel:

write.csv(df, "tweets_retweeters_percents.csv", row.names = FALSE)

hussainshehadeh commented 7 years ago

I am getting the following message when I run the df :

Error in data.frame(user = dat$screen_name, retweets = dat$retweet_count, :
  arguments imply differing number of rows: 0, 14, 1
In addition: Warning message:
Unknown or uninitialised column: 'screen_name'.

Moreover, I am not looking for the ratio of retweets to followers, but rather the percentage of people who retweeted the tweet and also follow the user. The way to get this (and how I used to do it manually) is to download the list of followers, download the list of retweeters, and look for the overlap, then divide the size of the overlap by the number of retweeters.

For example: User X who tweeted A, received 100 retweets. From these 100 retweets, 50 were following him, while the other 50 are not following him. So, 50% is the answer. (I am looking for the % of 1st degree followers, who retweeted the post.)
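
The overlap in this example can be sketched in a few lines of R. The ID vectors below are made up for illustration; in practice they would presumably come from the user_id columns of the retweeters data and the followers data:

```r
## made-up IDs: 100 retweeters, the first 50 of whom also follow user X
retweeter_ids <- as.character(1:100)
follower_ids  <- as.character(c(1:50, 1001:1500))

## drop duplicate retweeters, then flag those who are also followers
retweeter_ids <- unique(retweeter_ids)
is_follower   <- retweeter_ids %in% follower_ids

## percentage of retweeters who follow the original tweeter
pct_followers <- 100 * mean(is_follower)
pct_followers
```

With these toy vectors the result is 50, matching the example above.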

mkearney commented 7 years ago

Sorry, you need to make a quick adjustment. Here's no. 2 from above modified. The resulting output, dat, should have what you want

## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
  users_data() %>%
  unique() %>%
  right_join(dor) %>%
  filter(!is_retweet) %>%
  dplyr::select(screen_name, retweet_count, followers_count, text) %>%
  filter(retweet_count >= 10 & followers_count < 75000)
dat$pct_of_followers <- dat$retweet_count / dat$followers_count
dat

hussainshehadeh commented 7 years ago

@mkearney I edited my last post

hussainshehadeh commented 7 years ago

Your last code works fine, but as I mentioned, I am looking for a different percentage that requires the followers list of the tweeter, and the retweeters to find the overlap and calculate the percentage.

mkearney commented 7 years ago

It's a little bit complicated because the users could have up to 75,000 followers. A single API call can retrieve 5,000 followers. You get 15 of those every 15 minutes. If they all had fewer than 5,000 followers, you could do something like this:

## vector of users
users <- unique(dat$screen_name)

## initialize output object
flws <- vector("list", length(users))

## execute `i` loops (one for each obs in users) 
for (i in seq_along(users)) {
  ## get follows for user i
  flws[[i]] <- get_followers(users[i], n = 5000)
  ## add user variable
  flws[[i]]$user <- users[i]
  ## every 15th loop, sleep for 15 minutes (unless it's the final loop)
  if (i %% 15 == 0L && i < length(users)) {
    ## 60 seconds * 15 = 15 minutes
    Sys.sleep(60 * 15)
  }
}
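
For users above the 5,000 mark, the figures quoted above (5,000 followers per call, 15 calls per 15 minutes) give a rough idea of the cost. This is only back-of-the-envelope arithmetic with hypothetical follower counts, not something rtweet computes:

```r
## figures quoted above: 5,000 followers per call, 15 calls per 15 minutes
ids_per_call     <- 5000
calls_per_window <- 15
window_minutes   <- 15

## hypothetical follower counts for three users
followers_count <- c(1200, 12000, 74000)

## API calls needed per user, and in total
calls_needed <- ceiling(followers_count / ids_per_call)
total_calls  <- sum(calls_needed)

## number of 15-minute waits needed before the last batch of calls can run
waits <- max(ceiling(total_calls / calls_per_window) - 1, 0)
waits * window_minutes
```

Here the three users need 1, 3 and 15 calls, so 19 calls in total and at least one 15-minute wait.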

Ideally, you'd have control over the exact rate limit.

token <- get_tokens()
if (!inherits(token, "Token")) token <- token[[1]]

## rate limit
rl <- rate_limit(token, "followers/ids")

## number of API calls left
rl$remaining

## number of seconds until rate limit reset
as.numeric(rl$reset, "secs")
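
Those two values could drive the pause directly instead of a fixed 15-minute sleep. A minimal sketch, where seconds_to_wait() is a hypothetical helper (not an rtweet function) and the one-second padding is an arbitrary choice:

```r
## hypothetical helper: how long to pause before the next API call
## remaining:  calls left in the current window (e.g. rl$remaining)
## reset_secs: seconds until the window resets
##             (e.g. as.numeric(rl$reset, "secs"))
seconds_to_wait <- function(remaining, reset_secs) {
  ## calls still available: no need to wait
  if (remaining > 0) {
    return(0)
  }
  ## add one second of padding so the window has definitely reset
  max(reset_secs, 0) + 1
}

seconds_to_wait(remaining = 3, reset_secs = 420)
seconds_to_wait(remaining = 0, reset_secs = 420)
```

Inside the loop, something like Sys.sleep(seconds_to_wait(rl$remaining, as.numeric(rl$reset, "secs"))) would then replace the fixed sleep, assuming rl is refreshed before each check.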

mkearney commented 7 years ago

I know this doesn't lay it all out for you, but I'm confident you now have all the pieces. I'll hopefully add some support for getting all the followers for multiple users, but that's unlikely to happen in the next few days as I'm moving cities next week. If you have more questions I'll try to respond, but I probably won't be quite as responsive as I have been.

ropensci / rtweet

search_tweets functu #98
