Closed hussainshehadeh closed 7 years ago
I guess you want to use every cell in a specific column for searching. If so:
queryString <- paste0(unlist(Tweet2$queryStrings), collapse = " ")
Retweeters1 <- search_tweets(queryString , n=100)
The queryString is not giving me anything. It's empty. What's the solution?
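One possible cause, offered only as a guess since the actual data frame isn't shown: if Tweet2 has no column literally named queryStrings, then `$` returns NULL and paste0() collapses that to an empty string. A minimal sketch with a made-up data frame (the column name `query` is an assumption for illustration):

```r
## hypothetical data frame; the column is called "query", not "queryStrings"
Tweet2 <- data.frame(
  query = c("rstats", "data science"),
  stringsAsFactors = FALSE
)

## asking for a column that doesn't exist returns NULL ...
missing_col <- Tweet2$queryStrings
is.null(missing_col)

## ... and collapsing NULL gives an empty string
empty_query <- paste0(unlist(missing_col), collapse = " ")
nchar(empty_query)

## using the real column name gives one combined query string
good_query <- paste0(unlist(Tweet2$query), collapse = " ")
good_query
```

Checking names(Tweet2) first would show whether the column name is the problem.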
try this code. You'll get a data frame for each keyword:
keyword <- readLines("keywords.txt")
df.key <- paste("keyword_", 1:length(keyword), sep = "")
for (i in 1:length(keyword)) {
  d.frame <- search_tweets(keyword[i], n = 100)
  assign(df.key[i], d.frame)
  Sys.sleep(1)
}
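As an alternative to creating one variable per keyword with assign(), the results could be kept in a single named list. This is only a sketch: `fake_search()` is a hypothetical stand-in for search_tweets() so the example runs without API access, and the keywords are made up.

```r
## hypothetical stand-in for search_tweets(); returns a tiny data frame
fake_search <- function(q, n = 100) {
  data.frame(query = q, n_requested = n, stringsAsFactors = FALSE)
}

## made-up keywords (in the loop above these come from keywords.txt)
keyword <- c("rstats", "data science")

## one list element per keyword, named after the keyword
results <- lapply(keyword, fake_search, n = 100)
names(results) <- keyword

## look up a result by keyword
results[["rstats"]]
```

A list like this is easier to loop over or rbind later than a set of separately named objects.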
I think both of those answers above will work depending on the context. If you have a column of queries, you could also write a function like this to vectorize search_tweets():
#' search_tweets_queries
#'
#' @param x Vector of search queries.
#' @param n Number of tweets to return per query. Defaults to 100.
#' @param \dots Other arguments passed on to \code{search_tweets}.
#' @return A tbl data frame with additional "query" feature.
search_tweets_queries <- function(x, n = 100, ...) {
## check inputs
stopifnot(is.atomic(x), is.numeric(n))
if (length(x) == 0L) {
stop("No query found", call. = FALSE)
}
## search for each string in column of queries
rt <- lapply(x, search_tweets, n = n, ...)
## add query variable to data frames
rt <- Map(cbind, rt, query = x, stringsAsFactors = FALSE)
## merge users data into one data frame
rt_users <- do.call("rbind", lapply(rt, users_data))
## merge tweets data into one data frame
rt <- do.call("rbind", rt)
## set users attribute
attr(rt, "users") <- rt_users
## return tibble (validate = FALSE makes it a bit faster)
tibble::as_tibble(rt, validate = FALSE)
}
You could then pass a column of [multiple] queries to the function
## create data frame with query column
Tweet2 <- data.frame(
query = c("\"rstats\"", "\"data science\""),
n = rnorm(2),
stringsAsFactors = FALSE
)
## pass query column on to the new function defined above
rt <- search_tweets_queries(Tweet2$query)
Searching for tweets...
Finished collecting tweets!
Searching for tweets...
Finished collecting tweets!
## preview data
> rt
# A tibble: 200 x 39
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888478382916227077 2017-07-21 19:18:48 384227341 alkadrii
2 888478270106349568 2017-07-21 19:18:21 821976676842242050 DeborahTannon
3 888478148425273344 2017-07-21 19:17:52 781019469875412992 wittmaan1
4 888478118788435968 2017-07-21 19:17:45 65105528 fabianmmueller
5 888478116179562496 2017-07-21 19:17:44 1036896870 stwtseng
6 888477761446318080 2017-07-21 19:16:20 248192696 mmmgaber
7 888477509913911296 2017-07-21 19:15:20 3230388598 dataandme
8 888477439369904128 2017-07-21 19:15:03 144592995 Rbloggers
9 888477433912893440 2017-07-21 19:15:02 14993767 ramnarasimhan
10 888477233609834496 2017-07-21 19:14:14 136276078 bekinc
# ... with 190 more rows, and 35 more variables: text <chr>, source <chr>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>, query <chr>
All I have is 1 column and 1 row. I want to search for tweets using that text.
Ahh all you need then is
rt <- search_tweets(Tweet2[1, 1])
Or
rt <- search_tweets(Tweet2[[1]])
It searches, but I get empty results and this message:
Searching for tweets...
Finished collecting tweets!
Warning message:
In chars[!ok] <- unlist(lapply(chars[!ok], encode)) :
number of items to replace is not a multiple of replacement length
Either the value in your data frame isn't making a good query or, more likely, you've run into a bug in an older version. Try installing the latest version:
detach("package:rtweet")
if (!"devtools" %in% installed.packages()) {
install.packages("devtools")
}
devtools::install_github("mkearney/rtweet")
You may have to restart your session--sometimes it throws a fit when you install during a session that's already been using rtweet.
If you get the same message, then it'll probably be your data, but I bet the message goes away with an update.
I am getting this now: Error: is.atomic(q) is not TRUE
And what's the latest rtweet version? 0.4.8?
Yes, that's the most recent version.
Can you post your code? It's hard for me to figure out what exactly is going on with just the error message.
One potential problem would be if you have strings as factors:
> x <- data.frame(a = "\"rstats\"")
> search_tweets(x[1, 1])
Error in nchar(q) : 'nchar()' requires a character vector
> x[1, 1]
[1] "rstats"
Levels: "rstats"
> x <- data.frame(a = "\"rstats\"", stringsAsFactors = FALSE)
> x[1, 1]
[1] "\"rstats\""
> search_tweets(x[1, 1])
Searching for tweets...
Finished collecting tweets!
# A tibble: 100 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888533131707301888 2017-07-21 22:56:21 14639660 FatPlatypus
2 888533024463343618 2017-07-21 22:55:55 215035672 beeonaposy
3 888532553950310400 2017-07-21 22:54:03 13074042 juliasilge
4 888532189868118016 2017-07-21 22:52:36 1096058449 itatiVCS
5 888531807175491585 2017-07-21 22:51:05 2944647704 andrew_benesh
6 888530848399503360 2017-07-21 22:47:17 848804341880242176 indra_eko3
7 888530831819628544 2017-07-21 22:47:13 347261357 DaveRubal
8 888530779470520325 2017-07-21 22:47:00 69460360 YvesMessy
9 888530620506394625 2017-07-21 22:46:22 560431626 antuki13
10 888530334643625987 2017-07-21 22:45:14 1544327005 bigboardsio
# ... with 90 more rows, and 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <chr>, reply_to_user_id <chr>,
# reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
As for your specific error message, it says you're not supplying an atomic vector as your query, which means you're probably supplying the data frame. Here x is the data frame from above. It only comes back FALSE if it's left as a data frame.
> is.atomic(x[[1]])
[1] TRUE
> is.atomic(x[1, 1])
[1] TRUE
> is.atomic(x[1, ])
[1] TRUE
> is.atomic(x[, ])
[1] TRUE
> is.atomic(x)
[1] FALSE
Probably the safest thing to do would be to specify the column, the first observation, and use as.character().
## variable name. first obs. as character.
rt <- search_tweets(as.character(x[[varname]][1]))
## first column. first row. as.character.
rt <- search_tweets(as.character(x[[1]][1]))
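The same idea can be wrapped in a small helper that always hands back a single plain string. This is a sketch only: `as_query()` is a hypothetical name, not part of rtweet, and the data frames are made up.

```r
## hypothetical helper: pull one cell out of a data frame as a plain string
as_query <- function(df, col = 1, row = 1) {
  stopifnot(is.data.frame(df))
  q <- as.character(df[[col]][row])
  ## fail early on missing or empty cells instead of sending a bad query
  if (length(q) != 1L || is.na(q) || !nzchar(q)) {
    stop("No usable query found in that cell", call. = FALSE)
  }
  q
}

## works whether or not the column was stored as a factor
x_factor <- data.frame(a = "rstats", stringsAsFactors = TRUE)
x_char <- data.frame(a = "rstats", stringsAsFactors = FALSE)
as_query(x_factor)
as_query(x_char)
```

The result could then be passed along as search_tweets(as_query(x)).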
Here is the full code:
`
#Extracting retweeter details
A <- search_tweets("#dayofrage", n=1000)
#Extract tweets that do not include RT in the text
A11<- A[!grepl("RT @", A$text),]
#Take tweets that have at least 10 and fewer than 100 retweets
AAA<- subset(A11, retweet_count >= 10 & retweet_count < 100)
#Selected tweet with maximum number of retweets
ABB <- AAA[which.max(AAA$retweet_count),]
#Find the user who tweeted the tweet
User <- as.data.frame(ABB[4], drop=FALSE)
Tweet <- as.data.frame(ABB[5], drop=FALSE)
TweetPrint <- print(Tweet[1,1])
Tweet2 <- data.frame(TweetPrint)
rt <- search_tweets(Tweet2[1])
`
What I am trying to do is look for tweets with the hashtag #Dayofrage, then pick out the tweets that received between 10 and 100 retweets, and see if the tweeter has fewer than 75,000 followers. If so, I want the details of that tweet so I can search for the retweeters.
Here's what I would do. First, to get the data you're looking for (tweets using the #dayofrage hashtag from users with fewer than 75k followers).
library(rtweet)
library(dplyr)
## search for day of rage tweets
dor <- search_tweets("DayofRage", n = 18000)
## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
users_data() %>%
unique() %>%
right_join(dor) %>%
filter(!is_retweet) %>%
dplyr::select(status_id, retweet_count, followers_count, text)
## filter using both conditions
dat <- dat %>%
filter(retweet_count >= 10 & followers_count < 75000)
Soon, you'll be able to use statuses_retweets() (or something similarly named) to get data directly on the retweeters. But I have only added the skeleton of that new API call, so it's not quite ready yet.
For now, you can take part of the text of a tweet and use search_tweets() to get the retweet data. If it's a long tweet, or if it includes certain symbols, passing the entire text of a tweet to search_tweets() seems to return only the original tweet. This is mostly because retweets get truncated (adding the "RT: " at the beginning of the string can push retweets over the otherwise observed 140 character limit, so retweets are often truncated in the API calls). But using only part of the tweet seems to work great:
> ## text of first tweet
> dat$text[1]
[1] "A perfect depiction of what happened at the #DayOfRage protests - #Israel didnt want Muslims praying. FULL STOP \nhttps://t.co/TDlfb8mpwA https://t.co/bhKspzufsH"
>
> ## search for part of tweet
> rts <- search_tweets("A perfect depiction of what happened at the #DayOfRage protests")
Searching for tweets...
Finished collecting tweets!
> rts
# A tibble: 13 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888759971831136256 2017-07-22 13:57:44 297242407 178kakapo
2 888753150085263360 2017-07-22 13:30:37 789570670510403584 therkut1
3 888748221031477248 2017-07-22 13:11:02 781224213684387840 Lougris2
4 888746029952192512 2017-07-22 13:02:20 448086598 AnonymoonKheir
5 888745050649350148 2017-07-22 12:58:26 568828814 rico_hands
6 888738686581583872 2017-07-22 12:33:09 10903242 aliakcay
7 888734864396124161 2017-07-22 12:17:58 1527652524 ahmed_mekhamer
8 888734105197662208 2017-07-22 12:14:57 316113260 A7medHakami
9 888732239026159616 2017-07-22 12:07:32 1605509936 DrMAMMohamed
10 888730370522333184 2017-07-22 12:00:06 873317130 20tree9
11 888730293577826304 2017-07-22 11:59:48 723325134 moonicegang
12 888730262489747456 2017-07-22 11:59:41 3301703894 nazem239
13 888729925724844036 2017-07-22 11:58:20 81136269 MiddleEastMnt
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
Of course, the downside is that this last step, selecting which part of the tweet to use as the query, would be difficult to automate. Assuming the retweets() APIs get up and running soon, though, this shouldn't be a problem for much longer!
Okay, the top part looks perfect, and it's running perfectly. Is there a way I can then search for the retweeters? For example, extract the first 6 words from the tweet and add RT @...: Tweet. If this works, then it solves my whole issue.
I don't mind doing this for the whole list in dat
That's a good question. It may be enough to just take the first few words-- you can always inspect later to see if the text of the tweets is the same for all tweets.
The code below will automate the process. Searching for the first 8 or so words works (see data1). And, I wasn't sure it would work, but your idea of adding RT to the beginning of each string did as well (see data2)!
You can see the number of returned observations differs slightly. I'm guessing your RT approach is the way to go. In fact, I'm probably going to steal this idea and use it for other things as well :).
> x <- lapply(strsplit(dat$text, " "), "[", 1:8)
> x <- lapply(x, na.omit)
> x <- vapply(x, paste, collapse = " ", character(1))
> x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
> x <- sapply(x, URLencode, USE.NAMES = FALSE)
> ## first 8(ish) words of each tweet
> data1 <- lapply(x[1:10], search_tweets, verbose = FALSE)
> ## include explicit "RT:" at beginning
> data2 <- lapply(paste("RT: ", x[1:10]), search_tweets, verbose = FALSE)
> ## compare N of obs for each method
> sapply(data1, nrow)
[1] 10 16 9 10 15 13 13 100 28 25
> sapply(data2, nrow)
[1] 8 15 7 8 13 11 9 100 27 24
> ## preview RT method data
> data2
[[1]]
# A tibble: 8 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888789521445289984 2017-07-22 15:55:09 49804099 moniqueb54
2 888787439816515587 2017-07-22 15:46:53 3248144099 ArmingAttano
3 888781441785528320 2017-07-22 15:23:03 77715831 LauriersRoses
4 888778665743589376 2017-07-22 15:12:01 111381605 framboazz
5 888778614803705856 2017-07-22 15:11:49 2612041957 Mariableuee
6 888777605910069250 2017-07-22 15:07:48 848689021995896832 Tiberdanie93
7 888777300019474433 2017-07-22 15:06:35 3245309031 AidGarRmnc
8 888776293365559296 2017-07-22 15:02:35 2938109319 OPHIUSE
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[2]]
# A tibble: 15 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888779036117405696 2017-07-22 15:13:29 103537650 samanello
2 888770017378095104 2017-07-22 14:37:39 2267411377 SaMioN777
3 888769517287075840 2017-07-22 14:35:40 952494301 dnoblesolja
4 888764962889494529 2017-07-22 14:17:34 110065142 captain_amarito
5 888760219446235136 2017-07-22 13:58:43 62322497 ratcatcher2
6 888759097088122880 2017-07-22 13:54:15 2897026934 adi_marji
7 888758110118170625 2017-07-22 13:50:20 41994475 olajam
8 888758004820004864 2017-07-22 13:49:55 2959660574 hanaandidi
9 888757702981066753 2017-07-22 13:48:43 871766034214211584 MuslimFrom
10 888757425884418052 2017-07-22 13:47:37 782493620 wnwinjz
11 888757309777879041 2017-07-22 13:47:09 828618053462196224 NADALRAFAT1
12 888757129049460736 2017-07-22 13:46:26 958083241 LJ_Brodigan1
13 888757113094377474 2017-07-22 13:46:22 843492310352642048 KhalidM12094546
14 888756807732166657 2017-07-22 13:45:09 2580813590 ElaineNiddery
15 888756646700351488 2017-07-22 13:44:31 219336060 jairogarciacol
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[3]]
# A tibble: 7 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888769483120283648 2017-07-22 14:35:31 41722795 MoniMatos
2 888760938345713665 2017-07-22 14:01:34 537210019 zrbar
3 888755751069466625 2017-07-22 13:40:57 608510310 Rayisray42
4 888751302729240576 2017-07-22 13:23:17 1202276508 Palosypala
5 888750082304073728 2017-07-22 13:18:26 103681269 PalestineFacts
6 888750079430885376 2017-07-22 13:18:25 88183759 itv5
7 888749234345959425 2017-07-22 13:15:04 1551994398 JJohnexley46
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[4]]
# A tibble: 8 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888779979823210497 2017-07-22 15:17:14 2663970236 TheRedTherapist
2 888751797862629376 2017-07-22 13:25:15 722498830307102720 moqolof
3 888749815462580224 2017-07-22 13:17:22 430055091 M_Beaman
4 888749654824951808 2017-07-22 13:16:44 3020018765 BenjaminKweskin
5 888749504853397504 2017-07-22 13:16:08 883808086657552385 lukestevo91
6 888749244064124928 2017-07-22 13:15:06 1468290938 naz548
7 888746415824015360 2017-07-22 13:03:52 421918538 Chordiegurl
8 888746179370000384 2017-07-22 13:02:55 16568637 marcynewman
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[5]]
# A tibble: 13 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888786153196838913 2017-07-22 15:41:46 49766803 RLZaki
2 888778871184785413 2017-07-22 15:12:50 103537650 samanello
3 888759971831136256 2017-07-22 13:57:44 297242407 178kakapo
4 888748221031477248 2017-07-22 13:11:02 781224213684387840 Lougris2
5 888746029952192512 2017-07-22 13:02:20 448086598 AnonymoonKheir
6 888745050649350148 2017-07-22 12:58:26 568828814 rico_hands
7 888738686581583872 2017-07-22 12:33:09 10903242 aliakcay
8 888734864396124161 2017-07-22 12:17:58 1527652524 ahmed_mekhamer
9 888734105197662208 2017-07-22 12:14:57 316113260 A7medHakami
10 888732239026159616 2017-07-22 12:07:32 1605509936 DrMAMMohamed
11 888730370522333184 2017-07-22 12:00:06 873317130 20tree9
12 888730293577826304 2017-07-22 11:59:48 723325134 moonicegang
13 888730262489747456 2017-07-22 11:59:41 3301703894 nazem239
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[6]]
# A tibble: 11 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888752844492472321 2017-07-22 13:29:25 2267411377 SaMioN777
2 888750364597657600 2017-07-22 13:19:33 2738627253 UzairVedachhia
3 888745652481642496 2017-07-22 13:00:50 169626752 MushyMelbowHead
4 888745072824659970 2017-07-22 12:58:32 568828814 rico_hands
5 888730702581084160 2017-07-22 12:01:25 873317130 20tree9
6 888728692402335744 2017-07-22 11:53:26 473851713 ButcherMartin
7 888728091916369921 2017-07-22 11:51:03 250610814 MunaAmina
8 888727617406468097 2017-07-22 11:49:10 723325134 moonicegang
9 888725176183136256 2017-07-22 11:39:28 2725537397 saifi_inam
10 888724023168860160 2017-07-22 11:34:53 712779203863818241 madeniexdan1
11 888723387517894660 2017-07-22 11:32:21 2717291006 NahimShah1
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[7]]
# A tibble: 9 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888750768802504705 2017-07-22 13:21:10 280391708 1ncamera
2 888746347985285120 2017-07-22 13:03:36 745259762196037632 vincemalumbono2
3 888745085512364032 2017-07-22 12:58:35 568828814 rico_hands
4 888721779988058113 2017-07-22 11:25:58 235893581 Palkomitee
5 888720487903318019 2017-07-22 11:20:50 2964861784 SlimaniKheira
6 888718429661913089 2017-07-22 11:12:39 44657691 Nazeera_L
7 888718277106708481 2017-07-22 11:12:03 219630174 khosseni
8 888717687672713220 2017-07-22 11:09:42 347951637 musulmanfrance
9 888717001333633024 2017-07-22 11:06:59 843492310352642048 KhalidM12094546
# ... with 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[8]]
# A tibble: 100 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888798186411663360 2017-07-22 16:29:35 404004591 biodivan
2 888787611438985216 2017-07-22 15:47:34 81850406 jibran11
3 888784441182756864 2017-07-22 15:34:58 49766803 RLZaki
4 888782769974681601 2017-07-22 15:28:19 364082745 iizajun
5 888782701284741121 2017-07-22 15:28:03 730931066 Haych_K
6 888781750159249409 2017-07-22 15:24:16 1599580136 Arwa_A_11
7 888778809562017793 2017-07-22 15:12:35 409762432 frawlaqmor
8 888776520298143745 2017-07-22 15:03:29 766394253706686464 matt_ogdie
9 888776075232333825 2017-07-22 15:01:43 2364544256 AmirbehnamM
10 888776059851866112 2017-07-22 15:01:39 3938910364 nshsfgl
# ... with 90 more rows, and 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[9]]
# A tibble: 27 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888769581094805505 2017-07-22 14:35:55 876374886 fahume
2 888768246975000576 2017-07-22 14:30:37 1901643534 IamaCamera2
3 888768235608317952 2017-07-22 14:30:34 812884772431151104 rrredRaiN
4 888760019491147776 2017-07-22 13:57:55 1041879914 jamesjmcmenamin
5 888757861735583744 2017-07-22 13:49:21 3219202662 nazonorekisi
6 888749527318093824 2017-07-22 13:16:14 545992640 garryowen1888
7 888748498585346048 2017-07-22 13:12:08 325110024 tosh_allan
8 888745942693941248 2017-07-22 13:01:59 123771138 khussh_
9 888739321154383876 2017-07-22 12:35:40 710862930 fabricedelamort
10 888738891217457152 2017-07-22 12:33:58 2677180819 barbarosa69
# ... with 17 more rows, and 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
[[10]]
# A tibble: 24 x 38
status_id created_at user_id screen_name
<chr> <dttm> <chr> <chr>
1 888790484633649152 2017-07-22 15:58:59 1287447402 Cybrarian64
2 888783833612652544 2017-07-22 15:32:33 1599580136 Arwa_A_11
3 888779237431197698 2017-07-22 15:14:17 737608020 CaesarZaccaria
4 888765299075543044 2017-07-22 14:18:54 22644834 Bogsideandproud
5 888762932540178433 2017-07-22 14:09:30 143593029 OceanadeSilva
6 888747364562989056 2017-07-22 13:07:38 837484490 susannefoort
7 888745007783399424 2017-07-22 12:58:16 568828814 rico_hands
8 888741615086575617 2017-07-22 12:44:47 842888141673394178 AquaDeAce
9 888730346224664577 2017-07-22 12:00:01 873317130 20tree9
10 888728460616773634 2017-07-22 11:52:31 839903300250185729 Nordin75V
# ... with 14 more rows, and 34 more variables: text <chr>, source <chr>,
# reply_to_status_id <lgl>, reply_to_user_id <lgl>,
# reply_to_screen_name <lgl>, is_quote <lgl>, is_retweet <lgl>,
# favorite_count <int>, retweet_count <int>, hashtags <list>, symbols <list>,
# media_url <list>, media_t.co <list>, media_expanded_url <list>,
# media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
# ext_media_expanded_url <list>, ext_media_type <lgl>,
# mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
# quoted_status_id <chr>, quoted_text <chr>, retweet_status_id <chr>,
# retweet_text <chr>, place_url <chr>, place_type <chr>, place_name <chr>,
# country <chr>, country_code <chr>, geo_coords <list>, coords_coords <list>,
# bbox_coords <list>
Btw, you may want to set n higher just in case.
df <- lapply(x, search_tweets, verbose = FALSE, n = 1000)
Or whatever number you set it at, just follow up with more searches for anything that hits the cap.
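One way to spot which queries hit the cap is to compare the row count of each result against n. A rough sketch follows; `hit_cap()` is a hypothetical helper, and the data frames are made-up stand-ins for what lapply(x, search_tweets, n = 1000) would return.

```r
## hypothetical helper: which results returned at least `n` rows?
hit_cap <- function(results, n) {
  which(vapply(results, nrow, integer(1)) >= n)
}

## made-up stand-ins for the list of search results
results <- list(
  data.frame(status_id = as.character(1:3)),
  data.frame(status_id = as.character(1:1000)),
  data.frame(status_id = as.character(1:12))
)

## positions of queries worth re-running with a larger n
capped <- hit_cap(results, n = 1000)
capped
```

The positions returned index into the query vector, so x[capped] would give the queries to search again.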
I am a bit confused now. Okay let me explain exactly what I am trying to get:
The current code is searching for hashtag, it finds the right tweets, it finds the right users, but how can I use it to find the retweeters, then download the list of followers of the tweeter? The rest is manageable.
Let me show you how I used to carry out this process but manually:
library(rtweet)
#Extracting retweeter details
A <- search_tweets("RT @DaddyJew: I'm not battling depression.", n=1000)
#Extract the second column from A. Which is the unique user IDs.
B <- as.data.frame(A[2], drop=FALSE)
# Drop any duplicate user names:
C <- unique(B)
#Get followers of a specific user
DaddyJew <- get_followers("DaddyJew", n = 75000, page = "-1", parse = TRUE, token = NULL)
I used to get the retweeters like this, download that list along with the followers list, and determine the %. (I used to find the percentage by exporting the data to Excel and using Excel functions.)
I want to do this automatically now.
library(rtweet)
library(dplyr)
## search for day of rage tweets
dor <- search_tweets("DayofRage", n = 18000)
## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
users_data() %>%
unique() %>%
right_join(dor) %>%
filter(!is_retweet) %>%
dplyr::select(status_id, retweet_count, followers_count, text) %>%
filter(retweet_count >= 10 & followers_count < 75000)
## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)
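## remove duplicates for original tweeters
## (dat has no user_id column, so look it up in dor via status_id)
tweeters <- unique(dor$user_id[match(dat$status_id, dor$status_id)])
## this is rate limited so you'll have to construct a for loop with a sleep call
flws <- lapply(tweeters[1:15], get_followers)
## remove duplicates for retweeters
## (data is a list of data frames, so collect user_id from each element)
retweeters <- unique(unlist(lapply(data, function(d) d$user_id)))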
## this is rate limited so you'll have to construct a for loop with a sleep call
reflws <- lapply(retweeters[1:15], get_followers)
@hussainshehadeh my last post was posted before I looked at your most recent ones. After lunch I'll look at this again though!
Okay the code may be solving my issue, but I am a beginner and finding it difficult to extract what I want. I don't know how to create the loop with a sleep call. Is it possible to show me how?
To make it a bit clearer, I just want to end up with 4 columns that show the number of followers for the user, the number of retweets received, the percentage, and finally the tweet. (I want about 1,000 results, so I expect I need to add the loop with the sleep call.)
I know the code above may be doing what I am asking for, but I do not get how to extract the results, and place them in 4 columns...
I am sorry if I am asking for much.
I actually think you already have everything you need in number 1 and number 2 that I posted above.
You can create a new data frame with those four columns:
df <- data.frame(
user = dat$screen_name,
retweets = dat$retweet_count,
pct_of_followers = dat$retweet_count / dat$followers_count,
tweet = dat$text,
stringsAsFactors = FALSE
)
And save it to CSV / open it in excel
write.csv(df, "tweets_retweeters_percents.csv", row.names = FALSE)
I am getting the following message when I run the df code:
Error in data.frame(user = dat$screen_name, retweets = dat$retweet_count, :
  arguments imply differing number of rows: 0, 14, 1
In addition: Warning message:
Unknown or uninitialised column: 'screen_name'.
Moreover, I am not looking for the percentage of retweets/followers, but rather the percentage of people who retweeted the tweet and are also following the user. The way to achieve this (and how I used to gather the % manually) is to download the list of followers, download the list of retweeters, and look for the overlap. Then divide the size of the overlap by the number of retweets to get the %.
For example: User X, who tweeted A, received 100 retweets. Of these 100 retweeters, 50 were following him, while the other 50 were not. So, 50% is the answer. (I am looking for the % of 1st degree followers who retweeted the post.)
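The overlap calculation described here can be sketched in a few lines of R once both ID lists are in hand. The helper name and the IDs below are made up for illustration; in practice the two vectors would come from the retweet search results and from get_followers().

```r
## hypothetical helper: share of retweeters who also follow the author
pct_followers_among_retweeters <- function(retweeter_ids, follower_ids) {
  ## count each retweeter once; compare as character to avoid type mismatches
  retweeter_ids <- unique(as.character(retweeter_ids))
  if (length(retweeter_ids) == 0L) return(NA_real_)
  100 * mean(retweeter_ids %in% as.character(follower_ids))
}

## made-up IDs: 4 retweeters, 2 of whom appear in the follower list
retweeters <- c("11", "22", "33", "44")
followers <- c("22", "44", "55", "66", "77")
pct <- pct_followers_among_retweeters(retweeters, followers)
pct
```

Note that the result is only as complete as the two lists: if either the retweeters or the followers were truncated by a cap or rate limit, the percentage will be off.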
Sorry, you need to make a quick adjustment. Here's no. 2 from above modified. The resulting output, dat, should have what you want
## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
users_data() %>%
unique() %>%
right_join(dor) %>%
filter(!is_retweet) %>%
dplyr::select(screen_name, retweet_count, followers_count, text) %>%
filter(retweet_count >= 10 & followers_count < 75000)
dat$pct_of_followers <- dat$retweet_count / dat$followers_count
dat
@mkearney I edited my last post
Your last code works fine, but as I mentioned, I am looking for a different percentage that requires the followers list of the tweeter, and the retweeters to find the overlap and calculate the percentage.
It's a little bit complicated because the users could have up to 75,000 followers. A single API call can retrieve 5,000 followers. You get 15 of those every 15 minutes. If they all had fewer than 5,000 followers, you could do something like this:
## vector of users
users <- unique(dat$screen_name)
## initialize output object
flws <- vector("list", length(users))
## execute `i` loops (one for each obs in users)
for (i in seq_along(users)) {
## get follows for user i
flws[[i]] <- get_followers(users[i], n = 5000)
## add user variable
flws[[i]]$user <- users[i]
## every 15th loop, sleep for 15 minutes (unless it's the final loop)
if (i %% 15 == 0L && i < length(users)) {
## 60 seconds * 15 = 15 minutes
Sys.sleep(60 * 15)
}
}
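For users with more than 5,000 followers, the figures quoted above (5,000 followers per call, 15 calls per 15 minutes) give a rough way to budget the total time. The sketch below is just that arithmetic; `calls_needed()` is a hypothetical helper and the follower counts are made up.

```r
## assumptions taken from the text above: 5,000 followers per call,
## 15 calls per 15-minute window
calls_needed <- function(followers_count, per_call = 5000) {
  ## every user costs at least one call, even with zero followers
  pmax(1, ceiling(followers_count / per_call))
}

## made-up follower counts for three users
followers_count <- c(1200, 5000, 74000)
calls <- calls_needed(followers_count)
calls

## total calls and the number of 15-minute windows they span
total_calls <- sum(calls)
windows <- ceiling(total_calls / 15)
c(total_calls = total_calls, windows = windows)
```

So a single user near the 75,000 limit uses up an entire 15-minute window on their own.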
Ideally, you'd have control over the exact rate limit.
token <- get_tokens()
if (!inherits(token, "Token")) token <- token[[1]]
## rate limit
rl <- rate_limit(token, "followers/ids")
## number of API calls left
rl$remaining
## number of seconds until rate limit reset
as.numeric(rl$reset, "secs")
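Those two numbers could feed a small helper that decides how long to pause before the next call. This is a sketch under stated assumptions: `seconds_to_wait()` is a hypothetical function, and the values passed to it below are made up rather than read from rate_limit().

```r
## hypothetical helper: seconds to pause before the next call
## remaining:  calls left in the current window (e.g. rl$remaining)
## reset_secs: seconds until the window resets (e.g. as.numeric(rl$reset, "secs"))
seconds_to_wait <- function(remaining, reset_secs) {
  ## calls still available: no need to pause
  if (remaining > 0) return(0)
  ## one-second buffer so we don't call right at the reset boundary
  max(0, reset_secs) + 1
}

## made-up values standing in for a real rate limit check
seconds_to_wait(remaining = 3, reset_secs = 420)
seconds_to_wait(remaining = 0, reset_secs = 420)
```

Inside a loop, the result would go to Sys.sleep() before each get_followers() call, replacing the fixed "every 15th loop" pause.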
I know this doesn't lay it all out for you, but I'm confident you now have all the pieces. I'll hopefully add some support for getting all the followers for multiple users, but that's unlikely to happen in the next few days as I'm moving cities next week. If you have more questions I'll try to respond, but I probably won't be quite as responsive as I have been.
I am trying to use the function search_tweets, but instead of manually inputting the query, I have it in a data frame. Is there a way the function can read the text from the data frame?
Here is my code
Retweeters1 <- search_tweets(Tweet2, n=100)
where Tweet2 is a data frame.