:wave: @bretsw @jrosen48! Any update since your meeting? Thank you!
@maelle, we're moving forward slowly but surely. @jrosen48 and I have scheduled regular meetings to work on the updates together.
Thank you for the updates! Do not hesitate to ask for help if needed.
:wave: @bretsw @jrosen48! Any new update on your progress?
We're getting there! Fixed all the VCR and testing issues and have updated quite a bit of the requested explanations. Still a little bit to do, but we're making significant progress.
Awesome, thanks for the update!
@maelle, thank you for continuing to check in, and thank you for bearing with us!
:wave: @bretsw @jrosen48! Any new update on your progress? :smile_cat:
@maelle, we are so close to being done!!! All the code has been updated. We just have a few more places to refine explanations in the introductory text. I'm hoping to submit our response to reviewers in the next couple of days.
@maelle, @llrs, and @marionlouveaux, thank you for your patience. Despite our silence, @jrosen48 and I have been hard at work this fall making the requested updates. We are ready for you to take a look, whenever it is convenient for you all (acknowledging that it has taken us a very long time on our end). Thank you for your comments and suggestions; {tidytags} is a much better package now because of them.
I'll leave our complete response in the next comment.
Thank you for this feedback. In response, we have noted that the first three key tasks (previously pain points) are required, while the fourth key task is optional and required only for geocoding. We are amenable to adding titles and an index; can you help us to understand what you mean by these? We think there is a title for the “Getting started with tidytags” vignette and there are also section headings for the vignette.
Thank you for pointing out this oversight. We mistakenly did not list all four of the key tasks (previously pain points), instead listing only three. We have corrected this in the vignette. Also, as noted here and in our response to the above feedback, we have renamed the "pain points" to key tasks to more accurately describe their nature.
However, most of the code chunks of the vignette are not run (as reported by BiocCheck):
Thank you for pointing this out. We acknowledge this but note that this is a function of the nature of this vignette on getting started with the package. We also note that there are extensive tests that ensure that the functionality of the package is sound, and so did not yet make any changes in response to this feedback.
The vignette does compile on real data; it has just been precomputed following this guide: https://ropensci.org/blog/2019/12/08/precompute-vignettes/. This means that we do not have to wait for a lengthy compilation each time we rebuild the website.
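For reference, a minimal sketch of the precompute pattern from that guide (the file names here are illustrative, not necessarily the ones in our repo):

```r
# Knit the source vignette (which hits the live APIs) once, locally;
# the committed .Rmd then builds instantly because every chunk has
# already been evaluated
knitr::knit(
  "vignettes/tidytags-with-conf-hashtags.Rmd.orig",
  output = "vignettes/tidytags-with-conf-hashtags.Rmd"
)
```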
Thank you for this feedback. Indeed, the process for obtaining a key for the Google Sheets API had changed. We have a) updated this section of the Getting Started vignette and b) added screenshots to facilitate the process of creating and storing (within R/RStudio) this key.
Thank you for pointing this out. As noted above, we have labeled this as optional and required only for geocoding.
Thank you for pointing this out. This is likely because some users have deleted their accounts or tweets in the period in between us creating this vignette and your access to the data. We have added the following note to the vignette:
“Note that this number of rows is at the point at which we collected the data; the number may differ when you carry out this same search.”
example_domains <- get_url_domain(example_urls)
Thank you for pointing this out. We have added this note to the vignette: “Please note that these two packages are required to use this function (but they are not installed automatically when {tidytags} is installed).”
The reviewer suggested checking with `rlang::is_installed("longurl")` or `requireNamespace("longurl", quietly = TRUE)`.
Thank you for this suggestion. We have made an addition above, where the function requiring {longurl} is used, to test whether it can be loaded (and to share with users how to install the package if it is not).
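A minimal sketch of such a guard, assuming it sits at the top of get_url_domain() (the message wording and function body here are illustrative, not the package's actual implementation):

```r
get_url_domain <- function(urls) {
  # {longurl} and {urltools} are suggested, not imported, so check first
  if (!requireNamespace("longurl", quietly = TRUE) ||
      !requireNamespace("urltools", quietly = TRUE)) {
    stop("Please install the {longurl} and {urltools} packages ",
         "to use this function.", call. = FALSE)
  }
  expanded <- longurl::expand_urls(urls)   # resolve shortened URLs
  urltools::domain(expanded$expanded_url)  # extract the domain
}
```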
Thank you for pointing this out. We should have clarified that our use of Google Sheets is unidirectional: we use TAGS to access data and process the data in R, but we do not plan for users of the package to update the TAGS sheet with the processed data. We added a note to the vignette describing this and noting that users could push data to the Google Sheet or store it as a CSV (among other options).
`add_users_data`, I think it is not needed.
Thank you for pointing this out; we have removed the name of the function from the description.
We have added a brief comment to the documentation for the following functions: read_tags(), pull_tweet_data(), lookup_many_tweets(), geocode_tags(), get_upstream_tweets(), add_users_data()
This is the comment we added:
Thank you for pointing this out. We have added a BugReports field to the DESCRIPTION.
Thank you for pointing this out. We have made several changes to address this comment. First, we have changed some of the language we use to justify the need for the package so as not to overstate the degree to which the package is accessible to novices (though, as we note next, we think the package is fundamentally accessible to relative newcomers to R). Second, we have added extensive documentation (many of these additions in response to reviewer feedback), especially in the ‘Getting started with tidytags’ vignette as well as in the paper accompanying the package. Lastly, while we have made the two aforementioned changes and thank the reviewers for these suggestions that have improved our package, we still think the package can be used by those relatively new to R in a way that is easier than other extant approaches for accessing Twitter data.
We have added DOI URLs for each reference in our .bib file.
"
on the yaml heading of paper.md that prevented viewing the paper.Thank you for catching this typo. We have corrected it.
The reviewer reported that `domain4` was not equal to "npr.org": in the browser they were asked for cookie consent, while running locally, outside testthat or vcr, returned the URL choice.npr.org.
This is a persistent bug in {vcr} that turned out to be tricky to resolve (see https://github.com/ropensci/vcr/issues/220). We have removed the two skipped tests that depended on this.
We have updated all tests, and all are currently passing locally and in GitHub CI.
We have not seen an error like this on our end.
This is specifically related to the error in 1.16, a persistent bug in {vcr} that turned out to be tricky to resolve (see https://github.com/ropensci/vcr/issues/220). We have removed the two skipped tests that depended on this.
We have added your name and ORCID to the DESCRIPTION file (https://github.com/llrs).
At the time of our revisions (November 2021), {rtweet} has not yet been updated; it is still version 0.7.0.
Please see our response to comment 1.9.
From a more technical point of view, I have some comments about the code and the package:
Thank you for catching this. We have corrected all lines longer than 80 characters and checked with goodpractice::goodpractice().
We have removed gargle from the Imports list. The gargle package is mentioned in the “Getting started with tidytags” vignette, but the package is not called anywhere in tidytags. However, because readr is used extensively throughout the testthat scripts, we have moved readr to the Suggests list.
The `get_char_tweet_ids` function could be improved to take only one argument: if it is a data.frame, extract the status_id and get the ID via id_str; if it is a URL, you can just extract the trailing numbers with `gsub("https?\\://twitter.com\\/.+/statuses/", "", df$status_url)`, with no need to modify the data.frame and then extract the vector again.
Thank you for this suggestion. We have updated the get_char_tweet_ids() function and shortened the code significantly.
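For reference, a sketch of the simplified function along these lines (the shipped implementation, and the exact ID column name, may differ):

```r
get_char_tweet_ids <- function(x) {
  if (is.data.frame(x)) {
    # TAGS data: the status ID is already a column; return it as character
    as.character(x$id_str)
  } else {
    # URL input: everything after ".../statuses/" is the ID
    gsub("https?\\://twitter.com\\/.+/statuses/", "", x)
  }
}
```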
In `process_tweets`, you can simplify the is_self_reply computation to `ifelse(.data$is_reply & .data$user_id == .data$reply_to_user_id, TRUE, FALSE)`.
Thank you for this suggestion. We have updated the function accordingly.
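Sketched inside a dplyr pipeline (column names as in the reviewer's comment; tweet_df is a hypothetical processed-tweets data frame), the updated computation looks like this. Note that the ifelse() wrapper can even be dropped, since the comparison already returns a logical:

```r
library(dplyr)

# tweet_df is a hypothetical data frame of processed tweets
tweet_df <- mutate(
  tweet_df,
  is_self_reply = .data$is_reply & .data$user_id == .data$reply_to_user_id
)
```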
For `get_upstream_replies`, the examples are not informative, as there are no replies to get data from in the example dataset. You make multiple calls to `pull_tweet_data`, some of which might be unnecessary. `process_tweets` could be called just once at the end instead of multiple times, on each loop run; this should speed up the process. Also, if at most 90,000 tweets are taken from each run, then you can estimate the number of iterations needed and inform the user, which might make the wait easier. Perhaps it would be better to use `lookup_many_tweets`, as it does a similar process. However, users might hit the rate limit, and I don't see any information being passed to the user regarding this.
Thank you for pointing out this oversight. We have updated the example for get_upstream_replies() to include replies in the example dataset. We also renamed this function to get_upstream_tweets() to more accurately describe the returned data (i.e., not all the tweets retrieved from “upstream” in the reply thread will themselves be replies; some will just be regular tweets).
We also changed the function so that process_tweets() is not called at all.
To make the code more readable, we created an internal function flag_unknown_upstream() and added tests.
Finally, we added a note in the function documentation that it is impossible to know in advance how far “upstream” you can trace back a reply thread, and that running get_upstream_tweets() might take a while.
`create_edgelist` calls `process_tweets` and also `get_replies`, `get_retweets`, `get_quotes`, and `get_mentions`, which call `process_tweets` too. Perhaps some internal functions could be created to avoid calling process_tweets multiple times on the same data.
We have changed the names of these functions so that they do not mask functions imported from {rtweet}. For instance, get_replies() is now create_replies_edgelist().
Rather than creating an internal function, we simply removed the redundant process_tweets() call that occurred within create_edgelist(). We do need to keep process_tweets() within the create_*() functions because they can serve as standalone functions, not just internal helpers for the more expansive create_edgelist().
In addition, we have updated create_edgelist() altogether to accept an input parameter `type`. See comments 2.9 and 2.47 for more details.
Thank you for pointing this out. We have made several revisions to the Overview section of the README to better introduce the roles of TAGS, rtweet, and opencage. The section now reads as follows:
The purpose of {tidytags} is to make the collection of Twitter data more accessible and robust. {tidytags} retrieves tweet data collected by a Twitter Archiving Google Sheet (TAGS), gets additional metadata from Twitter via the {rtweet} package and from OpenCage using the {opencage} package, and provides additional functions to facilitate systematic yet flexible analyses of data from Twitter. TAGS is based on Google spreadsheets: a TAGS tracker continuously collects tweets from Twitter, based on predefined search criteria and collection frequency.
In short, {tidytags} first uses TAGS to easily collect tweet ID numbers and then uses the R package {rtweet} to re-query the Twitter API to collect additional metadata.
{tidytags} also introduces functions developed to facilitate systematic yet flexible analyses of data from Twitter. It also interfaces with several other packages, including the opencage package, to geocode the locations of Twitter users based on their biographies.
Two vignettes illustrate the setup and use of the package:
Thank you for catching this; we have added examples to lookup_many_tweets(). Note that we have not included an example of a dataset of more than 90,000 tweets because this would be prohibitively large.
Regarding the missing `BugReports` field: thank you, we have added this to the DESCRIPTION.
See our response to comments 1.16-1.19. We have updated all tests and found that all are passing for us locally and in GitHub CI.
We have added your name and ORCID to the DESCRIPTION file (https://github.com/marionlouveaux).
Thank you for this feedback. We received similar feedback about the setup from the other reviewer, as well, and have made some changes to the setup, namely, emphasizing what is strictly necessary to do and what is optional for particular functionality (see Reviewer 1, Comment and Response 1). Also, as noted in a response to a comment below (18), we have added a checklist to facilitate the process of getting started with the package.
Thank you for pointing this out. We have added information from the paper to the Overview section of the README (as also noted in response to comment 1).
We have removed these four specialized functions. Instead, we created a new function filter_by_tweet_type() with an input parameter `type` that can be set to "reply", "retweet", "quote", or "mention". We then updated create_edgelist() to call filter_by_tweet_type(), meaning that create_edgelist() also accepts an input parameter `type` with those same options, but `type` defaults to “all”.
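Illustrative usage of the reworked API described above (processed_tweets is a hypothetical data frame of tweets):

```r
# Filter a tweet dataset down to one interaction type
replies <- filter_by_tweet_type(processed_tweets, type = "reply")

# Build an edgelist for one type, or for all types at once
reply_edges <- create_edgelist(processed_tweets, type = "reply")
all_edges   <- create_edgelist(processed_tweets)  # type defaults to "all"
```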
See our response to comments 1.16-1.19 and 2.4. We have updated all tests and found that all are passing for us locally and in GitHub CI.
Local installation took several minutes (approx. 3 to 5 minutes) because there are many dependencies. On my machine, it had to install 35 packages.
3 dependencies: ‘cli’, ‘lubridate’, ‘pillar’
Thank you for this suggestion; we have made this addition.
See our response to comments 1.16-1.19, 2.4, and 2.10. We have updated all tests and found that all are passing for us locally and in GitHub CI.
An HTTP request has been made that vcr does not know how to handle: POST https://api.twitter.com/1.1/statuses/lookup.json?id=...............
See our response to comments 1.16-1.19, 2.4, 2.10, and 2.14. We have updated all tests and found that all are passing for us locally and in GitHub CI.
We are not sure why this request is throwing an error for you. As a note, the string following “lookup.json?id=...” contains the tweet IDs, not API keys.
Test files with long code lines
Thank you for checking this. We have corrected all long code lines. (See comment 1.24)
We have updated the subtitle of the package to be “Importing and Analyzing Twitter Data Collected with Twitter Archiving Google Sheets”.
Thank you for these suggestions. We have made extensive changes to the ‘Overview’ section of the README in response to this comment and other comments requesting additional detail about the package. Also, we note these changes in our response to comment 1 (above).
"To use tidytags at its full capacity, you should have the following things set up:
Thank you for this suggestion. We have added this to the README:
To use tidytags at its full capacity, you should have the following things set up:
Thank you for catching this typo. We have corrected it.
Thank you for catching these typos. We have corrected them.
We have added a brief demonstration of the two core tidytags functions, read_tags() and pull_tweet_data(). Beyond these two functions, we point users to our detailed vignettes.
It also encourages adding a paragraph about how to cite the package, which is also missing. I tried citation(package = "tidytags").
Thank you for catching this. We have added a properly formatted CITATION file.
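For illustration, a minimal inst/CITATION along these lines (field values are hypothetical, not the file as shipped):

```r
citHeader("To cite tidytags in publications use:")

citEntry(
  entry  = "Manual",
  title  = "tidytags: Importing and Analyzing Twitter Data Collected with Twitter Archiving Google Sheets",
  author = c(person("K. Bret", "Staudt Willet"),
             person("Joshua M.", "Rosenberg")),
  year   = "2022",
  url    = "https://github.com/ropensci/tidytags",
  textVersion = paste(
    "Staudt Willet, K. B., & Rosenberg, J. M. (2022). tidytags:",
    "Importing and analyzing Twitter data collected with Twitter",
    "Archiving Google Sheets."
  )
)
```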
It gives a warning because there is no Date field in the DESCRIPTION.
We are likely just going to have to live with this warning for now, until the CRAN version of tidytags is released. I looked into this extensively, but this comment seems to sum up the prevailing sentiment on not dealing with this warning: https://github.com/r-lib/usethis/issues/806
Finally, the rOpenSci development guide also suggests organizing the badges on the README in a table when you have many badges, which, in my opinion, is the case here.
We have reorganized the badges into a table, following the example of the {drake} package (https://docs.ropensci.org/drake/).
Thank you for catching this typo. We have corrected it.
Thank you, we have added this sentence under the “Usage” header in the README.
Thank you for this suggestion. We have updated our language to match your suggestion.
We addressed similar concerns in response to comment 1.25. We have removed {covr}, {gargle}, {roxygen2}, {tidyverse}, {usethis}, and {webmockr} from DESCRIPTION.
Thank you for catching these typos. We have corrected them.
We addressed a similar concern in our response to comment 1.29. We have changed the names of these functions so that they do not mask functions imported from {rtweet}. For instance, get_replies() is now create_replies_edgelist().
Typo in “See Also Compare to other tidtags functions such as get_replies(), get_retweets(), get_quotes(), and create_edgelist().”
Thank you, we have corrected this typo.
Thank you for catching this. We have expanded the imported example to the full dataset, not just 10 rows. These examples now return edgelists with content.
We addressed a similar concern in our responses to comments 1.29 and 2.27. We have changed the names of these functions so that they do not mask functions imported from {rtweet}. For instance, get_replies() is now create_replies_edgelist().
See our comments on 1.28. We have removed the call to process_tweets() in order to speed up the function.
Thank you for catching this; we have added examples to lookup_many_tweets(). Note that we have not included an example of a dataset of more than 90,000 tweets because this would be prohibitively large.
Documentation of pull_tweet_data: I would add an intermediate line to avoid repetition of code in the examples like the example below:
example_url <- "18clYlQeJOc6W5QRuSlJ6_v3snqKJImFhU42bRkM_OX8"
tags_content <- read_tags(example_url)
pull_tweet_data(tags_content[1:10, ])
Thank you for this suggestion for cleaning up the code in this example. We have made the suggested change.
We have added comments to and simplified the pull_tweet_data() examples.
We left a comment in the example explaining why sometimes fewer rows than expected are returned: “Specifying the parameter n clarifies how many statuses to look up, but the returned values may be less than n because some statuses may have been deleted or made protected since the TAGS tracker originally recorded them.”
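For example (reusing the example TAGS sheet ID from above):

```r
example_url <- "18clYlQeJOc6W5QRuSlJ6_v3snqKJImFhU42bRkM_OX8"
tags_content <- read_tags(example_url)

# Ask for 10 statuses; fewer rows may come back if some statuses have
# since been deleted or made protected
pull_tweet_data(tags_content, n = 10)
```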
Thank you for pointing out the distinction between “tweet” and “status”. We do want to follow the language of the Twitter API and {rtweet}, so we have updated our language to use “statuses” rather than “tweets” in most cases.
Thank you for this suggestion. We have bolded key terms in both vignettes.
Thank you. We have added this to the list in this paragraph and have also added this checklist to the README.
We have made this change.
We have made this change.
We apologize for this confusion; as noted in our response to Reviewer #1, the Google Sheets authentication process has changed. We have a) updated this section of the Getting Started vignette and b) added screenshots to facilitate the process of creating and storing (within R/RStudio) this key.
Related to comment #39 above, we have made changes to this section to address this issue.
Thank you for this suggestion. For each of the pain points, we have added code that users can run to ensure that they have set up the components required to use the package successfully.
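As an illustration, such checks might look like the following (these exact lines are ours, not necessarily the vignette's):

```r
# Confirm a Twitter API token is available to {rtweet}
rtweet::get_token()

# Optional, for geocoding: confirm an OpenCage key is stored
Sys.getenv("OPENCAGE_KEY") != ""
```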
Thank you, we have created a hyperlink to this reference.
Thank you for catching this confusing language. We have made the suggested change.
Thank you for pointing this out. The vignette does compile on real data, but it is precomputed following the guide here (https://ropensci.org/blog/2019/12/08/precompute-vignettes/) because it takes a long time to compute. We have added this language to the vignette:
“When this vignette was run on `r format(Sys.Date(), "%b %d %y")`, the TAGS tracker had collected...”
We have added a full workflow for the package’s functions to the README. We think this will significantly help users to understand how the package’s functions work together; thank you for this suggestion.
Thank you. We have removed .DS_Store and applied usethis::git_vaccinate().
As noted in response to comment 2.9, we have completely revamped create_edgelist() to accept an input parameter `type`, and these old functions have been removed.
We have added a new function, lookup_many_users(), as well as testing coverage for it. We then updated the add_users_data() function accordingly.
Thank you for this feedback. We considered and discussed extensively whether to add additional checks for input data type and whether keys exist. We agree that this would be informative for users. But, we think the changes that we made in response to comment 45 (above) serve as an alternative way to achieve the same goal and so have decided to not add further input data checks or checks for the existence of the keys at this time.
Thank you for catching this. We have added this line as suggested.
Hi @bretsw, I've read the response, but I probably won't have time to look into the package and reply in detail until a couple of weeks from now.
@maelle, can we use the @ropensci-review-bot bot now on this issue? Just to check the package online and not only on my machine (I might forget some of the checks).
Thank you, @llrs, for taking a look so quickly at our response. I completely understand needing a few weeks to really get into it. Take your time! Thank you for your patience.
@llrs: @maelle is taking a well-deserved break, but I'll call the bot checks for you.
@ropensci-review-bot check package
Thanks, about to send the query.
git hash: b82142db
Important: All failing checks above must be addressed prior to proceeding
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:
- code in R (100% in 6 files)
- 2 authors
- 2 vignettes
- no internal data file
- 9 imported packages
- 12 exported functions (median 17 lines of code)
- 16 non-exported functions in R (median 20 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:
- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  |   value| percentile|noteworthy |
|:------------------------|-------:|----------:|:----------|
|files_R                  |       6|       35.8|           |
|files_vignettes          |       2|       83.0|           |
|files_tests              |      28|       97.1|           |
|loc_R                    |     320|       30.3|           |
|loc_vignettes            |     280|       76.9|           |
|loc_tests                | 1054848|      100.0|TRUE       |
|num_vignettes            |       2|       87.5|           |
|n_fns_r                  |      28|       29.3|           |
|n_fns_r_exported         |      12|       48.2|           |
|n_fns_r_not_exported     |      16|       24.5|           |
|n_fns_per_file_r         |       2|       34.4|           |
|num_params_per_fn        |       2|       10.7|           |
|loc_per_fn_r             |      18|       68.0|           |
|loc_per_fn_r_exp         |      18|       43.5|           |
|loc_per_fn_r_not_exp     |      20|       77.1|           |
|rel_whitespace_R         |      14|       28.6|           |
|rel_whitespace_vignettes |      71|       94.2|           |
|rel_whitespace_tests     |       0|       86.9|           |
|doclines_per_fn_exp      |      32|       37.8|           |
|doclines_per_fn_not_exp  |       0|        0.0|TRUE       |
|fn_call_network_size     |      18|       36.9|           |

---
#### 3. `goodpractice` and other checks

#### 3a. Continuous Integration Badges
I've removed the .DS_Store files (and added this to .gitignore). I've fixed the lines longer than 80 characters. I've added the 'codemeta.json' file.
Should be ready for @ropensci-review-bot check package again.
@ropensci-review-bot check package
Thanks, about to send the query.
Error (500). The editorcheck service is currently unavailable
git hash: 943a3522
Package License: MIT + file LICENSE
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:
- code in R (100% in 6 files)
- 2 authors
- 2 vignettes
- no internal data file
- 8 imported packages
- 12 exported functions (median 17 lines of code)
- 16 non-exported functions in R (median 20 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:
- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  |   value| percentile|noteworthy |
|:------------------------|-------:|----------:|:----------|
|files_R                  |       6|       40.3|           |
|files_vignettes          |       2|       85.7|           |
|files_tests              |      28|       97.7|           |
|loc_R                    |     323|       34.5|           |
|loc_vignettes            |     280|       61.3|           |
|loc_tests                | 1054848|      100.0|TRUE       |
|num_vignettes            |       2|       89.2|           |
|n_fns_r                  |      28|       38.0|           |
|n_fns_r_exported         |      12|       51.3|           |
|n_fns_r_not_exported     |      16|       34.3|           |
|n_fns_per_file_r         |       2|       41.9|           |
|num_params_per_fn        |       2|       11.9|           |
|loc_per_fn_r             |      18|       54.7|           |
|loc_per_fn_r_exp         |      18|       42.0|           |
|loc_per_fn_r_not_exp     |      20|       63.0|           |
|rel_whitespace_R         |      14|       32.6|           |
|rel_whitespace_vignettes |      71|       84.3|           |
|rel_whitespace_tests     |       0|       69.2|           |
|doclines_per_fn_exp      |      32|       38.2|           |
|doclines_per_fn_not_exp  |       0|        0.0|TRUE       |
|fn_call_network_size     |      18|       44.2|           |

---
#### 3. `goodpractice` and other checks

#### 3a. Continuous Integration Badges
I'm testing it now with R 4.1.2
1.1 There is a missing [ on the first line of Key task 1. I'm not sure now what I meant, but I find the titles and indices of the vignettes are now appropriate. Note that there are some links related to the other vignette that are hardcoded; it is usually recommended to point readers to vignette("setup", package = "tidytags").
Also note that users don't see the options used in the code chunks, so mentions of "(notice that eval = FALSE)" (found on the last text line before pull_tweet_data) are not useful for the reader without accessing the source of the vignette.
1.2 The vignettes are now very clear
1.3 As you wish
1.4 Much better documented, thanks.
1.5 Thanks, was the discussion about price kept on purpose?
1.6 Great!
1.7 `get_url_domain` uses the longurl package without checking if it is installed; either add it to Imports or check via `if (requireNamespace("longurl", quietly = TRUE)) {...}`. `geocode_tags` uses opencage unconditionally; as you said it is optional, I would use `if (requireNamespace("opencage", quietly = TRUE)) {...}`. Check that other suggested packages are used conditionally (I think mapview is also used in the vignette but not checked if installed), so that users receive a friendly message when using a function without the package (not only in the vignette, where I would delete this information).
1.8 I couldn't find any requireNamespace call in the package.
1.9 Thanks, that will help the users.
1.13 I'm sure the changes made will help all R users using the package.
1.21 This comment was meant as a heads-up about the work it will take to maintain a package that depends on 3 different APIs, and about the upcoming changes in rtweet; not that the changes in rtweet could affect the review.
1.28 If users have not set up the rtweet account correctly, they don't receive any information, and the tmp_df from the get_upstream_tweets example is empty, resulting in an error when using get_upstream_tweets. Maybe adding a check in the function to see whether some tweets were collected might be helpful.
Many thanks for all the changes.
Hours spent: ~3 (Is it still needed to report this?)
@llrs thank you for your quick turnaround on your follow-up review! I'm unplugged this week, but I will clean up these remaining few things on Monday. Thank you again!
Thanks @bretsw for the detailed response & @llrs for the new feedback!
Regarding empty cassettes, it might be worth subscribing to https://github.com/ropensci/vcr/issues/244
@llrs yes I will update the time spent, thank you for reporting that as well!
@llrs, thank you for your follow-up review and comments. I have made further updates as requested:

- I have fixed the missing [, replaced the hardcoded vignette links with vignette() calls, and removed the comment about setting "eval = FALSE".
- I have added `if (!requireNamespace("package_name", quietly = TRUE)) {stop("Please install the {package_name} package to use this function", call. = FALSE)}` for {longurl} and {urltools} inside the get_url_domain() function, for {opencage} inside the geocode_tags() function, and for {beepr} inside the lookup_many_tweets() and lookup_many_users() functions.
- I have added `if (requireNamespace("package_name", quietly = TRUE))` checks for {mapview}, {sf}, {ggplot2}, {ggraph}, and {tidygraph} in the function documentation and vignettes. This should cover all suggested-but-not-imported packages.
- I have added a check, `if (nrow(unknown_upstream) == 0) {...} else {...}`, to the get_upstream_tweets() function. Please let me know if this is sufficient for the concern you have raised, or if there needs to be another check added to get_upstream_tweets() or even pull_tweet_data().

@maelle, could we remove the holding label for this thread?
I think that the package is now in great shape @bretsw, thanks for adding me as reviewer on your package description file.
I don't think I need to mark anything on the current template to signal the review as finished (wasn't there a check mark on the previous version of the reviewer's template?). On my side, unless further discussion or more feedback is wanted, I do not have anything else to suggest as improvements. I will unsubscribe from the thread, but ping me to let me know if there is anything I can help with.
Thank you, @llrs, for your feedback, guidance, and patience. I have been very grateful for this review process.
@llrs thank you! Here's the reviewer approval comment template if you have time: https://devguide.ropensci.org/approval2template.html
@marionlouveaux would you have time to take a last look at the package?
Estimated hours spent reviewing: ~3
Thank you for responding to my review and adding me as a reviewer on your package. The package documentation is much better and much clearer. I found one minor aesthetic issue in the help of pull_tweet_data(): I noticed some formatting issues such as \code{n} (in the example section). I suggest using roxygen2md::roxygen2md() to fix these (see https://roxygen2md.r-lib.org/).
I still approve the package now so as not to delay the process further.
Estimated hours spent reviewing: 1
@marionlouveaux, thank you so much for your follow up review. @jrosen48 and I are truly grateful!
Thank you for pointing out roxygen2md::roxygen2md(); what a handy function! I've cleaned up the documentation and pushed the changes.
Thanks a ton @marionlouveaux!
@ropensci-review-bot approve tidytags
Approved! Thanks @bretsw for submitting and @llrs, @marionlouveaux for your reviews! :grin:
To-dos:

- Write a comment `@ropensci-review-bot finalize transfer of <package-name>`, where `<package-name>` is the repo/package name. This will give you admin access back.
- If you already have a `pkgdown` website and are OK relying only on rOpenSci central docs building and branding, you can replace your `pkgdown` website with a page redirecting to the https://docs.ropensci.org/package_name URL.
- Add the docs link to the `URL` field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
- Run `codemetar::write_codemeta()` in the root of your package.
- The package can now be installed with `install.packages("<package-name>", repos = "https://ropensci.r-universe.dev")` thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them `"rev"`-type contributors in the `Authors@R` field (with their consent).
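For instance, a "rev" entry in the DESCRIPTION might be sketched like this (names and comment text are placeholders):

```r
Authors@R: c(
    person("Bret", "Staudt Willet", role = c("aut", "cre")),
    person("Reviewer", "Name", role = "rev",
           comment = "Reviewed the package for rOpenSci")
  )
```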
Welcome aboard! We'd love to host a post about your package: either a short introduction to it with an example for a technical audience, or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.
We maintain an online book with our best practices and tips; this chapter starts the third section, which gives guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.
Last but not least, you can volunteer as a reviewer by filling out a short form.
@ropensci-review-bot finalize transfer of tidytags
Transfer completed. The `tidytags` team is now owner of the repository.
Hi @maelle, so I thought I updated everything as requested for the transfer to rOpenSci, but now all my GitHub actions are failing. Any suggestions? https://github.com/ropensci/tidytags/actions
@maelle, could you also remove the `4/review(s)-in-awaiting-changes` tag? I'm wondering if this makes the badge in the README show up as "Under Review" instead of "Peer Reviewed"...
Looks like I fixed the GitHub Pages issue (needed to change the build to happen from the root directory in Settings) and got codecov to work (needed to update the key following the transfer to rOpenSci) - https://github.com/ropensci/tidytags
For the R-CMD-CHECK, it looks like some issue with the terra package is tripping it up; probably nothing to be done at this point.
Thanks for the updates, happy to look again if the installation issue doesn't resolve itself within a few days.
Date accepted: 2022-01-31
Submitting Author Name: Bret Staudt Willet
Submitting Author Github Handle: @bretsw
Other Package Authors Github handles: @jrosen48
Repository: https://github.com/bretsw/tidytags
Version submitted: 0.1.0
Submission type: Standard
Editor: @maelle
Reviewers: @llrs, @marionlouveaux
Due date for @llrs: 2021-04-19
Due date for @marionlouveaux: 2021-04-27
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
{tidytags} allows for both simple data collection and thorough data analysis. In short, {tidytags} first uses a Twitter Archiving Google Sheet (TAGS) to easily collect tweet ID numbers and then uses the R package {rtweet} to re-query the Twitter API to collect additional metadata. {tidytags} also introduces new functions developed to facilitate systematic yet flexible analyses of data from Twitter.
The target users for {tidytags} are social scientists (e.g., educational researchers) who have an interest in studying Twitter data but are relatively new to R, data science, or social network analysis. {tidytags} scaffolds tweet collection and analysis through a simple workflow that still allows for robust analyses.
{tidytags} wraps together functionality from several useful R packages, including {googlesheets4} to bring data from the TAGS tracker into R and {rtweet} for retrieving additional tweet metadata. The contribution of {tidytags} is to bring together the affordance of TAGS to easily collect tweets over time (which is not straightforward with {rtweet}) and the utility of {rtweet} for collecting additional data (which are missing from TAGS). Finally, {tidytags} reshapes data in preparation for geolocation and social network analyses that should be accessible to relatively new R users.
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
- [x] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [x] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct