Pre-submission inquiry for {excluder}: Checks for Exclusion Criteria in Online Data

JeffreyRStevens commented 3 years ago

Submitting Author: Jeffrey Stevens (@JeffreyRStevens)
Repository: https://github.com/JeffreyRStevens/excluder
Submission type: Pre-submission

Paste the full DESCRIPTION file inside a code block below:

Package: excluder
Title: Checks for Exclusion Criteria in Online Data
Version: 0.2.1
Authors@R: 
    person(given = "Jeffrey R.",
           family = "Stevens",
           role = c("aut", "cre"),
           email = "jeffrey.r.stevens@gmail.com",
           comment = c(ORCID = "0000-0003-2375-1360"))
Description: Data that are collected through online sources such as Mechanical 
            Turk may require excluding data because of IP address duplication, 
            geolocation, or completion duration. This package facilitates
            exclusion of these data for Qualtrics datasets.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
URL: https://jeffreyrstevens.github.io/excluder/, https://github.com/jeffreyrstevens/excluder/
BugReports: https://github.com/jeffreyrstevens/excluder/issues/
Imports: 
    dplyr,
    iptools,
    janitor,
    lubridate,
    maps,
    tidyr,
    magrittr,
    lifecycle,
    rlang
Depends: 
    R (>= 3.5.0)
Suggests: 
    testthat (>= 3.0.0),
    readr,
    stringr,
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):

Data Lifecycle Packages
- [ ] data retrieval
- [ ] data extraction
- [ ] database access
- [X] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [ ] text data
  
Statistical Packages
- [ ] Bayesian and Monte Carlo Routines
- [ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
- [ ] Machine Learning
- [ ] Regression and Supervised Learning
- [ ] Exploratory Data Analysis (EDA) and Summary Statistics
- [ ] Spatial Analyses
- [ ] Time Series Analyses

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:

The package falls under data munging because it processes data from Qualtrics or other online sources by checking for, marking, and excluding rows of data frames for common exclusion criteria (e.g., IP addresses outside of the United States or duplicate entries from the same location/IP address).
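
As a rough illustration of the check/mark/exclude workflow described above, here is a minimal base R sketch. It does not use the excluder API; the data frame, its column names (`ResponseId`, `IPAddress`, `Duration`), and the 60-second threshold are all assumptions made up for this example.

```r
# Hypothetical survey export; column names and values are illustrative only.
responses <- data.frame(
  ResponseId = c("R_1", "R_2", "R_3", "R_4"),
  IPAddress = c("192.0.2.10", "192.0.2.10", "198.51.100.7", "203.0.113.5"),
  Duration = c(310, 295, 12, 480),
  stringsAsFactors = FALSE
)

# Check: flag every row whose IP address appears more than once.
dup_ip <- duplicated(responses$IPAddress) |
  duplicated(responses$IPAddress, fromLast = TRUE)

# Check: flag rows completed faster than an arbitrary 60-second threshold.
too_fast <- responses$Duration < 60

# Mark: record the combined result in a new column.
responses$exclude <- dup_ip | too_fast

# Exclude: keep only the rows that met none of the criteria.
kept <- responses[!responses$exclude, ]
```

Marking before excluding keeps the flagged rows available for inspection, so the exclusion criteria can be reported alongside the final sample.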

If submitting a statistical package, have you already incorporated documentation of standards into your code via the srr package?

N/A

Who is the target audience and what are scientific applications of this package?

The target audience is data scientists using Qualtrics or other online systems to collect data from participants (e.g., Mechanical Turk workers). Ensuring good data quality from these participants can be tricky. For instance, while Mechanical Turk in theory screens workers based on location (e.g., if you want to restrict your participant pool to workers in the United States), this is not necessarily represented in the data. Finding the tools to screen for IP address location can be difficult, and this package simplifies checking for and excluding participants based on common data that Qualtrics reports, such as geolocation, IP address, duplicate records from the same location, participant screen resolution, participant progress through the survey, and survey completion duration.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

There are no similar packages to my knowledge. The {qualtRics} package at rOpenSci focuses on importing data from Qualtrics. The {MTurkR} package directly interfaces with the MTurk Requester API, but the APIs have been deprecated and the package has been removed from CRAN.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Yes, it seems to comply with this guidance. Depending on the data that the user collects, there could be personally identifiable information accessed by this package. In particular, IP addresses that are recorded by Qualtrics can be processed with this package. Note that the package only works with personally identifiable information from data sets that already exist on the users' local file system, and the package does not collect or transmit data in any way. The package also includes a function, deidentify(), that the user can use to strip location, IP address, language, and even participant computer information (e.g., operating system, web browser, screen resolution) from the data frames to deidentify them.

Any other questions or issues we should be aware of?:

I wanted to raise this pre-submission enquiry here because it seems like this package nicely complements the rOpenSci {qualtRics} package.

JeffreyRStevens commented 3 years ago

Be gentle---it's my first R package!

noamross commented 3 years ago

Thanks for opening this inquiry, @JeffreyRStevens, and we're glad you considered rOpenSci for your first package! excluder is well in-scope as it deals with data-manipulation tasks specific to a scientific data source. We look forward to a full submission.

JeffreyRStevens commented 3 years ago

Thanks, @noamross. Just to be clear, should I close this issue and start a new one for the full submission of the package? Or just add the submission here?

noamross commented 3 years ago

I'll close this and you can start a full submission whenever you're ready!

ropensci / software-review

Pre-submission inquiry for {excluder}: Checks for Exclusion Criteria in Online Data #454