Closed JeffreyRStevens closed 3 years ago
Be gentle---it's my first R package!
Thanks for opening this inquiry, @JeffreyRStevens, and we're glad you considered rOpenSci for your first package! excluder is well in-scope as it deals with data-manipulation tasks specific to a scientific data source. We look forward to a full submission.
Thanks, @noamross. Just to be clear, should I close this issue and start a new one for the full submission of the package? Or just add the submission here?
I'll close this and you can start a full submission whenever you're ready!
Submitting Author: Jeffrey Stevens (@JeffreyRStevens)
Repository: https://github.com/JeffreyRStevens/excluder Submission type: Pre-submission
Scope
Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):
Data Lifecycle Packages
[ ] data retrieval
[ ] data extraction
[ ] database access
[X] data munging
[ ] data deposition
[ ] workflow automation
[ ] version control
[ ] citation management and bibliometrics
[ ] scientific software wrappers
[ ] database software bindings
[ ] geospatial data
[ ] text data
Statistical Packages
[ ] Bayesian and Monte Carlo Routines
[ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
[ ] Machine Learning
[ ] Regression and Supervised Learning
[ ] Exploratory Data Analysis (EDA) and Summary Statistics
[ ] Spatial Analyses
[ ] Time Series Analyses
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
The package falls under data munging because it processes data from Qualtrics or other online sources by checking for, marking, and excluding rows of data frames for common exclusion criteria (e.g., IP addresses outside of the United States or duplicate entries from the same location/IP address).
N/A
The target audience is data scientists using Qualtrics or other online systems to collect data from participants (e.g., Mechanical Turk workers). Ensuring good data quality from these participants can be tricky. For instance, while Mechanical Turk in theory screens workers based on location (e.g., if you want to restrict your participant pool to workers in the United States), this is not necessarily represented in the data. Finding the tools to screen for IP address location can be tricky, and this package simplifies checking for and excluding participants based on common data that Qualtrics reports such as geolocation, IP address, duplicate records from the same location, participant screen resolution, participant progress through the survey, and survey completion duration.
There are no similar packages to my knowledge. The {qualtRics} package at rOpenSci focuses on importing data from Qualtrics. The {MTurkR} package directly interfaces with the MTurk Requestor API, but the APIs have been deprecated and the package has been removed from CRAN.
Yes, it seems to comply with this guidance. Depending on the data that the user collects, there could be personally identifiable information accessed by this package. In particular, IP addresses that are recorded by Qualtrics can be processed with this package. Note that the package only works with personally identifiable information from data sets that already exist on the users' local file system, and the package does not collect or transmit data in any way. The package also includes a function
deindentify()
that the user can use to strip location, IP address, language and even participant computer information (e.g., operating system, web browser, screen resolution) from the data frames to deidentify them.I wanted to raise this pre-submission enquiry here because it seems like this package nicely complements the rOpenSci {qualtRics} package.