ropensci / software-review

rOpenSci Software Peer Review.

excluder: checks for exclusion criteria in online data #455

Closed JeffreyRStevens closed 3 years ago

JeffreyRStevens commented 3 years ago

Date accepted: 2021-11-04
Submitting Author Name: Jeffrey Stevens
Submitting Author Github Handle: @JeffreyRStevens
Repository: https://github.com/JeffreyRStevens/excluder
Version submitted: 0.2.2
Submission type: Standard
Editor: @maurolepore
Reviewers: @juliasilge, @jmobrien

Due date for @juliasilge: 2021-09-20
Due date for @jmobrien: 2021-09-20

Archive: TBD
Version accepted: TBD


Package: excluder
Title: Checks for Exclusion Criteria in Online Data
Version: 0.2.2
Authors@R: 
    person(given = "Jeffrey R.",
           family = "Stevens",
           role = c("aut", "cre"),
           email = "jeffrey.r.stevens@gmail.com",
           comment = c(ORCID = "0000-0003-2375-1360"))
Description: Data that are collected through online sources such as Mechanical 
            Turk may require excluding data because of IP address duplication, 
            geolocation, or completion duration. This package facilitates
            exclusion of these data for Qualtrics datasets.
License: GPL (>= 3)
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
URL: https://jeffreyrstevens.github.io/excluder/, https://github.com/jeffreyrstevens/excluder/
BugReports: https://github.com/jeffreyrstevens/excluder/issues/
Imports: 
    dplyr,
    iptools,
    janitor,
    lubridate,
    maps,
    tidyr,
    magrittr,
    rlang
Depends: 
    R (>= 3.5.0)
Suggests: 
    testthat (>= 3.0.0),
    readr,
    stringr,
    covr,
    knitr,
    rmarkdown,
    lifecycle
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

The package falls under data munging because it processes data from Qualtrics or other online sources by checking for, marking, and excluding rows of data frames for common exclusion criteria (e.g., IP addresses outside of the United States or duplicate entries from the same location/IP address).

The target audience is data scientists using Qualtrics or other online systems to collect data from participants (e.g., Mechanical Turk workers). Ensuring good data quality from these participants can be tricky. For instance, while Mechanical Turk in theory screens workers based on location (e.g., if you want to restrict your participant pool to workers in the United States), this is not necessarily represented in the data. Finding the tools to screen for IP address location can be tricky, and this package simplifies checking for and excluding participants based on common data that Qualtrics reports such as geolocation, IP address, duplicate records from the same location, participant screen resolution, participant progress through the survey, and survey completion duration.

There are no similar packages to my knowledge. The {qualtRics} package at rOpenSci focuses on importing data from Qualtrics. The {MTurkR} package directly interfaces with the MTurk Requestor API, but the APIs have been deprecated and the package has been removed from CRAN.

Yes, it seems to comply with this guidance. Depending on the data that the user collects, there could be personally identifiable information accessed by this package. In particular, IP addresses that are recorded by Qualtrics can be processed with this package. Note that the package only works with personally identifiable information from data sets that already exist on the users' local file system, and the package does not collect or transmit data in any way. The package also includes a function deidentify() that the user can use to strip location, IP address, language and even participant computer information (e.g., operating system, web browser, screen resolution) from the data frames to deidentify them.
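To illustrate the kind of de-identification described above, here is a minimal base-R sketch. It is not the package's `deidentify()` implementation: the helper name `drop_identifying_cols()` and the default column names are assumptions chosen for illustration only.

```r
# Hypothetical helper (not excluder's deidentify()): drop columns that could
# identify a participant. The default column names are assumptions.
drop_identifying_cols <- function(x,
                                  cols = c("IPAddress", "LocationLatitude",
                                           "LocationLongitude", "UserLanguage")) {
  stopifnot(is.data.frame(x))
  # Keep every column whose name is not in `cols`; absent names are ignored.
  x[, !(names(x) %in% cols), drop = FALSE]
}

# Made-up example data using documentation-reserved IP addresses.
survey <- data.frame(
  IPAddress = c("192.0.2.1", "198.51.100.7"),
  LocationLatitude = c(40.8, 41.2),
  LocationLongitude = c(-96.7, -95.9),
  UserLanguage = c("EN", "EN"),
  Q1 = c(3, 5),
  stringsAsFactors = FALSE
)

deidentified <- drop_identifying_cols(survey)
names(deidentified)
```

Everything happens on a data frame already in memory, which matches the statement that nothing is collected or transmitted.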

https://github.com/ropensci/software-review/issues/454

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

noamross commented 3 years ago

@ropensci-review-bot check package

ropensci-review-bot commented 3 years ago

Thanks, about to send the query.

ropensci-review-bot commented 3 years ago

Checks for excluder (v0.2.2)

git hash: 1d8446c9

Important: All failing checks above must be addressed prior to proceeding

Package License: GPL (>= 3)


1. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

- code in R (100% in 35 files) and
- 1 authors
- 1 vignette
- 3 internal data files
- 8 imported packages
- 54 exported functions (median 16 lines of code)
- no non-exported function in R (median 20 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages

The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    35|       91.1|           |
|files_vignettes          |     1|       64.8|           |
|files_tests              |     6|       81.5|           |
|loc_R                    |   607|       50.7|           |
|loc_vignettes            |   118|       57.1|           |
|loc_tests                |   252|       57.1|           |
|num_vignettes            |     1|       60.7|           |
|data_size_total          | 18483|       74.3|           |
|data_size_median         |  6139|       75.9|           |
|n_fns_r                  |    54|       50.8|           |
|n_fns_r_exported         |    54|       88.4|           |
|n_fns_r_not_exported     |     0|        0.0|TRUE       |
|n_fns_per_file_r         |     1|        0.0|TRUE       |
|num_params_per_fn        |     4|       54.3|           |
|loc_per_fn_r             |    18|       69.6|           |
|loc_per_fn_r_exp         |    16|       39.6|           |
|loc_per_fn_r_not_exp     |    20|       77.1|           |
|rel_whitespace_R         |    19|       53.9|           |
|rel_whitespace_vignettes |    34|       75.0|           |
|rel_whitespace_tests     |    15|       78.7|           |
|doclines_per_fn_exp      |    48|       61.4|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |    40|       57.0|           |

---

1a. Network visualisation

Interactive network visualisation of calls between objects in package can be viewed by clicking here


2. goodpractice and other checks

Details of goodpractice and other checks (click to open)

### 3a. Continuous Integration Badges

[![github](https://github.com/jeffreyrstevens/excluder/workflows/R-CMD-check/badge.svg)](https://github.com/jeffreyrstevens/excluder/actions)

**GitHub Workflow Results**

|name          |conclusion |sha    |date       |
|:-------------|:----------|:------|:----------|
|pkgdown       |success    |1d8446 |2021-07-26 |
|R-CMD-check   |success    |1d8446 |2021-07-26 |
|test-coverage |success    |1d8446 |2021-07-26 |

---

### 3b. `goodpractice` results

### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking Rd cross-references ... NOTE
   Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’

### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 81.79

### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function have cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
check_duplicates | 19

### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 201 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 201


Package Versions

|package  |version   |
|:--------|:---------|
|pkgstats |0.0.0.265 |
|pkgcheck |0.0.1.367 |


Editor-in-Chief Instructions:

This package may be submitted

JeffreyRStevens commented 3 years ago

The only check that failed was "Package does not have a 'contributing.md' file". Does this mean that the package should not have a contributing.md file? So should I just remove the file and remove all references to Contributing to this package?

noamross commented 3 years ago

Thanks for the submission @JeffreyRStevens! Sorry for a bit of a delay as we were working out our new automated diagnostics bot, which your review is the first to use, and you just found a bug in! Your CONTRIBUTING.md file is fine, we are just failing to check the right subfolder. We'll move ahead with this :)

JeffreyRStevens commented 3 years ago

Ah, OK--no problem. Glad to be a guinea pig to help debug!

maurolepore commented 3 years ago

@JeffreyRStevens, it's my pleasure to be the handling editor of your submission.

Editor checks:


Editor comments

Congratulations! The bot and I are very happy to see the package meets rOpenSci's guidelines :+1:. I'll start looking for reviewers.

Note the following minor issues, which you might want to consider before the review:

❯ checking Rd cross-references ... NOTE
  Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’
# Good
dupl_ip <- TRUE
if (identical(dupl_ip, TRUE)) {
  message("Do something.")
}
#> Do something.

# Good
dupl_ip <- TRUE
stopifnot(length(dupl_ip) == 1L)
if (dupl_ip) {
  message("Do something.")
}
#> Do something.

# Fragile
dupl_ip <- c(TRUE, FALSE)  # This might accidentally have length > 1
if (dupl_ip == TRUE) {
  message("Do something.")
}
#> Warning in if (dupl_ip == TRUE) {: the condition has length > 1 and only the
#> first element will be used
#> Do something.

Created on 2021-08-07 by the reprex package (v2.0.0)

check-mark-exclude: 
Note: Using an external vector in selections is ambiguous.
i Use `all_of(location_col)` instead of `location_col` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
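The tidyselect message above can be addressed as it suggests. A minimal sketch, assuming a character vector `location_col` of column names and made-up survey data (the data frame is illustrative, not taken from the package):

```r
library(dplyr)

# Made-up data; column names are assumptions for illustration.
survey <- data.frame(
  LocationLatitude = c(40.8, 41.2),
  LocationLongitude = c(-96.7, -95.9),
  Q1 = c(3, 5)
)

# Column names stored in an external character vector.
location_col <- c("LocationLatitude", "LocationLongitude")

# all_of() states explicitly that `location_col` holds column names rather
# than being a column itself, and errors if any name is missing from the data.
locations <- select(survey, all_of(location_col))
names(locations)
```

Wrapping the vector in `all_of()` removes the ambiguity between a column called `location_col` and an external variable of that name.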

Reviewers:

Due date:

maurolepore commented 3 years ago

@ropensci-review-bot seeking reviewers

ropensci-review-bot commented 3 years ago

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/455_status.svg)](https://github.com/ropensci/software-review/issues/455)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

maurolepore commented 3 years ago

@JeffreyRStevens, could you suggest two or three potential reviewers? Although I wouldn't pick more than one, your list will inform the type of expertise you think would be useful when reviewing {excluder}.

I'll also use other criteria as described in How to look for reviewers.

JeffreyRStevens commented 3 years ago

@maurolepore thank you for your speedy initial review of {excluder}. I have pushed some changes to address your comments.

Thank you again for your careful review, and let me know if I missed something.

maurolepore commented 3 years ago

Thanks @JeffreyRStevens for responding quickly.

Output

``` r
devtools::load_all()
#> ℹ Loading excluder
packageVersion("excluder")
#> [1] '0.2.2'
gert::git_log(max = 1)
#> # A tibble: 1 × 6
#>   commit author time files merge message
#> *
#> 1 d99bc23850c3f… Jeffrey R. Ste… 2021-08-08 15:06:45 29 FALSE "Change all te…
devtools::check()
#> ℹ Updating excluder documentation
#> ℹ Loading excluder
#> Warning: [/home/mauro/git/excluder/R/check_duplicates.R:9] @details Link
#> to unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_duration.R:9] @details Link to
#> unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_ip.R:9] @details Link to unavailable
#> package: qualtRics::fetch_survey. there is no package called 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_location.R:9] @details Link to
#> unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_preview.R:9] @details Link to
#> unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_progress.R:9] @details Link to
#> unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/check_resolution.R:10] @details Link
#> to unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Warning: [/home/mauro/git/excluder/R/qualtrics_numeric.R:3] @description Link
#> to unavailable package: rgeolocate::ip2location. there is no package called
#> 'rgeolocate'
#> Warning: [/home/mauro/git/excluder/R/qualtrics_raw.R:3] @description Link
#> to unavailable package: rgeolocate::ip2location. there is no package called
#> 'rgeolocate'
#> Warning: [/home/mauro/git/excluder/R/qualtrics_text.R:3] @description Link
#> to unavailable package: rgeolocate::ip2location. there is no package called
#> 'rgeolocate'
#> Warning: [/home/mauro/git/excluder/R/remove_label_rows.R:7] @details Link
#> to unavailable package: qualtRics::fetch_survey. there is no package called
#> 'qualtRics'
#> Writing NAMESPACE
#> Writing NAMESPACE
#> ── Building ──────────────────────────────────────────────────────── excluder ──
#> Setting env vars:
#> • CFLAGS : -Wall -pedantic
#> • CXXFLAGS : -Wall -pedantic
#> • CXX11FLAGS: -Wall -pedantic
#> ────────────────────────────────────────────────────────────────────────────────
#> checking for file ‘/home/mauro/git/excluder/DESCRIPTION’ ... ✓ checking for file ‘/home/mauro/git/excluder/DESCRIPTION’
#> ─ preparing ‘excluder’:
#> checking DESCRIPTION meta-information ... ✓ checking DESCRIPTION meta-information
#> ─ installing the package to build vignettes
#> creating vignettes ... ✓ creating vignettes (2.6s)
#> ─ checking for LF line-endings in source and make files and shell scripts
#> ─ checking for empty or unneeded directories
#> ─ building ‘excluder_0.2.2.tar.gz’
#>
#> ── Checking ──────────────────────────────────────────────────────── excluder ──
#> Setting env vars:
#> • _R_CHECK_CRAN_INCOMING_USE_ASPELL_: TRUE
#> • _R_CHECK_CRAN_INCOMING_REMOTE_ : FALSE
#> • _R_CHECK_CRAN_INCOMING_ : FALSE
#> • _R_CHECK_FORCE_SUGGESTS_ : FALSE
#> • NOT_CRAN : true
#> ── R CMD check ─────────────────────────────────────────────────────────────────
#> * using log directory ‘/tmp/Rtmpcusbks/excluder.Rcheck’
#> * using R version 4.1.0 (2021-05-18)
#> * using platform: x86_64-pc-linux-gnu (64-bit)
#> * using session charset: UTF-8
#> * using options ‘--no-manual --as-cran’
#> * checking for file ‘excluder/DESCRIPTION’ ... OK
#> * this is package ‘excluder’ version ‘0.2.2’
#> * package encoding: UTF-8
#> * checking package namespace information ... OK
#> * checking package dependencies ... OK
#> * checking if this is a source package ... OK
#> * checking if there is a namespace ... OK
#> * checking for executable files ... OK
#> * checking for hidden files and directories ... OK
#> * checking for portable file names ... OK
#> * checking for sufficient/correct file permissions ... OK
#> * checking whether package ‘excluder’ can be installed ... OK
#> * checking installed package size ... OK
#> * checking package directory ... OK
#> * checking for future file timestamps ... OK
#> * checking ‘build’ directory ... OK
#> * checking DESCRIPTION meta-information ... OK
#> * checking top-level files ... NOTE
#>   Non-standard files/directories found at top level:
#>   ‘mid-guppy_reprex.R’ ‘mid-guppy_reprex.md’ ‘ok-coqui_reprex.R’
#>   ‘ok-coqui_reprex.spin.R’ ‘ok-coqui_reprex.spin.Rmd’
#> * checking for left-over files ... OK
#> * checking index information ... OK
#> * checking package subdirectories ... OK
#> * checking R files for non-ASCII characters ... OK
#> * checking R files for syntax errors ... OK
#> * checking whether the package can be loaded ... OK
#> * checking whether the package can be loaded with stated dependencies ... OK
#> * checking whether the package can be unloaded cleanly ... OK
#> * checking whether the namespace can be loaded with stated dependencies ... OK
#> * checking whether the namespace can be unloaded cleanly ... OK
#> * checking loading without being on the library search path ... OK
#> * checking dependencies in R code ... OK
#> * checking S3 generic/method consistency ... OK
#> * checking replacement functions ... OK
#> * checking foreign function calls ... OK
#> * checking R code for possible problems ... NOTE
#>   collapse_exclusions: no visible binding for global variable
#>   ‘exclusions’
#>   Undefined global functions or variables:
#>   exclusions
#> * checking Rd files ... OK
#> * checking Rd metadata ... OK
#> * checking Rd line widths ... OK
#> * checking Rd cross-references ... NOTE
#>   Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’
#> * checking for missing documentation entries ... OK
#> * checking for code/documentation mismatches ... OK
#> * checking Rd \usage sections ... OK
#> * checking Rd contents ... OK
#> * checking for unstated dependencies in examples ... OK
#> * checking contents of ‘data’ directory ... OK
#> * checking data for non-ASCII characters ... OK
#> * checking LazyData ... OK
#> * checking data for ASCII and uncompressed saves ... OK
#> * checking installed files from ‘inst/doc’ ... OK
#> * checking files in ‘vignettes’ ... OK
#> * checking examples ... OK
#> * checking for unstated dependencies in ‘tests’ ... OK
#> * checking tests ...
#>   Running ‘testthat.R’
#>   OK
#> * checking for unstated dependencies in vignettes ... OK
#> * checking package vignettes in ‘inst/doc’ ... OK
#> * checking re-building of vignette outputs ... OK
#> * checking for non-standard things in the check directory ... OK
#> * checking for detritus in the temp directory ... OK
#> * DONE
#>
#> Status: 3 NOTEs
#> See
#> ‘/tmp/Rtmpcusbks/excluder.Rcheck/00check.log’
#> for details.
#> ── R CMD check results ───────────────────────────────────── excluder 0.2.2 ────
#> Duration: 56.5s
#>
#> > checking top-level files ... NOTE
#>   Non-standard files/directories found at top level:
#>   ‘mid-guppy_reprex.R’ ‘mid-guppy_reprex.md’ ‘ok-coqui_reprex.R’
#>   ‘ok-coqui_reprex.spin.R’ ‘ok-coqui_reprex.spin.Rmd’
#>
#> > checking R code for possible problems ... NOTE
#>   collapse_exclusions: no visible binding for global variable
#>   ‘exclusions’
#>   Undefined global functions or variables:
#>   exclusions
#>
#> > checking Rd cross-references ... NOTE
#>   Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’
#>
#> 0 errors ✓ | 0 warnings ✓ | 3 notes x
```
Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os Ubuntu 20.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Mazatlan
#> date 2021-08-10
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> ! package * version date lib source
#> AsioHeaders 1.16.1-1 2020-07-07 [1] RSPM (R 4.1.0)
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.1.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
#> cachem 1.0.5 2021-05-15 [1] CRAN (R 4.1.0)
#> callr 3.7.0 2021-04-20 [1] CRAN (R 4.1.0)
#> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
#> commonmark 1.7 2018-12-01 [1] CRAN (R 4.1.0)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
#> credentials 1.3.1 2021-07-25 [1] RSPM (R 4.1.0)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
#> desc 1.3.0 2021-03-05 [1] CRAN (R 4.1.0)
#> devtools 2.4.2 2021-06-07 [1] CRAN (R 4.1.0)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
#> dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
#> P excluder * 0.2.2 2021-08-07 [?] local
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> fs 1.5.0 2020-07-31 [1] RSPM (R 4.1.0)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
#> gert 1.3.1 2021-06-23 [1] CRAN (R 4.1.0)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
#> iptools 0.6.1 2018-12-09 [1] RSPM (R 4.1.0)
#> janitor 2.1.0 2021-01-05 [1] CRAN (R 4.1.0)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
#> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
#> maps 3.3.0 2018-04-03 [1] RSPM (R 4.0.3)
#> memoise 2.0.0 2021-01-26 [1] CRAN (R 4.1.0)
#> openssl 1.4.4 2021-04-30 [1] CRAN (R 4.1.0)
#> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
#> pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> pkgload 1.2.1 2021-04-06 [1] CRAN (R 4.1.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.1.0)
#> processx 3.5.2 2021-04-30 [1] CRAN (R 4.1.0)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
#> rcmdcheck 1.3.3 2019-05-07 [1] CRAN (R 4.1.0)
#> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
#> readr 2.0.1 2021-08-10 [1] CRAN (R 4.1.0)
#> remotes 2.4.0 2021-06-02 [1] CRAN (R 4.1.0)
#> reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
#> rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.1.0)
#> roxygen2 7.1.1 2020-06-27 [1] CRAN (R 4.1.0)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
#> snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.1.0)
#> stringi 1.7.3 2021-07-16 [1] CRAN (R 4.1.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> styler 1.5.1 2021-07-13 [1] CRAN (R 4.1.0)
#> sys 3.4 2020-07-23 [1] CRAN (R 4.1.0)
#> testthat * 3.0.4 2021-07-01 [1] CRAN (R 4.1.0)
#> tibble 3.1.3 2021-07-23 [1] RSPM (R 4.1.0)
#> tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
#> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
#> triebeard 0.3.0 2016-08-04 [1] RSPM (R 4.1.0)
#> tzdb 0.1.2 2021-07-20 [1] RSPM (R 4.1.0)
#> usethis 2.0.1 2021-02-10 [1] CRAN (R 4.1.0)
#> utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
#> xfun 0.25 2021-08-06 [1] CRAN (R 4.1.0)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
#> xopen 1.0.0 2018-09-17 [1] CRAN (R 4.1.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] /home/mauro/R/x86_64-pc-linux-gnu-library/4.1
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> P ── Loaded and on-disk path mismatch.
```


Thanks for your work and suggestions for reviewers.

JeffreyRStevens commented 3 years ago

Many thanks for the clarifications and help, @maurolepore.

maurolepore commented 3 years ago

@ropensci-review-bot add @juliasilge to reviewers

ropensci-review-bot commented 3 years ago

@juliasilge added to the reviewers list. Review due date is 2021-09-20. Thanks @juliasilge for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 3 years ago

@juliasilge: If you haven't done so, please fill this form for us to update our reviewers records.

maurolepore commented 3 years ago

@ropensci-review-bot add @jmobrien to reviewers

ropensci-review-bot commented 3 years ago

@jmobrien added to the reviewers list. Review due date is 2021-09-20. Thanks @jmobrien for accepting to review! Please refer to our reviewer guide.

ropensci-review-bot commented 3 years ago

@jmobrien: If you haven't done so, please fill this form for us to update our reviewers records.

maurolepore commented 3 years ago

@JeffreyRStevens, I'm thrilled that @juliasilge and @jmobrien agreed to review the excluder package.

Note @ropensci-review-bot set the due date following rOpenSci's guidelines to 2021-09-20. However, these are difficult times and I would like to allow reviewers 1 extra week. Feel free to submit your review whenever it is ready; if I haven't received it by then, I'll touch base on 2021-09-27.

I look forward to working with you all.

juliasilge commented 3 years ago

Package Review

Congratulations to the author on this useful package for folks handling Qualtrics surveys. It will be convenient to have all these common checks in a consistent set of functions, and the messaging in the console is very nice. 🙌

Documentation

The package includes all the following forms of documentation:

I find parts of the documentation a bit confusing, especially when things go wrong for users. For example, if I have a dataset with no IP addresses or location information and I try to run exclude_duplicates(), I get this error:

Error in check_duplicates(x, quiet = TRUE, ...) : 
The column specifying location ('location_col') was not found.

However, when I look at the documentation for the function I used, I don't see anything about location_col, and it's not clear which link I should click to find the relevant info. Some options might be to change the documentation for the dots or to write a clearer error message.
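One way a clearer error message could look is sketched below. The helper `assert_column()` and its arguments are hypothetical and not part of excluder; the sketch only assumes that the error should name the argument, the missing column, and the help topic where that argument is documented.

```r
# Hypothetical helper, not part of excluder: fail early with a message that
# names the argument, the missing column, and where to find documentation.
assert_column <- function(x, col, arg, doc_topic) {
  if (!col %in% names(x)) {
    stop(
      "Column '", col, "' (argument `", arg, "`) was not found in the data. ",
      "See ?", doc_topic, " for how to set `", arg, "`.",
      call. = FALSE
    )
  }
  invisible(x)
}

# Made-up data without any location columns.
survey <- data.frame(Q1 = c(3, 5))

# Capture the message instead of stopping, to show what a user would read.
msg <- tryCatch(
  assert_column(survey, "LocationLatitude", "location_col", "check_duplicates"),
  error = function(e) conditionMessage(e)
)
msg
```

Pointing to the help topic of the underlying function tells a user of a wrapper where the argument passed through the dots is actually documented.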

Functionality

Estimated hours spent reviewing: 2


Review Comments

🎯 The function name collapse_exclusions() reminds me a bit too much of dplyr::collapse() when what it does is much more like tidyr::unite(). What about a name like unite_exclusions()? Along those lines, did you consider names like filter_*() instead of exclude_*()? A benefit to using names that align with other names in the ecosystem is that these functions then slot right in to people's mental model of what they are doing.
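For readers less familiar with `tidyr::unite()`, here is a minimal sketch of the behavior the comparison refers to. The data frame and its `exclusion_*` column names are assumptions for illustration, not the package's actual output.

```r
library(tidyr)

# Hypothetical marked data: one column per exclusion type, NA when not flagged.
marked <- data.frame(
  id = 1:3,
  exclusion_duration = c("duration", NA, "duration"),
  exclusion_ip = c(NA, "ip", "ip"),
  stringsAsFactors = FALSE
)

# unite() pastes the selected columns into one new column; na.rm = TRUE drops
# NA values before pasting, so only the flagged exclusion types remain.
united <- unite(
  marked,
  col = "exclusions",
  exclusion_duration, exclusion_ip,
  sep = ",",
  na.rm = TRUE
)
united$exclusions
```

By default `unite()` also removes the input columns, so the result here has only `id` and `exclusions`.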

🎯 None of my real surveys have IP addresses so I was not able to check the iptools integration beyond the example survey data included in the package.

🎯 In the future, if you wanted to increase the polish of the console messages (one of the nicest features of this package that I think will draw folks to use it), you might check out using cli.
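As a rough idea of what cli-based messaging could look like, here is a small sketch. The function `report_flagged()` and the message wording are hypothetical; only `cli::cli_alert_info()` and its inline pluralization markup come from the cli package.

```r
library(cli)

# Hypothetical reporter: summarize how many rows a check flagged.
report_flagged <- function(n_flagged, n_total, what) {
  # `{?s}` pluralizes "row" based on the preceding interpolated quantity.
  cli_alert_info("{n_flagged} row{?s} out of {n_total} flagged for {what}.")
  invisible(n_flagged)
}

report_flagged(3, 100, "duplicate IP addresses")
```

Because cli emits these as standard R messages, they can still be silenced with `suppressMessages()` where quiet output is needed.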

maurolepore commented 3 years ago

@juliasilge thanks for your review. @jmobrien when do you think you'll be able to submit yours?

jmobrien commented 3 years ago

I will have it in by later today, thanks.


jmobrien commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 7


Review Comments

Overall

I quite like this package. It fills a niche for Qualtrics data cleaning that is difficult to automate and/or make auditable/reproducible, given how Qualtrics designed server-side data filters.

As the author notes, in human participants research it's also important to not unduly exclude data, and the manual review this helps facilitate is supportive of that ethical imperative. So, generally, I think, the paradigm of excluder makes a lot of sense--a simple interface for early-stage review and removal of problem cases.

The package seemed to work as expected. None of the data I had available (apparently) had any non-US participants, so I also was not able to test IP-checking as thoroughly as might be warranted, but it seems like a sound implementation.

A few notes

A few things might warrant either review or improvement. Nothing is terribly serious or strictly essential to fix right now, though with one issue (centering everything around the check_*() functions), I do wonder whether it might make growing/maintaining this package harder in the long run.

maurolepore commented 3 years ago

@jmobrien thanks a lot for your review.

@JeffreyRStevens, you can now go ahead and address the reviewers' comments. rOpenSci's guidelines do not seem to limit how long you should take, but I think it's best to respond ASAP -- while the reviewers keep the context in their heads.

Please let me know if you have questions.

JeffreyRStevens commented 3 years ago

Yes, many thanks @juliasilge and @jmobrien for your helpful reviews! I've done a quick skim and will review in-depth very soon. I may have a few questions before I get too far into revisions.

JeffreyRStevens commented 3 years ago

@juliasilge, many thanks again for your helpful comments. Here are some responses/questions for your points:

I'll have some questions for @jmobrien in the coming days.

juliasilge commented 3 years ago

I was really happy to see @jmobrien's thorough review and I agree that working through some of his suggestions will address the problems with the documentation and error messages.

I'm not sure what we really mean by a difference between "retain" vs. "throw away" (just reverse the condition, which in the case of, say, IP addresses in/out of the US doesn't seem like a super one-way thing) but I think I see your point. I guess my opinion is that creating three new verbs for operations that are arguably the same as things many people already use makes this toolkit less easy to use than if they were given names that allude to the ways in which they are the same as other operations. Another option may be to use mark_ + filter_ + exclude_?

jmobrien commented 3 years ago

I also will say that I generally agreed with all that @juliasilge said in her review and thought it was on point for usefulness. Where I didn't echo her sentiments directly, it was mostly for brevity because I was the second to submit an already-long review. I especially agree that the console reporting is a big sell here.

Here, @JeffreyRStevens, I'm understanding you as saying exclude_duration([longer than 30 mins]) would remove all >30 minute cases, while filter([duration > 30 mins]) would keep those same cases and discard others. So, at least at default, the behavior of filter is more conceptually similar to check than exclude. That right? If so, I see where you're coming from also.
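The distinction can be made concrete with plain dplyr. This is a sketch on made-up data; the column name `duration_min` and the 30-minute cutoff are assumptions taken from the example in the paragraph above.

```r
library(dplyr)

# Made-up survey durations in minutes.
survey <- data.frame(id = 1:4, duration_min = c(5, 45, 12, 90))

# filter() KEEPS the rows where the condition is TRUE ...
long_cases <- filter(survey, duration_min > 30)

# ... whereas an exclusion REMOVES those rows, i.e. keeps the negation.
after_exclusion <- filter(survey, !(duration_min > 30))

long_cases$id
after_exclusion$id
```

So with the same condition, `filter()` returns the cases a check would show for review, and the negated filter returns what remains after excluding them.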

From a more data-centric perspective on the potential userbase, I agree w/where Julia's coming from. The package does things very close to core tidyverse operations for data wrangling. There's something to be said for capitalizing on what those names invoke.

With my social scientist hat on, though, I feel much more positively about the existing names. This whole package really just seems like a batch of convenience functions that abstract lower-level data wrangling operations into key tasks for ethically managing human-participants data. As tasks, I think the names make sense: manually "check" problems first, then "exclude" them prior to analysis.

I don't have a strong opinion about which way naming should go. Though FWIW, I suppose similar ideas are embedded in my speculation whether functionality should center on mark. With that workflow, the check and exclude verbs could keep their core purpose as convenience functions called downstream of mark (give me all the marked stuff for review vs. get rid of it). But if you find yourself needing something more complex, you can just do a standard call to filter instead.

(Or, if you're feeling really ambitious, maybe you could make everyone happy by adding something like filter_exclusions() that could emulate the typical filter syntax for more complex cases where the convenience functions fall short? So, e.g., filter_exclusions(!ip, duration) could mean "give me all the cases that were NOT marked as out of the acceptable IP address space, but which were still too long", etc. Could maybe also work around whether unite_exclusions() has been called yet or not.)
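To make the filter_exclusions() idea concrete, here is a minimal base R sketch. It is purely hypothetical: filter_exclusions() is not part of excluder, and the sketch assumes the mark step has added logical columns named exclusion_<criterion>, which may not match how the package actually stores its marks.

```r
# Hypothetical helper, not part of excluder.
# Assumption: mark_*() functions added logical columns named
# exclusion_<criterion> (e.g., exclusion_ip, exclusion_duration).
filter_exclusions <- function(data, ...) {
  # Capture the conditions unevaluated, e.g. `!ip` and `duration`
  conditions <- eval(substitute(alist(...)))

  # Expose the marker columns under their short names (ip, duration, ...)
  flags <- data[grepl("^exclusion_", names(data))]
  names(flags) <- sub("^exclusion_", "", names(flags))

  # Keep only rows for which every condition is TRUE
  keep <- rep(TRUE, nrow(data))
  for (cond in conditions) {
    keep <- keep & as.logical(eval(cond, flags, parent.frame()))
  }
  data[keep, , drop = FALSE]
}

# Toy data standing in for the output of the mark step
marked <- data.frame(
  id = 1:4,
  exclusion_ip = c(FALSE, TRUE, FALSE, TRUE),
  exclusion_duration = c(TRUE, TRUE, FALSE, FALSE)
)

# Rows NOT marked for IP address but marked for duration
result <- filter_exclusions(marked, !ip, duration)
```

A real implementation would probably use tidy evaluation from rlang rather than base R's substitute(), but the idea is the same: conditions refer to marks rather than to the raw data.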

juliasilge commented 3 years ago

I don't know that I'd recommend adding another set of functions to maintain here. And for the record, my opinion on the naming is not super strong either; I'm raising this mostly because this is something I've run into before and in the past I have always regretted when I have not used what already exists in terms of vocabulary, norms, idioms, function names, etc. and then later realized I could have.

JeffreyRStevens commented 3 years ago

Thanks you two for your thoughts on this. Yes, my thought was that filter_, mark_, and exclude_ would more closely match the way filter() works (that is, it retains rows satisfying the condition). But I also share @jmobrien's point that check_ might make more sense from a general user perspective even though filter_ makes sense from a tidyverse user/developer perspective. I'll grapple with this in the coming days. I really like the idea of centering the functions around mark_ rather than check_. @jmobrien, just to clarify, you were also suggesting removing all of the check_ and exclude_ functions and instead only have two functions: check and exclude, correct? So basically, you would always run the mark_ functions to find the exclusions, then pipe that to check to view/store the exclusions or pipe to exclude to remove the exclusions. Is this what you were suggesting? This sounds much easier to maintain and probably to use as well.

jmobrien commented 3 years ago

That was one option, yes. I guess in that simplifying approach check and exclude would need to recognize which parts the mark functions had looked at previously. You could set that up a number of different ways.

You could also preserve the existing check/exclude function sets if you prefer a more standalone design where each verb can operate on fresh data directly. Even then, as outlined above I think there are several benefits to having check and exclude call mark internally, rather than mark and exclude calling check.

There's certainly some draw to option 1. But both options have unique advantages, and I don't want to dictate your design choices.

Option 1's a more major change, though, so if you're leaning that way I might get others' opinions first.
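As a rough illustration of the second design (check and exclude each calling mark internally, so every verb still works on fresh data), consider the sketch below. The function names, the exclusion_long column, and the duration threshold are assumptions made up for this example; they are not the excluder API.

```r
# Hypothetical sketch; names are illustrative, not the excluder API.

# mark: return the full data set with a logical marker column added
mark_long <- function(data, max_seconds) {
  data$exclusion_long <- data$duration > max_seconds
  data
}

# check: call mark internally, then return only the marked rows
check_long <- function(data, max_seconds) {
  marked <- mark_long(data, max_seconds)
  flagged <- marked[marked$exclusion_long, , drop = FALSE]
  flagged$exclusion_long <- NULL
  flagged
}

# exclude: call mark internally, then drop the marked rows
exclude_long <- function(data, max_seconds) {
  marked <- mark_long(data, max_seconds)
  kept <- marked[!marked$exclusion_long, , drop = FALSE]
  kept$exclusion_long <- NULL
  kept
}

responses <- data.frame(id = 1:3, duration = c(120, 2400, 600))
checked <- check_long(responses, max_seconds = 1800)
remaining <- exclude_long(responses, max_seconds = 1800)
```

In this arrangement the criterion logic lives in one place (the mark function), and check and exclude differ only in which side of the marker they keep. The sketch ignores missing durations, which a real implementation would need to handle.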

JeffreyRStevens commented 3 years ago

I'm close to finishing up revisions. I have a couple more changes to make and had two questions for @jmobrien.

  1. When you referred to avoiding multiple passes on the data, did you mean within a function or are you concerned about running multiple mark_*() functions?
  2. When you referred to using ellipses to pass arguments to (now) check_*() and exclude_*() functions, you suggested explicitly including some arguments. Can you point me to how to do this? My efforts to do this have not worked out.

jmobrien commented 3 years ago

@JeffreyRStevens, responses below:

  1. I think the extra joins in the current design might slow things down, but I don't know by how much. Having just unified check and exclude functions would in theory be slightly more efficient than independent ones because, e.g., the IP address checking code would only be called once, at the start, when the dataset is marked. My guess is that this isn't a big deal in most use cases, though, and probably matters even less than I had thought, even with big datasets. So it's a point of optimization, but not a big priority.

  2. Can you clarify what's giving you trouble? In many cases it can be handled straightforwardly. Here's an example from qualtRics, where fetch_survey() has arguments that get passed directly to read_survey(), using the same argument names in both functions: https://github.com/ropensci/qualtRics/blob/master/R/fetch_survey.R
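The general pattern from point 2 can be sketched in a few lines of R: the wrapper names the arguments users are most likely to set, forwards them by name, and lets everything else travel through the ellipsis. The signatures below (including the duration_col argument) are assumptions for illustration only and are not copied from excluder or qualtRics.

```r
# Hypothetical signatures for illustration only.
mark_duration <- function(data, min_duration = 10, max_duration = NULL,
                          duration_col = "duration") {
  too_quick <- data[[duration_col]] < min_duration
  too_slow <- if (is.null(max_duration)) {
    FALSE
  } else {
    data[[duration_col]] > max_duration
  }
  data$exclusion_duration <- too_quick | too_slow
  data
}

# The wrapper spells out the key arguments (so they appear in its help
# page and autocomplete) and passes anything else along via `...`
exclude_duration <- function(data, min_duration = 10, max_duration = NULL, ...) {
  marked <- mark_duration(
    data,
    min_duration = min_duration,
    max_duration = max_duration,
    ...
  )
  # Return the unmarked rows with only the original columns
  marked[!marked$exclusion_duration, names(data), drop = FALSE]
}

survey <- data.frame(id = 1:3, seconds = c(5, 300, 4000))

# `duration_col` is not an argument of exclude_duration(); it reaches
# mark_duration() through the ellipsis
kept <- exclude_duration(survey, max_duration = 1800, duration_col = "seconds")
```

Using the same argument names in both functions keeps the forwarding call readable and means the documentation for the shared arguments can be written once and inherited.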

JeffreyRStevens commented 3 years ago

Thanks, @jmobrien. I figured out (2) with your examples. I expect to have everything wrapped up in the next week.

JeffreyRStevens commented 3 years ago

OK, I think that I have addressed all of the concerns. Please let me know if I missed something or you have additional questions. Note that I have increased the version to 0.3.0 since I have made major changes to the code and deprecated a function.

Responses to @juliasilge.

Responses to @jmobrien

maurolepore commented 3 years ago

Thanks everyone! It's been great to watch this review.

Dear reviewers @juliasilge and @jmobrien,

At this stage, we ask that you respond as to whether the changes sufficiently address any issues raised in your review. We encourage ongoing discussion between package authors and reviewers, and you may ask editors to clarify issues in the review thread as well. -- https://devguide.ropensci.org/reviewerguide.html#followupreviewer

jmobrien commented 3 years ago

Generally, I think this looks quite good. I just want to do a bit more due diligence on the recentering to mark before giving final approval, since that was the most major suggestion from me.

I did happen to look at the *_ip() family of functions about a week ago (following the reorganization, just before @JeffreyRStevens's most recent post). One thing I noticed was that the transition towards mark didn't end up streamlining the internals of data processing in quite the way I expected.

BUT--I then fiddled with it briefly myself, and started to think my expectations were mis-calibrated. In the *_ip() functions, for instance, the iptools functions being used have some unforgiving aspects that needed designing around. I didn't previously appreciate that issue, and it's changed my perspective quite a bit.

So, I still want to complete a full look-through, but I'm feeling pretty optimistic that everything should be fine.

juliasilge commented 3 years ago

print_exclusion <- function(remaining_data, x, msg) {
  n_remaining <- nrow(remaining_data)
  n_exclusions <- nrow(x) - n_remaining
  message(
    n_exclusions, " out of ", nrow(x),
    msg, n_remaining, " rows."
  )
}

You might want to make something like that for all the invisible returning too.
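A similar helper for the invisible returning might look like the sketch below. The names return_data() and exclude_demo() are hypothetical and chosen only for this example; the point is that the decision to return visibly or invisibly lives in one function that every verb can end with.

```r
# Hypothetical helper; the name is an assumption for this sketch.
# Returns `x` visibly when print is TRUE, otherwise invisibly, so the
# calling verb does not have to repeat the if/else.
return_data <- function(x, print = TRUE) {
  if (isTRUE(print)) x else invisible(x)
}

# A toy verb that ends with the helper
exclude_demo <- function(data, keep, print = TRUE) {
  remaining <- data[keep, , drop = FALSE]
  message(
    nrow(data) - nrow(remaining), " out of ", nrow(data),
    " rows were excluded, leaving ", nrow(remaining), " rows."
  )
  return_data(remaining, print)
}

df <- data.frame(id = 1:3)
keep <- c(TRUE, FALSE, TRUE)

# withVisible() reports whether a value would auto-print at the console
shown <- withVisible(exclude_demo(df, keep))
hidden <- withVisible(exclude_demo(df, keep, print = FALSE))
```

Because visibility is carried through the last call in a function, the verb only needs to finish with return_data() for the print argument to take effect.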

JeffreyRStevens commented 3 years ago

@juliasilge, thank you for the follow-up. That is a great idea to start building more utility functions to reduce repeated code. I'm now working on both the print_exclusion() function and the invisible returning (and will look for other opportunities to replace repeated code with functions). As a side note, I started using the {cli} package to retool the print exclusion messages, and I really like it. So I'm going to go ahead and work at revamping all of the messages.

Yes, I switched the default for the exclude_*() functions to be print = FALSE in a recent update. My thinking was that you probably do want to actually see the 'problematic' rows when using the check_*() functions because you are checking. You don't need to see the rows when using the mark_*() functions because it returns the whole data set with columns added to the end (which would be difficult to see when printed to the console). When excluding, you likely don't need to see the remaining rows because (1) they are not problematic and (2) likely the vast majority of the original data will not be excluded, so there will be a lot of rows that would be printed to the console. But if you think that it makes more sense to return to printing the output of the exclude_*() functions, I can do that---especially if you were expecting to see the output.

maurolepore commented 3 years ago

@JeffreyRStevens, have you seen this?:

Ellipses:

When to print the return value or return the first argument invisibly:

JeffreyRStevens commented 3 years ago

@juliasilge I have now created four new utility functions (in utils.R) that clean up code in the verb functions.

I have also switched all messages to the cli::cli_alert_info() syntax (though error messages continue to use stop()).
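For readers unfamiliar with {cli}, a message helper along these lines could be sketched as follows. The helper names and message wording are assumptions for illustration and are not taken from utils.R in excluder; the only {cli} call used is cli::cli_alert_info(), with a fallback to message() so the sketch also runs where {cli} is not installed.

```r
# Hypothetical helpers; names and wording are illustrative only.

# Build the message text separately so it can be tested on its own
exclusion_text <- function(n_exclusions, n_total, reason) {
  paste0(n_exclusions, " out of ", n_total, " rows ", reason, ".")
}

# Emit the message through {cli} when available, else base message()
alert_exclusion <- function(n_exclusions, n_total, reason) {
  text <- exclusion_text(n_exclusions, n_total, reason)
  if (requireNamespace("cli", quietly = TRUE)) {
    cli::cli_alert_info(text)
  } else {
    message(text)
  }
  invisible(text)
}

text <- alert_exclusion(2, 100, "had duplicate IP addresses")
```

Note that cli::cli_alert_info() interpolates expressions inside curly braces, so text built from user input that may contain braces would need escaping before being passed through.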

I have made print = TRUE the default for all exclude_*() functions.

juliasilge commented 3 years ago

JeffreyRStevens commented 3 years ago

@juliasilge

@jmobrien, will you be able to wrap up your review soon?

JeffreyRStevens commented 3 years ago

@maurolepore, what are the next steps now? FYI, I'm working on a paper for JOSS. I'm not sure if that should be part of the review here or just over at JOSS.

maurolepore commented 3 years ago

@JeffreyRStevens,

I'm watching your discussion with the reviewers. Do you feel the ball is no longer in your court? If so I would ask the reviewers once again whether your changes sufficiently address any issues they raised in the review. Please confirm and I'll write a new comment with a @mention to them.

We no longer review submissions to JOSS, but do mention the review -- I think you might get fast-tracking.

JeffreyRStevens commented 3 years ago

@maurolepore, yes, I think that I've addressed the reviewers' concerns, and I'm waiting for their final approval. Thanks.

maurolepore commented 3 years ago

Dear reviewers,

@jmobrien, do you have an estimate about when you might be able to confirm whether the changes made are sufficient to approve this package?

@juliasilge, it seems your last concern (return visibly) has been addressed. Is there anything else you'd like to request or do you confirm the changes made are sufficient to approve this package?

jmobrien commented 3 years ago

Mauro, and Jeffrey, apologies--I've had a remarkable variety of unexpected home and family issues come up over just this past week. I'm about to go handle what I hope is the last of them, but I'll make it a priority to finish everything tomorrow. Thanks for your patience.

juliasilge commented 3 years ago

For me, the changes have addressed the issues I was interested in, and I approve it moving forward. ✅