Closed JeffreyRStevens closed 3 years ago
@ropensci-review-bot check package
Thanks, about to send the query.
git hash: 1d8446c9
Important: All failing checks above must be addressed prior to proceeding
Package License: GPL (>= 3)
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 35 files)
- 1 author
- 1 vignette
- 3 internal data files
- 8 imported packages
- 54 exported functions (median 16 lines of code)
- no non-exported functions in R (median 20 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    35|       91.1|           |
|files_vignettes          |     1|       64.8|           |
|files_tests              |     6|       81.5|           |
|loc_R                    |   607|       50.7|           |
|loc_vignettes            |   118|       57.1|           |
|loc_tests                |   252|       57.1|           |
|num_vignettes            |     1|       60.7|           |
|data_size_total          | 18483|       74.3|           |
|data_size_median         |  6139|       75.9|           |
|n_fns_r                  |    54|       50.8|           |
|n_fns_r_exported         |    54|       88.4|           |
|n_fns_r_not_exported     |     0|        0.0|TRUE       |
|n_fns_per_file_r         |     1|        0.0|TRUE       |
|num_params_per_fn        |     4|       54.3|           |
|loc_per_fn_r             |    18|       69.6|           |
|loc_per_fn_r_exp         |    16|       39.6|           |
|loc_per_fn_r_not_exp     |    20|       77.1|           |
|rel_whitespace_R         |    19|       53.9|           |
|rel_whitespace_vignettes |    34|       75.0|           |
|rel_whitespace_tests     |    15|       78.7|           |
|doclines_per_fn_exp      |    48|       61.4|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |    40|       57.0|           |

---
Interactive network visualisation of calls between objects in package can be viewed by clicking here
### `goodpractice` and other checks

### 3a. Continuous Integration Badges

[![github](https://github.com/jeffreyrstevens/excluder/workflows/R-CMD-check/badge.svg)](https://github.com/jeffreyrstevens/excluder/actions)

**GitHub Workflow Results**

|name          |conclusion |sha    |date       |
|:-------------|:----------|:------|:----------|
|pkgdown       |success    |1d8446 |2021-07-26 |
|R-CMD-check   |success    |1d8446 |2021-07-26 |
|test-coverage |success    |1d8446 |2021-07-26 |

---

### 3b. `goodpractice` results

### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking Rd cross-references ... NOTE
   Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’

### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 81.79

### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function has cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
check_duplicates | 19

### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 201 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 201
|package  |version   |
|:--------|:---------|
|pkgstats |0.0.0.265 |
|pkgcheck |0.0.1.367 |
This package may be submitted
The only check that failed was "Package does not have a 'contributing.md' file". Does this mean that the package should not have a contributing.md file? So should I just remove the file and remove all references to Contributing to this package?
Thanks for the submission @JeffreyRStevens! Sorry for a bit of a delay as we were working out our new automated diagnostics bot, which your review is the first to use, and you just found a bug in! Your CONTRIBUTING.md file is fine; we are just failing to check the right subfolder. We'll move ahead with this :)
Ah, OK--no problem. Glad to be a guinea pig to help debug!
@JeffreyRStevens, it's my pleasure to be the handling editor of your submission.
Congratulations! The bot and I are very happy to see the package meets rOpenSci's guidelines :+1:. I'll start looking for reviewers.
Note the following minor issues you might want to consider before the review:
`devtools::check()` shows:

```
❯ checking Rd cross-references ... NOTE
  Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’
```
`check_duplicates()` seems to have relatively complex logic. It may be justified by the goal of the function (to check for various conditions), but it might be worth ensuring there is no more logic than necessary.
Some `if()` statements seem fragile, as they could receive inputs of length greater than 1. Consider the argument `dupl_ip` to `check_duplicates()`:
```r
# Good
dupl_ip <- TRUE
if (identical(dupl_ip, TRUE)) {
  message("Do something.")
}
#> Do something.

# Good
dupl_ip <- TRUE
stopifnot(length(dupl_ip) == 1L)
if (dupl_ip) {
  message("Do something.")
}
#> Do something.

# Fragile
dupl_ip <- c(TRUE, FALSE) # This might accidentally have length > 1
if (dupl_ip == TRUE) {
  message("Do something.")
}
#> Warning in if (dupl_ip == TRUE) {: the condition has length > 1 and only the
#> first element will be used
#> Do something.
```
Created on 2021-08-07 by the reprex package (v2.0.0)
Some lines are longer than 80 characters. Wrapping at 80 characters makes the code easier to review, without having to scroll right when navigating the script.
I see two methods used to avoid the R CMD check NOTE about undefined global variables: `var <- NULL` and `.data$var`. Maybe best to use only one method?
In R/ I see a subdirectory (R/deprecated). I find this strange. How is it used?
`devtools::test()` shows:

```
check-mark-exclude:
Note: Using an external vector in selections is ambiguous.
i Use `all_of(location_col)` instead of `location_col` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
```
Consider using `usethis::use_r()` and `usethis::use_test()`, and structuring tests so that for each exported `fun()` in R/fun.R you have a test in tests/testthat/test-fun.R.

Reviewers:
Due date:
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/455_status.svg)](https://github.com/ropensci/software-review/issues/455)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
@JeffreyRStevens, could you suggest two or three potential reviewers? Although I wouldn't pick more than one, your list will inform the type of expertise you think would be useful when reviewing {excluder}.
I'll also use other criteria as described in How to look for reviewers.
@maurolepore thank you for your speedy initial review of {excluder}. I have pushed some changes to address your comments.
- Regarding `devtools::check()`: I am not able to replicate your error regarding `Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’`. Would you mind checking if this is still an issue in the new version? If so, does this mean that you need to have those packages installed on your local machine to pass the check?
- Regarding the logic in `check_duplicates()`: I'm open to ideas from you or the reviewers, though.
- Regarding the `if()` statements: are you suggesting that I check whether the `dupl_id` argument is identical to TRUE or of length 1? And, if so, I assume you'd like me to look at other `if()` statements to see if the same issue applies? Just want to clarify the action items for this issue.
- In `remove_label_rows.R` lines 53-59, I broke this over multiple lines to minimize characters > 80, but it resulted in several lines slightly longer than 80 characters rather than just one line much longer. Which is preferred?
- I used `exclusions <- NULL` in `collapse_exclusions.R` because I can't figure out how to use `.data` in a way that avoids the R CMD note. If I use `.data$exclusions` to replace `exclusions` in the `unite()` function (line 55) or the column definition in the `mutate()` function (line 60), I receive a compiling error when running the check. The current version has removed the `exclusions <- NULL`, which results in the R CMD note. One option would be to reinsert the `NULL` definition and remove all `.data` from this particular function. But would I need to replace all `.data` with `NULL` definitions in all functions, or just this one?
- Regarding the `.gitignore` file: it has now been removed.
- I have fixed the `all_of(location_col)` warning in the test.

Thank you again for your careful review, and let me know if I missed something.
Thanks @JeffreyRStevens for responding quickly.
- Regarding `devtools::check()`: here is a reprex -- collapsed to save space.
- Inside `if()` I suggest you use `identical()`, not `==`, because `if()` only takes objects of length 1. `identical()` always returns an object of length 1, but the output of `==` can be longer. If you do use `==` inside `if()`, then be sure to first assert the input is indeed of length 1 (e.g., with `stopifnot()`).
When styling your code I suggest you stick to lines of 80 characters or less. The goal is to be friendly to the reviewers, who may adjust their screens to show the typical 80 characters. Longer lines force them to scroll right, which sounds like a minor issue but can get annoying after a while. The image below shows how the first line of code overflows my 80-character marker. Under that, I have the same code styled in different ways -- all fit in my screen.
In `?tidyr::unite()` I see the argument `col` is "The name of the new column, as a string or symbol". I think symbols (e.g., `exclusions`) are allowed for backward compatibility, and I believe the tidyverse now recommends reserving symbols for existing columns and strings (e.g., `"exclusions"`) for new columns. Note the examples section of `unite()`'s helpfile uses strings (`"z"` and `"xy"`) -- not symbols (`z` or `xy`). I tried `"exclusions"` and successfully removed the NOTE. Line 60 looks good to me.

Thanks for your work and suggestions for reviewers.
Many thanks for the clarifications and help, @maurolepore.
I have changed `==` operators to `identical()` in `if()` statements that use user-inputted arguments. I have added `stopifnot()` statements when other operators (e.g., `%in%`) are used in `if()` statements with user-inputted arguments.

@ropensci-review-bot add @juliasilge to reviewers
@juliasilge added to the reviewers list. Review due date is 2021-09-20. Thanks @juliasilge for accepting to review! Please refer to our reviewer guide.
@juliasilge: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot add @jmobrien to reviewers
@jmobrien added to the reviewers list. Review due date is 2021-09-20. Thanks @jmobrien for accepting to review! Please refer to our reviewer guide.
@jmobrien: If you haven't done so, please fill this form for us to update our reviewers records.
@JeffreyRStevens, I'm thrilled that @juliasilge and @jmobrien accepted to review the excluder package.
Note @ropensci-review-bot set the due date following rOpenSci's guidelines to 2021-09-20. However, these are difficult times, and I would like to allow reviewers 1 extra week. Feel free to submit your review whenever ready, but if you don't, I'll touch base on 2021-09-27.
I look forward to working with you all.
Congratulations to the author on this useful package for folks handling Qualtrics surveys. It will be convenient to have all these common checks in a consistent set of functions, and the messaging in the console is very nice. 🙌
The package includes all the following forms of documentation:
- [X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

On documentation, I find the way some of the documentation works a bit confusing, especially when things go wrong for users. For example, if I have a dataset with no IP addresses or location information and I try to run `exclude_duplicates()`, I get this error:

```
Error in check_duplicates(x, quiet = TRUE, ...) :
  The column specifying location ('location_col') was not found.
```

However, when I look at the documentation for the function I used, I don't see anything about `location_col`, and it's not clear which link I should click to find the relevant info. Some options might be to change the documentation for the dots or write a clearer error message.
Estimated hours spent reviewing: 2
🎯 The function name `collapse_exclusions()` reminds me a bit too much of `dplyr::collapse()` when what it does is much more like `tidyr::unite()`. What about a name like `unite_exclusions()`? Along those lines, did you consider names like `filter_*()` instead of `exclude_*()`? A benefit to using names that align with other names in the ecosystem is that these functions then slot right into people's mental model of what they are doing.
🎯 None of my real surveys have IP addresses so I was not able to check the iptools integration beyond the example survey data included in the package.
🎯 In the future, if you wanted to increase the polish of the console messages (one of the nicest features of this package, and one that I think will draw folks to use it), you might check out using cli.
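For a small taste of what cli offers (a sketch; the message text and count are made up):

```r
library(cli)

n_excluded <- 3
# cli handles inline interpolation and pluralization ({?s}) automatically
cli_alert_info("Excluded {n_excluded} row{?s} based on survey duration.")
```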
@juliasilge thanks for your review. @jmobrien when do you think you'll be able to submit yours?
I will have it in by later today, thanks.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions (but see below)
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
Estimated hours spent reviewing: 7
I quite like this package. It fills a niche for Qualtrics data cleaning that is difficult to automate and/or make auditable/reproducible, given how Qualtrics designed server-side data filters.
As the author notes, in human participants research it's also important to not unduly exclude data, and the manual review this helps facilitate is supportive of that ethical imperative. So, generally, I think the paradigm of excluder makes a lot of sense--a simple interface for early-stage review and removal of problem cases.
The package seemed to work as expected. None of the data I had available (apparently) had any non-US participants, so I also was not able to test IP-checking as thoroughly as might be warranted, but it seems like a sound implementation.
A few things might warrant either review or improvement. Nothing terribly serious or strictly essential to fix right now--though with one issue (centering everything around the `check_*()` functions), I do wonder whether it might make growing/maintaining this package harder in the long run.
Style & dependencies
- `dplyr::bind_cols()` is mixed with `rbind()`.
- `dplyr::pull()` is regularly used for indexing that could be handled by `[[]]`.
- Some function calls reference function re-exports rather than the originals, e.g., `dplyr::all_of()` instead of `tidyselect::all_of()`.

Interaction between `check`, `mark`, and `exclude` verbs internally & in workflow:
- Currently, most of the work happens in the `check_*()` functions. The `mark_*()` functions pass everything to `check_*()`, extract a specified ID variable from that, add a marker text variable, then merge this back with `left_join()`. The `exclude_*()` functions pass everything to `check_*()`, then use `anti_join()` on just the ID variable.
- The `mark_*()` functions end up with `NA` in non-marked rows of `exclusion_` variables. This makes manual filtering on these columns more difficult.
- `mark_duration()` can't easily mark the difference between a too-short and a too-long response, which might be a useful thing to know.
- `mark_*()` and `exclude_*()` use ellipses for arguments to be passed to `check_*()`, meaning you have to look up `check_*()` to get comprehensive help on how `mark_*()` and `exclude_*()` operate. (This specific issue could probably be fixed with a few explicit arguments in each function, even if everything else is left the same.)
- Have you considered centering things on the `mark_*()` functions, rather than `check_*()`? The `mark_*()` functions could cover all reporting. This could even enable some time-saving advantages from `summarize()`-like reporting (e.g., you could see at a glance if all short-duration cases were also those out of the US).
- The `exclusion_*` variables from `mark` could use booleans, 1/0, or even just `""`, for easier support of manual filtering if desired.
- `check_*()` and `exclude_*()` could be simplified: `check` and `exclude` could just call `mark` internally plus something like `dplyr::filter()`. `check` and `exclude` could even be reduced to just two general-purpose functions that keep/exclude based on any added `exclusion_` variables.

Few small other things:
- `check_ip()`, `exclude_ip()`, and `mark_ip()` are backwards--each retains/drops cases from outside the specified country, not from inside it as indicated in the documentation.
- rOpenSci suggests an `[object]_[verb]` structure, while this uses the reverse. Personally, I think the current naming is fine, though.

@jmobrien thanks a lot for your review.
@JeffreyRStevens, you can now go ahead and address reviewer's comments. rOpenSci's guidelines seem to not limit how long you should take, but I think it's best to respond ASAP -- while the reviewers keep the context in their heads.
Please let me know if you have questions.
Yes, many thanks @juliasilge and @jmobrien for your helpful reviews! I've done a quick skim and will review in-depth very soon. I may have a few questions before I get too far into revisions.
@juliasilge, many thanks again for your helpful comments. Here are some responses/questions for your points:
- The `exclude_duplicates()` error: I think I see the issue here, and I believe that addressing some of @jmobrien's concerns about the functions will address this. I'll return to this comment after I address his concerns.
- Renaming `collapse_exclusions()`: This was a great idea. I have now renamed this to `unite_exclusions()` and marked `collapse_exclusions()` as deprecated using the {lifecycle} package. I have now pushed this change.
- Renaming the `exclude_*()` functions to `filter_*()`: I really like this idea, too. But one question that I had is whether I should rather use `filter_*()` to replace the `check_*()` functions, since the tidyverse `filter()` function retains the specified rows rather than removes them. In this sense, it seems to be more equivalent to `check_*()` rather than `exclude_*()`, in that `check_*()` retains the rows of data that are being checked. Or maybe I'm thinking of it backwards. What are your thoughts on this?
- Using {cli}: I hadn't seen that before, and it looks like a very useful package for customizing messages. I'll definitely be figuring out ways to integrate that into the console messages.

I'll have some questions for @jmobrien in the coming days.
I was really happy to see @jmobrien's thorough review and I agree that working through some of his suggestions will address the problems with the documentation and error messages.
I'm not sure what we really mean by a difference between "retain" vs. "throw away" (just reverse the condition, which in the case of, say, IP addresses in/out of the US doesn't seem like a super one-way thing), but I think I see your point. I guess my opinion is that creating three new verbs for operations that are arguably the same as things many people already use makes this toolkit less easy to use than if they were given names that allude to the ways in which they are the same as other operations. Another option may be to use `mark_` + `filter_` + `exclude_`?
I also will say that I generally agreed with all that @juliasilge said in her review and thought it was on point for usefulness. Where I didn't echo her sentiments directly, it was mostly for brevity because I was the second to submit an already-long review. I especially agree that the console reporting is a big sell here.
Here, @JeffreyRStevens, I'm understanding you as saying `exclude_duration([longer than 30 mins])` would remove all >30 minute cases, while `filter([duration > 30 mins])` would keep those same cases and discard others. So, at least by default, the behavior of `filter` is more conceptually similar to `check` than `exclude`. That right? If so, I see where you're coming from also.
From a more data-centric perspective on the potential userbase, I agree w/where Julia's coming from. The package does things very close to core tidyverse operations for data wrangling. There's something to be said for capitalizing on what those names invoke.
With my social scientist hat on, though, I feel much more positively about the existing names. This whole package really just seems like a batch of convenience functions that abstract lower-level data wrangling operations into key tasks for ethically managing human-participants data. As tasks, I think the names make sense: manually "check" problems first, then "exclude" them prior to analysis.
I don't have a strong opinion about which way naming should go. Though FWIW, I suppose similar ideas are embedded in my speculation about whether functionality should center on `mark`. With that workflow, the `check` and `exclude` verbs could keep their core purpose as convenience functions called downstream of `mark` (give me all the marked stuff for review vs. get rid of it). But if you find yourself needing something more complex, you can just do a standard call to `filter` instead.
(Or, if you're feeling really ambitious, maybe you could make everyone happy by adding something like `filter_exclusions()` that could emulate the typical filter syntax for more complex cases where the convenience functions fall short? So, e.g., `filter_exclusions(!ip, duration)` could mean "give me all the cases that were NOT marked as out of the acceptable IP address space, but which were still too long", etc. Could maybe also work around whether `unite_exclusions()` has been called yet or not.)
I don't know that I'd recommend adding another set of functions to maintain here. And for the record, my opinion on the naming is not super strong either; I'm raising this mostly because this is something I've run into before and in the past I have always regretted when I have not used what already exists in terms of vocabulary, norms, idioms, function names, etc. and then later realized I could have.
Thank you two for your thoughts on this. Yes, my thought was that `filter_`, `mark_`, and `exclude_` would more closely match the way `filter()` works (that is, it retains rows satisfying the condition). But I also share @jmobrien's point that `check_` might make more sense from a general-user perspective, even though `filter_` makes sense from a tidyverse user/developer perspective. I'll grapple with this in the coming days.
I really like the idea of centering the functions around `mark_` rather than `check_`. @jmobrien, just to clarify, you were also suggesting removing all of the `check_` and `exclude_` functions and instead only having two functions: `check` and `exclude`, correct? So basically, you would always run the `mark_` functions to find the exclusions, then pipe that to `check` to view/store the exclusions or pipe to `exclude` to remove the exclusions. Is this what you were suggesting? This sounds much easier to maintain and probably to use as well.
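If it helps the discussion, here is a rough sketch of what those two general-purpose verbs might look like, assuming the `mark_*()` functions add columns prefixed with `exclusion_` (all function bodies here are hypothetical, not the package's actual implementation):

```r
library(dplyr)

# Hypothetical general-purpose verbs built on mark_*() output.
# Assumes mark_*() adds text columns named exclusion_* that are
# NA/empty for rows that passed.

# check(): keep only rows flagged by any exclusion_ column
check <- function(data) {
  filter(data, if_any(starts_with("exclusion_"), ~ !is.na(.x) & .x != ""))
}

# exclude(): drop all flagged rows
exclude <- function(data) {
  filter(data, if_all(starts_with("exclusion_"), ~ is.na(.x) | .x == ""))
}

# Hypothetical usage:
# survey %>% mark_duration(min_duration = 100) %>% mark_ip() %>% exclude()
```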
That was one option, yes. I guess in that simplifying approach `check` and `exclude` would need to recognize what parts the `mark` functions had looked at previously. You could set that up a number of different ways.

You could also preserve the existing `check`/`exclude` function sets if you prefer a more standalone design where each verb can operate on fresh data directly. Even then, as outlined above, I think there are several benefits to having `check` and `exclude` call `mark` internally, rather than `mark` and `exclude` calling `check`.
There's certainly some draw to option 1. But both options have unique advantages, and I don't want to dictate your design choices.
Option 1's a more major change, though, so if you're leaning that way I might get others' opinions first.
I'm close to finishing up revisions. I have a couple more changes to make and had two questions for @jmobrien.

1. Is there a performance benefit to centering everything on the `mark_*()` functions?
2. For the `check_*()` and `exclude_*()` functions, you suggested explicitly including some arguments. Can you point me to how to do this? My efforts to do this have not worked out.

@JeffreyRStevens, responses below:
I think the extra joins in the current design might slow things down, but I don't know how much. And having just unified `check` and `exclude` functions would in theory be slightly more efficient than independent ones because, e.g., the IP address checking code would only be called at the start when the dataset is marked. My guess is this isn't a big deal in most use cases, though, and probably even less than I was thinking it might be, even with big datasets. So, it's a point of optimization, but not a big priority.
Can you clarify what's giving you trouble? In many cases it can be handled straightforwardly. Here's an example from `qualtRics`, where `fetch_survey()` has arguments that get passed directly to `read_survey()`, using the same argument names in both functions: https://github.com/ropensci/qualtRics/blob/master/R/fetch_survey.R
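The general pattern is just to re-declare the same arguments in the wrapper and forward them by name, e.g. (a schematic sketch only; these argument names are illustrative, not excluder's actual signatures):

```r
# Inner function owns the behavior and the documented arguments
mark_duration <- function(x, min_duration = 10, quiet = FALSE) {
  # ... marking logic ...
  x
}

# Wrapper repeats the arguments explicitly instead of using `...`,
# so its own help page can document them directly
exclude_duration <- function(x, min_duration = 10, quiet = FALSE) {
  marked <- mark_duration(x, min_duration = min_duration, quiet = quiet)
  # ... filter out the marked rows ...
  marked
}
```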
Thanks, @jmobrien. I figured out (2) with your examples. I expect to have everything wrapped up in the next week.
OK, I think that I have addressed all of the concerns. Please let me know if I missed something or you have additional questions. Note that I have increased the version to 0.3.0 since I have made major changes to the code and deprecated a function.
- Documentation issue with `exclude_duplicates()`: This has been addressed by explicitly including arguments from `mark_*()`. Commit 1b74c4ce
- Renaming `collapse_exclusions()`: This was a great idea. I have now renamed this to `unite_exclusions()` and marked `collapse_exclusions()` as deprecated using the {lifecycle} package. Commit 68aae92b.
- Renaming the `exclude_*()` functions to `filter_*()`: I'm still torn on this, but I think that I will keep the names as originally named. In part, this is due to the confusion that we've had about whether `check` or `exclude` should be changed to `filter`. I now use filter for both types of actions. Maybe a compromise is to use the word 'filter' more in the documentation?
- Using {cli}: I hadn't seen that before, and it looks like a very useful package for customizing messages. I'll definitely be figuring out ways to integrate that into the console messages.
- Mixing `dplyr::bind_cols()` and `rbind()`: I have switched all `rbind()` uses to `dplyr::bind_cols`. Commit 89294932
- Using `dplyr::pull()` instead of `[[]]`: I have switched all `dplyr::pull()` uses to `[[]]`. Commit ec655f9b
- Referencing re-exports rather than originals: I have switched `dplyr::all_of()` and `dplyr::any_of()` to `tidyselect::all_of()` and `tidyselect::any_of()`. Are there ways to easily find the original package for all of your functions in a script? Commit face90d
- Centering things on the `mark_*()` functions: I have now done this. The `check_*()` and `exclude_*()` functions now just run the `mark_*()` functions and filter exclusions in or out. Commit 877e3b9
- `mark_duration()` does not differentiate between too long and too short: `mark_duration()` now marks quick and slow durations separately. Commit d45e44a6
- `mark_*()` and `exclude_*()` use ellipses for arguments to be passed to `check_*()`: This has been addressed by explicitly including arguments from `mark_*()`. Commit 1b74c4ce
- Reducing `check` and `exclude` to two general-purpose functions: I've decided to hold off on this for the moment. Instead, I have offered an example in the vignette on piping multiple `mark_*()` functions and then checking with a filter. Commit 12b4ad2b
- Using `&&` instead of `&`: I have now replaced `&` with `&&`. Commit b9aa1b04
- `[object]_[verb]` naming: While I saw that this was a preference for rOpenSci packages, it does not seem natural for this set of functions. `[verb]_[object]` both categorizes the verbs better and matches {tidyverse} syntax (e.g., `bind_rows()`/`bind_cols()`) as well as the {qualtRics} syntax (e.g., `fetch_survey()`).
- (`duplicates.R`). Commit 3c88d484

Thanks everyone! It's been great to watch this review.
Dear reviewers @juliasilge and @jmobrien,
At this stage, we ask that you respond as to whether the changes sufficiently address any issues raised in your review. We encourage ongoing discussion between package authors and reviewers, and you may ask editors to clarify issues in the review thread as well. -- https://devguide.ropensci.org/reviewerguide.html#followupreviewer
Generally, I think this looks quite good. I just want to do a bit more due diligence on the recentering to `mark` before giving final approval, since that was the most major suggestion from me.
I did happen to look at the `*_ip()` family of functions about a week ago (following the reorganization, just before @JeffreyRStevens's most recent post). One thing I noticed was that the transition towards `mark` didn't end up streamlining the internals of data processing in quite the way I expected.

BUT--I then fiddled with it briefly myself, and started to think my expectations were mis-calibrated. In the `*_ip()` functions, for instance, the `iptools` functions being used have some unforgiving aspects that needed designing around. I didn't previously appreciate that issue, and it's changed my perspective quite a bit.
So, I still want to complete a full look-though, but I'm feeling pretty optimistic that everything should be fine.
I now have a survey that has some IP addresses and the iptools integration looks good. 👍
Using `mark_*()` as a function to build off of looks like it worked really well, but now there is a fair amount of repeated code that you might want to abstract out into functions. For example, for `exclude_*()`, wrap something like this in `if (identical(silent, FALSE))`:
```r
print_exclusion <- function(remaining_data, x, msg) {
  n_remaining <- nrow(remaining_data)
  n_exclusions <- nrow(x) - n_remaining
  message(
    n_exclusions, " out of ", nrow(x),
    msg, n_remaining, " rows."
  )
}
```
You might want to make something like that for all the invisible returning too.
- I found myself surprised when the `exclude_*()` functions did not return their results visibly (e.g., calling `exclude_duration(survey_raw, min_duration = 100)` and thinking "where is my data?"). This was confusing the first time I reviewed too, so I probably should have mentioned it then 🙈, but I wonder if this should be reevaluated. A switch in default to `print = TRUE`?
?@juliasilge, thank you for the follow up. That is a great idea to start building more utility functions to reduce repeated code. I'm now working on both the print_exclusion()
function and the invisible returning (and will look for other opportunities to replace repeated code with functions). As a side note, I started using the {cli}
package to retool the print exclusion messages, and I really like it. So I'm going to go ahead and work at revamping all of the messages.
Yes, I switched the default for the `exclude_*()` functions to be `print = FALSE` in a recent update. My thinking was that you probably do want to actually see the 'problematic' rows when using the `check_*()` functions because you are checking. You don't need to see the rows when using the `mark_*()` functions because they return the whole data set with columns added to the end (which would be difficult to see when printed to the console). When excluding, you likely don't need to see the remaining rows because (1) they are not problematic and (2) likely the vast majority of the original data will not be excluded, so there would be a lot of rows printed to the console. But if you think that it makes more sense to return to printing the output of the `exclude_*()` functions, I can do that---especially if you were expecting to see the output.
@JeffreyRStevens, have you seen these?

- Ellipses
- When to print the return value or return the first argument invisibly
@juliasilge, I have now created four new utility functions (in `utils.R`) that clean up code in the verb functions:

- `validate_columns()` is used in the `mark_*()` functions to check the number, names, and types of input columns before data processing.
- `mark_rows()` is used in the `mark_*()` functions to create the new column that marks rows that meet exclusion criteria.
- `print_exclusions()` is used in the `exclude_*()` functions to print a message to the console with the number of excluded rows.
- `print_data()` is used in the `check_*()` and `exclude_*()` functions to print the data to the console.

I have also switched all messages to the `cli::cli_alert_info()` syntax (though error messages continue to use `stop()`).
I have made `print = TRUE` the default for all `exclude_*()` functions.
These new utility functions look great! 🙌 There is a lot less repeated code and I think this will make the package much more maintainable and easy to contribute to.
The new cli use turned out slick; I have had such a good experience using that package, and I hope that it works out as well for you.
I still am not sure about how the functions do or do not return invisibly. It is confusing and unexpected to me that they have different behaviors by default. Those links that Mauro shared are super good and important here; these are not functions that are called largely for their side effects (like writing a file to disk), but rather we directly want the function output (the data) to use in a pipeline, to summarize, and generally to compute on. As I tried to use all three function types, I was automatically looking for how many rows there were, how many columns there were, what the new columns created by `mark_*()` were called, etc. I do think all the functions should have the same default, and my opinion is that the right call is not to return invisibly.
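The visibility distinction under discussion can be seen with a toy sketch (generic functions, not from the package):

```r
# Two functions that both return their input unchanged; only the
# visibility of the return value differs.
return_visibly   <- function(x) x
return_invisibly <- function(x) invisible(x)

# Both behave identically under assignment or in a pipeline...
identical(return_visibly(mtcars), return_invisibly(mtcars))
# ...but calling return_invisibly(mtcars) at the console prints
# nothing, which is surprising when the data itself is the product.
```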
@juliasilge, the `mark_*()` functions now return not invisibly, so all three verbs should return explicitly by default.

@jmobrien, will you be able to wrap up your review soon?
@maurolepore, what are the next steps now? FYI, I'm working on a paper for JOSS. I'm not sure if that should be part of the review here or just over at JOSS.
@JeffreyRStevens,
I'm watching your discussion with the reviewers. Do you feel the ball is no longer in your court? If so, I would ask the reviewers once again whether your changes sufficiently address the issues they raised in the review. Please confirm and I'll write a new comment with an @mention to them.
We no longer review submissions to JOSS, but do mention the review -- I think you might get fast-tracking.
@maurolepore, yes, I think that I've addressed the reviewers' concerns, and I'm waiting for their final approval. Thanks.
Dear reviewers,
@jmobrien, do you have an estimate about when you might be able to confirm whether the changes made are sufficient to approve this package?
@juliasilge, it seems your last concern (return visibly) has been addressed. Is there anything else you'd like to request or do you confirm the changes made are sufficient to approve this package?
Mauro and Jeffrey, apologies--I've had a remarkable variety of unexpected home and family issues come up over just this past week. I'm about to go handle what I hope is the last of them, but I'll make it a priority to finish everything tomorrow. Thanks for your patience.
For me, the changes have addressed the issues I was interested in, and I approve it moving forward. ✅
Date accepted: 2021-11-04
Submitting Author Name: Jeffrey Stevens
Submitting Author Github Handle: @JeffreyRStevens
Repository: https://github.com/JeffreyRStevens/excluder
Version submitted: 0.2.2
Submission type: Standard
Editor: @maurolepore
Reviewers: @juliasilge, @jmobrien
Due date for @juliasilge: 2021-09-20
Due date for @jmobrien: 2021-09-20
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The package falls under data munging because it processes data from Qualtrics or other online sources by checking for, marking, and excluding rows of data frames for common exclusion criteria (e.g., IP addresses outside of the United States or duplicate entries from the same location/IP address).
The target audience is data scientists using Qualtrics or other online systems to collect data from participants (e.g., Mechanical Turk workers). Ensuring good data quality from these participants can be tricky. For instance, while Mechanical Turk in theory screens workers based on location (e.g., if you want to restrict your participant pool to workers in the United States), this is not necessarily represented in the data. Finding the tools to screen for IP address location can be tricky, and this package simplifies checking for and excluding participants based on common data that Qualtrics reports such as geolocation, IP address, duplicate records from the same location, participant screen resolution, participant progress through the survey, and survey completion duration.
There are no similar packages to my knowledge. The {qualtRics} package at rOpenSci focuses on importing data from Qualtrics. The {MTurkR} package directly interfaces with the MTurk Requestor API, but the APIs have been deprecated and the package has been removed from CRAN.
Yes, it seems to comply with this guidance. Depending on the data that the user collects, there could be personally identifiable information accessed by this package. In particular, IP addresses that are recorded by Qualtrics can be processed with this package. Note that the package only works with personally identifiable information from data sets that already exist on the user's local file system, and the package does not collect or transmit data in any way. The package also includes a function `deidentify()` that the user can use to strip location, IP address, language, and even participant computer information (e.g., operating system, web browser, screen resolution) from the data frames to deidentify them.

https://github.com/ropensci/software-review/issues/454
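The idea behind `deidentify()` can be sketched in base R (the column names below are illustrative guesses at Qualtrics metadata fields, and the real function's signature may differ):

```r
# Illustrative sketch: drop potentially identifying metadata columns
# from a data frame (hypothetical column names).
strip_identifiers <- function(data,
                              cols = c("IPAddress", "LocationLatitude",
                                       "LocationLongitude", "UserLanguage",
                                       "Browser", "OperatingSystem",
                                       "Resolution")) {
  data[, setdiff(names(data), cols), drop = FALSE]
}
```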
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
[X] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)).
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution.*)

Code of conduct