Closed JeffreyRStevens closed 3 years ago
@ropensci-review-bot check package
Thanks, about to send the query.
git hash: 1d8446c9
Important: All failing checks above must be addressed prior to proceeding
Package License: GPL (>= 3)
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
The package has:

- code in R (100% in 35 files)
- 1 author
- 1 vignette
- 3 internal data files
- 8 imported packages
- 54 exported functions (median 16 lines of code)
- no non-exported functions in R (median 20 lines of code)

---

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages. The following terminology is used:

- `loc` = "Lines of Code"
- `fn` = "function"
- `exp`/`not_exp` = exported / not exported

The final measure (`fn_call_network_size`) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

|measure                  | value| percentile|noteworthy |
|:------------------------|-----:|----------:|:----------|
|files_R                  |    35|       91.1|           |
|files_vignettes          |     1|       64.8|           |
|files_tests              |     6|       81.5|           |
|loc_R                    |   607|       50.7|           |
|loc_vignettes            |   118|       57.1|           |
|loc_tests                |   252|       57.1|           |
|num_vignettes            |     1|       60.7|           |
|data_size_total          | 18483|       74.3|           |
|data_size_median         |  6139|       75.9|           |
|n_fns_r                  |    54|       50.8|           |
|n_fns_r_exported         |    54|       88.4|           |
|n_fns_r_not_exported     |     0|        0.0|TRUE       |
|n_fns_per_file_r         |     1|        0.0|TRUE       |
|num_params_per_fn        |     4|       54.3|           |
|loc_per_fn_r             |    18|       69.6|           |
|loc_per_fn_r_exp         |    16|       39.6|           |
|loc_per_fn_r_not_exp     |    20|       77.1|           |
|rel_whitespace_R         |    19|       53.9|           |
|rel_whitespace_vignettes |    34|       75.0|           |
|rel_whitespace_tests     |    15|       78.7|           |
|doclines_per_fn_exp      |    48|       61.4|           |
|doclines_per_fn_not_exp  |     0|        0.0|TRUE       |
|fn_call_network_size     |    40|       57.0|           |

---
Interactive network visualisation of calls between objects in package can be viewed by clicking here
### `goodpractice` and other checks

### 3a. Continuous Integration Badges

[![github](https://github.com/jeffreyrstevens/excluder/workflows/R-CMD-check/badge.svg)](https://github.com/jeffreyrstevens/excluder/actions)

**GitHub Workflow Results**

|name          |conclusion |sha    |date       |
|:-------------|:----------|:------|:----------|
|pkgdown       |success    |1d8446 |2021-07-26 |
|R-CMD-check   |success    |1d8446 |2021-07-26 |
|test-coverage |success    |1d8446 |2021-07-26 |

---

### 3b. `goodpractice` results

### `R CMD check` with [rcmdcheck](https://r-lib.github.io/rcmdcheck/)

R CMD check generated the following note:

1. checking Rd cross-references ... NOTE
   Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’

### Test coverage with [covr](https://covr.r-lib.org/)

Package coverage: 81.79

### Cyclocomplexity with [cyclocomp](https://github.com/MangoTheCat/cyclocomp)

The following function has cyclocomplexity >= 15:

function | cyclocomplexity
--- | ---
check_duplicates | 19

### Static code analyses with [lintr](https://github.com/jimhester/lintr)

[lintr](https://github.com/jimhester/lintr) found the following 201 potential issues:

message | number of times
--- | ---
Lines should not be more than 80 characters. | 201
|package  |version   |
|:--------|:---------|
|pkgstats |0.0.0.265 |
|pkgcheck |0.0.1.367 |
This package may be submitted
The only check that failed was "Package does not have a 'contributing.md' file". Does this mean that the package should not have a contributing.md file? So should I just remove the file and remove all references to Contributing to this package?
Thanks for the submission @JeffreyRStevens! Sorry for a bit of a delay as we were working out our new automated diagnostics bot, which your review is the first to use, and you just found a bug in! Your CONTRIBUTING.md file is fine; we are just failing to check the right subfolder. We'll move ahead with this :)
Ah, OK--no problem. Glad to be a guinea pig to help debug!
@JeffreyRStevens, it's my pleasure to be the handling editor of your submission.
Congratulations! The bot and I are very happy to see the package meets rOpenSci's guidelines :+1:. I'll start looking for reviewers.
Note the following minor issues you might want to consider before the review:
`devtools::check()` shows:

```
❯ checking Rd cross-references ... NOTE
  Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’
```
`check_duplicates()` seems to have relatively complex logic. It may be justified by the goal of the function (to check for various conditions), but it might be worth ensuring there is no more logic than necessary.
Some `if()` statements seem fragile, as they could receive inputs of length greater than 1. Consider the argument `dupl_ip` to `check_duplicates()`:
```r
# Good
dupl_ip <- TRUE
if (identical(dupl_ip, TRUE)) {
  message("Do something.")
}
#> Do something.

# Good
dupl_ip <- TRUE
stopifnot(length(dupl_ip) == 1L)
if (dupl_ip) {
  message("Do something.")
}
#> Do something.

# Fragile
dupl_ip <- c(TRUE, FALSE) # This might accidentally have length > 1
if (dupl_ip == TRUE) {
  message("Do something.")
}
#> Warning in if (dupl_ip == TRUE) {: the condition has length > 1 and only the
#> first element will be used
#> Do something.
```
Created on 2021-08-07 by the reprex package (v2.0.0)
Some lines are longer than 80 characters. Wrapping at 80 characters makes the code easier to review, without having to scroll right when navigating the script.
I see two methods used to avoid the R CMD check NOTE about undefined global variables: `var <- NULL` and `.data$var`. Maybe best to use only one method?
In R/ I see a subdirectory (R/deprecated). I find this strange. How is it used?
`devtools::test()` shows:

```
check-mark-exclude:
Note: Using an external vector in selections is ambiguous.
i Use `all_of(location_col)` instead of `location_col` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
```
Consider using `usethis::use_r()` and `usethis::use_test()`, and structuring tests so that for each exported `fun()` in R/fun.R you have a test in tests/testthat/test-fun.R.

Reviewers:
Due date:
@ropensci-review-bot seeking reviewers
Please add this badge to the README of your package repository:
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/455_status.svg)](https://github.com/ropensci/software-review/issues/455)
Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news
@JeffreyRStevens, could you suggest two or three potential reviewers? Although I wouldn't pick more than one, your list will inform the type of expertise you think would be useful when reviewing {excluder}.
I'll also use other criteria as described in How to look for reviewers.
@maurolepore thank you for your speedy initial review of {excluder}. I have pushed some changes to address your comments.
- Regarding `devtools::check()`: I am not able to replicate your error regarding `Packages unavailable to check Rd xrefs: ‘qualtRics’, ‘rgeolocate’`. Would you mind checking if this is still an issue in the new version? If so, does this mean that you need to have those packages installed on your local machine to pass the check?
- Regarding the logic in `check_duplicates()`: I'm open to ideas from you or the reviewers, though.
- Regarding the `if()` statements: are you suggesting that I check whether the `dupl_id` argument is identical to TRUE or of length 1? And, if so, I assume you'd like me to look at other `if()` statements to see if the same issue applies? Just want to clarify the action items for this issue.
- In `remove_label_rows.R` lines 53-59, I broke this over multiple lines to minimize characters > 80, but it resulted in several lines slightly longer than 80 characters rather than just one line much longer. Which is preferred?
- I used `exclusions <- NULL` in `collapse_exclusions.R` because I can't figure out how to use `.data` in a way that avoids the R CMD note. If I use `.data$exclusions` to replace `exclusions` in the `unite()` function (line 55) or the column definition in the `mutate()` function (line 60), I receive a compiling error when running the check. The current version has removed the `exclusions <- NULL`, which results in the R CMD note. One option would be to reinsert the `NULL` definition and remove all `.data` from this particular function. But would I need to replace all `.data` with `NULL` definitions in all functions, or just this one?
- Regarding the `.gitignore` file: it has now been removed.
- I have fixed the `all_of(location_col)` warning in the test.

Thank you again for your careful review, and let me know if I missed something.
Thanks @JeffreyRStevens for responding quickly.
- Regarding `devtools::check()`: here is a reprex -- collapsed to save space.
- Inside `if()` I suggest you use `identical()`, not `==`, because `if()` only takes objects of length 1. `identical()` always returns an object of length 1, but the output of `==` can be longer. If you do use `==` inside `if()`, then be sure to first assert the input is indeed of length 1 (e.g., with `stopifnot()`).
When styling your code I suggest you stick to lines of 80 characters or less. The goal is to be friendly to the reviewers, who may adjust their screens to show the typical 80 characters. Longer lines force them to scroll right, which sounds like a minor issue but can get annoying after a while. The image below shows how the first line of code overflows my 80-character marker. Under that, I have the same code styled in different ways -- all fit in my screen.
In `?tidyr::unite()` I see the argument `col` is "The name of the new column, as a string or symbol". I think symbols (e.g., `exclusions`) are allowed for backward compatibility, and I believe the tidyverse now recommends reserving symbols for existing columns and strings (e.g., `"exclusions"`) for new columns. Note the examples section of `unite()`'s helpfile uses strings (`"z"` and `"xy"`) -- not symbols (`z` or `xy`). I tried `"exclusions"` and successfully removed the NOTE. Line 60 looks good to me.

Thanks for your work and suggestions for reviewers.
Many thanks for the clarifications and help, @maurolepore.
I have changed `==` operators to `identical()` in `if()` statements that use user-inputted arguments. I have added `stopifnot()` statements when other operators (e.g., `%in%`) are used in `if()` statements with user-inputted arguments.

@ropensci-review-bot add @juliasilge to reviewers
@juliasilge added to the reviewers list. Review due date is 2021-09-20. Thanks @juliasilge for accepting to review! Please refer to our reviewer guide.
@juliasilge: If you haven't done so, please fill this form for us to update our reviewers records.
@ropensci-review-bot add @jmobrien to reviewers
@jmobrien added to the reviewers list. Review due date is 2021-09-20. Thanks @jmobrien for accepting to review! Please refer to our reviewer guide.
@jmobrien: If you haven't done so, please fill this form for us to update our reviewers records.
@JeffreyRStevens, I'm thrilled that @juliasilge and @jmobrien accepted to review the excluder package.
Note @ropensci-review-bot set the due date following rOpenSci's guidelines to 2021-09-20. However, these are difficult times, and I would like to allow reviewers 1 extra week. Feel free to submit your review whenever ready, but if you don't, I'll touch base on 2021-09-27.
I look forward to working with you all.
Congratulations to the author on this useful package for folks handling Qualtrics surveys. It will be convenient to have all these common checks in a consistent set of functions, and the messaging in the console is very nice. 🙌
The package includes all the following forms of documentation:
- [X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).

On documentation, I find the way some of the documentation works a bit confusing, especially when things go wrong for users. For example, if I have a dataset with no IP addresses or location information and I try to run `exclude_duplicates()`, I get this error:

```
Error in check_duplicates(x, quiet = TRUE, ...) :
  The column specifying location ('location_col') was not found.
```

However, when I look at the documentation for the function I used, I don't see anything about `location_col`, and it's not clear which link I should click to find the relevant info. Some options might be to change the documentation for the dots or write a clearer error message.
Estimated hours spent reviewing: 2
🎯 The function name `collapse_exclusions()` reminds me a bit too much of `dplyr::collapse()` when what it does is much more like `tidyr::unite()`. What about a name like `unite_exclusions()`? Along those lines, did you consider names like `filter_*()` instead of `exclude_*()`? A benefit to using names that align with other names in the ecosystem is that these functions then slot right into people's mental model of what they are doing.
🎯 None of my real surveys have IP addresses so I was not able to check the iptools integration beyond the example survey data included in the package.
🎯 In the future, if you wanted to increase the polish of the console messages (one of the nicest features of this package, and one that I think will draw folks to use it), you might check out using cli.
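For a small taste of what cli offers (a sketch; the message text and count are made up):

```r
library(cli)

n_excluded <- 3
# cli handles inline interpolation and pluralization ({?s}) automatically
cli_alert_info("Excluded {n_excluded} row{?s} based on survey duration.")
```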
@juliasilge thanks for your review. @jmobrien when do you think you'll be able to submit yours?
I will have it in by later today, thanks.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions (but see below)
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with `URL`, `BugReports` and `Maintainer` (which may be autogenerated via `Authors@R`).
Estimated hours spent reviewing: 7
I quite like this package. It fills a niche for Qualtrics data cleaning that is difficult to automate and/or make auditable/reproducible, given how Qualtrics designed server-side data filters.
As the author notes, in human participants research it's also important to not unduly exclude data, and the manual review this helps facilitate is supportive of that ethical imperative. So, generally, I think the paradigm of excluder makes a lot of sense--a simple interface for early-stage review and removal of problem cases.
The package seemed to work as expected. None of the data I had available (apparently) had any non-US participants, so I also was not able to test IP-checking as thoroughly as might be warranted, but it seems like a sound implementation.
A few things might warrant either review or improvement. Nothing terribly serious or strictly essential to fix right now--though with one issue (centering everything around the `check_*()` functions), I do wonder whether it might make growing/maintaining this package harder in the long run.
Style & dependencies
- `dplyr::bind_cols()` is mixed with `rbind()`.
- `dplyr::pull()` is regularly used for indexing that could be handled by `[[]]`.
- Some function calls reference function re-exports rather than the originals, e.g., `dplyr::all_of()` instead of `tidyselect::all_of()`.

Interaction between `check`, `mark`, and `exclude` verbs internally & in workflow:
- Currently, most of the work happens in the `check_*()` functions. The `mark_*()` functions pass everything to `check_*()`, extract a specified ID variable from that, add a marker text variable, then merge this back with `left_join()`. The `exclude_*()` functions pass everything to `check_*()`, then use `anti_join()` on just the ID variable.
- The `mark_*()` functions end up with `NA` in non-marked rows of `exclusion_` variables. This makes manual filtering on these columns more difficult.
- `mark_duration()` can't easily mark the difference between a too-short and a too-long response, which might be a useful thing to know.
- `mark_*()` and `exclude_*()` use ellipses for arguments to be passed to `check_*()`, meaning you have to look up `check_*()` to get comprehensive help on how `mark_*()` and `exclude_*()` operate. (This specific issue could probably be fixed with a few explicit arguments in each function, even if everything else is left the same.)
- Have you considered centering things on the `mark_*()` functions, rather than `check_*()`? The `mark_*()` functions could cover all reporting. This could even enable some time-saving advantages from `summarize()`-like reporting (e.g., you could see at a glance if all short-duration cases were also those out of the US).
- The `exclusion_*` variables from `mark` could use booleans, 1/0, or even just `""`, for easier support of manual filtering if desired.
- `check_*()` and `exclude_*()` could be simplified: `check` and `exclude` could just call `mark` internally plus something like `dplyr::filter()`. `check` and `exclude` could even be reduced to just two general-purpose functions that keep/exclude based on any added `exclusion_` variables.

Few small other things:
- `check_ip()`, `exclude_ip()`, and `mark_ip()` are backwards--each retains/drops cases from outside the specified country, not from inside it as indicated in the documentation.
- rOpenSci suggests an `[object]_[verb]` structure, while this uses the reverse. Personally, I think the current naming is fine, though.

@jmobrien thanks a lot for your review.
@JeffreyRStevens, you can now go ahead and address reviewer's comments. rOpenSci's guidelines seem to not limit how long you should take, but I think it's best to respond ASAP -- while the reviewers keep the context in their heads.
Please let me know if you have questions.
Yes, many thanks @juliasilge and @jmobrien for your helpful reviews! I've done a quick skim and will review in-depth very soon. I may have a few questions before I get too far into revisions.
@juliasilge, many thanks again for your helpful comments. Here are some responses/questions for your points:
- The `exclude_duplicates()` error: I think I see the issue here, and I believe that addressing some of @jmobrien's concerns about the functions will address this. I'll return to this comment after I address his concerns.
- Renaming `collapse_exclusions()`: This was a great idea. I have now renamed this to `unite_exclusions()` and marked `collapse_exclusions()` as deprecated using the {lifecycle} package. I have now pushed this change.
- Renaming the `exclude_*()` functions to `filter_*()`: I really like this idea, too. But one question that I had is whether I should rather use `filter_*()` to replace the `check_*()` functions, since the tidyverse `filter()` function retains the specified rows rather than removes them. In this sense, it seems to be more equivalent to `check_*()` rather than `exclude_*()`, in that `check_*()` retains the rows of data that are being checked. Or maybe I'm thinking of it backwards. What are your thoughts on this?
- Using {cli}: I hadn't seen that before, and it looks like a very useful package for customizing messages. I'll definitely be figuring out ways to integrate that into the console messages.

I'll have some questions for @jmobrien in the coming days.
I was really happy to see @jmobrien's thorough review and I agree that working through some of his suggestions will address the problems with the documentation and error messages.
I'm not sure what we really mean by a difference between "retain" vs. "throw away" (just reverse the condition, which in the case of, say, IP addresses in/out of the US doesn't seem like a super one-way thing), but I think I see your point. I guess my opinion is that creating three new verbs for operations that are arguably the same as things many people already use makes this toolkit less easy to use than if they were given names that allude to the ways in which they are the same as other operations. Another option may be to use `mark_` + `filter_` + `exclude_`?
I also will say that I generally agreed with all that @juliasilge said in her review and thought it was on point for usefulness. Where I didn't echo her sentiments directly, it was mostly for brevity because I was the second to submit an already-long review. I especially agree that the console reporting is a big sell here.
Here, @JeffreyRStevens, I'm understanding you as saying `exclude_duration([longer than 30 mins])` would remove all >30 minute cases, while `filter([duration > 30 mins])` would keep those same cases and discard others. So, at least by default, the behavior of `filter` is more conceptually similar to `check` than `exclude`. That right? If so, I see where you're coming from also.
From a more data-centric perspective on the potential userbase, I agree w/where Julia's coming from. The package does things very close to core tidyverse operations for data wrangling. There's something to be said for capitalizing on what those names invoke.
With my social scientist hat on, though, I feel much more positively about the existing names. This whole package really just seems like a batch of convenience functions that abstract lower-level data wrangling operations into key tasks for ethically managing human-participants data. As tasks, I think the names make sense: manually "check" problems first, then "exclude" them prior to analysis.
I don't have a strong opinion about which way naming should go. Though FWIW, I suppose similar ideas are embedded in my speculation about whether functionality should center on `mark`. With that workflow, the `check` and `exclude` verbs could keep their core purpose as convenience functions called downstream of `mark` (give me all the marked stuff for review vs. get rid of it). But if you find yourself needing something more complex, you can just do a standard call to `filter` instead.
(Or, if you're feeling really ambitious, maybe you could make everyone happy by adding something like `filter_exclusions()` that could emulate the typical filter syntax for more complex cases where the convenience functions fall short? So, e.g., `filter_exclusions(!ip, duration)` could mean "give me all the cases that were NOT marked as out of the acceptable IP address space, but which were still too long", etc. Could maybe also work around whether `unite_exclusions()` has been called yet or not.)
I don't know that I'd recommend adding another set of functions to maintain here. And for the record, my opinion on the naming is not super strong either; I'm raising this mostly because this is something I've run into before and in the past I have always regretted when I have not used what already exists in terms of vocabulary, norms, idioms, function names, etc. and then later realized I could have.
Thank you two for your thoughts on this. Yes, my thought was that `filter_`, `mark_`, and `exclude_` would more closely match the way `filter()` works (that is, it retains rows satisfying the condition). But I also share @jmobrien's point that `check_` might make more sense from a general-user perspective, even though `filter_` makes sense from a tidyverse user/developer perspective. I'll grapple with this in the coming days.
I really like the idea of centering the functions around `mark_` rather than `check_`. @jmobrien, just to clarify, you were also suggesting removing all of the `check_` and `exclude_` functions and instead only having two functions: `check` and `exclude`, correct? So basically, you would always run the `mark_` functions to find the exclusions, then pipe that to `check` to view/store the exclusions or pipe to `exclude` to remove the exclusions. Is this what you were suggesting? This sounds much easier to maintain and probably to use as well.
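If it helps the discussion, here is a rough sketch of what those two general-purpose verbs might look like, assuming the `mark_*()` functions add columns prefixed with `exclusion_` (all function bodies here are hypothetical, not the package's actual implementation):

```r
library(dplyr)

# Hypothetical general-purpose verbs built on mark_*() output.
# Assumes mark_*() adds text columns named exclusion_* that are
# NA/empty for rows that passed.

# check(): keep only rows flagged by any exclusion_ column
check <- function(data) {
  filter(data, if_any(starts_with("exclusion_"), ~ !is.na(.x) & .x != ""))
}

# exclude(): drop all flagged rows
exclude <- function(data) {
  filter(data, if_all(starts_with("exclusion_"), ~ is.na(.x) | .x == ""))
}

# Hypothetical usage:
# survey %>% mark_duration(min_duration = 100) %>% mark_ip() %>% exclude()
```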
That was one option, yes. I guess in that simplifying approach `check` and `exclude` would need to recognize what parts the `mark` functions had looked at previously. You could set that up a number of different ways.

You could also preserve the existing `check`/`exclude` function sets if you prefer a more standalone design where each verb can operate on fresh data directly. Even then, as outlined above, I think there are several benefits to having `check` and `exclude` call `mark` internally, rather than `mark` and `exclude` calling `check`.
There's certainly some draw to option 1. But both options have unique advantages, and I don't want to dictate your design choices.
Option 1's a more major change, though, so if you're leaning that way I might get others' opinions first.
I'm close to finishing up revisions. I have a couple more changes to make and had two questions for @jmobrien.

1. Is there a performance benefit to centering everything on the `mark_*()` functions?
2. For the `check_*()` and `exclude_*()` functions, you suggested explicitly including some arguments. Can you point me to how to do this? My efforts to do this have not worked out.

@JeffreyRStevens, responses below:
I think the extra joins in the current design might slow things down, but I don't know how much. And having just unified `check` and `exclude` functions would in theory be slightly more efficient than independent ones because, e.g., the IP address checking code would only be called at the start when the dataset is marked. My guess is this isn't a big deal in most use cases, though, and probably even less than I was thinking it might be, even with big datasets. So, it's a point of optimization, but not a big priority.
Can you clarify what's giving you trouble? In many cases it can be handled straightforwardly. Here's an example from `qualtRics`, where `fetch_survey()` has arguments that get passed directly to `read_survey()`, using the same argument names in both functions: https://github.com/ropensci/qualtRics/blob/master/R/fetch_survey.R
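The general pattern is just to re-declare the same arguments in the wrapper and forward them by name, e.g. (a schematic sketch only; these argument names are illustrative, not excluder's actual signatures):

```r
# Inner function owns the behavior and the documented arguments
mark_duration <- function(x, min_duration = 10, quiet = FALSE) {
  # ... marking logic ...
  x
}

# Wrapper repeats the arguments explicitly instead of using `...`,
# so its own help page can document them directly
exclude_duration <- function(x, min_duration = 10, quiet = FALSE) {
  marked <- mark_duration(x, min_duration = min_duration, quiet = quiet)
  # ... filter out the marked rows ...
  marked
}
```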
Thanks, @jmobrien. I figured out (2) with your examples. I expect to have everything wrapped up in the next week.
OK, I think that I have addressed all of the concerns. Please let me know if I missed something or you have additional questions. Note that I have increased the version to 0.3.0 since I have made major changes to the code and deprecated a function.
- Documentation issue with `exclude_duplicates()`: This has been addressed by explicitly including arguments from `mark_*()`. Commit 1b74c4ce
- Renaming `collapse_exclusions()`: This was a great idea. I have now renamed this to `unite_exclusions()` and marked `collapse_exclusions()` as deprecated using the {lifecycle} package. Commit 68aae92b.
- Renaming the `exclude_*()` functions to `filter_*()`: I'm still torn on this, but I think that I will keep the names as originally named. In part, this is due to the confusion that we've had about whether `check` or `exclude` should be changed to `filter`. I now use filter for both types of actions. Maybe a compromise is to use the word 'filter' more in the documentation?
- Using {cli}: I hadn't seen that before, and it looks like a very useful package for customizing messages. I'll definitely be figuring out ways to integrate that into the console messages.
- Mixing `dplyr::bind_cols()` and `rbind()`: I have switched all `rbind()` uses to `dplyr::bind_cols`. Commit 89294932
- Using `dplyr::pull()` instead of `[[]]`: I have switched all `dplyr::pull()` uses to `[[]]`. Commit ec655f9b
- Referencing re-exports rather than originals: I have switched `dplyr::all_of()` and `dplyr::any_of()` to `tidyselect::all_of()` and `tidyselect::any_of()`. Are there ways to easily find the original package for all of your functions in a script? Commit face90d
- Centering things on the `mark_*()` functions: I have now done this. The `check_*()` and `exclude_*()` functions now just run the `mark_*()` functions and filter exclusions in or out. Commit 877e3b9
- `mark_duration()` does not differentiate between too long and too short: `mark_duration()` now marks quick and slow durations separately. Commit d45e44a6
- `mark_*()` and `exclude_*()` use ellipses for arguments to be passed to `check_*()`: This has been addressed by explicitly including arguments from `mark_*()`. Commit 1b74c4ce
- Reducing `check` and `exclude` to two general-purpose functions: I've decided to hold off on this for the moment. Instead, I have offered an example in the vignette on piping multiple `mark_*()` functions and then checking with a filter. Commit 12b4ad2b
- Using `&&` instead of `&`: I have now replaced `&` with `&&`. Commit b9aa1b04
- `[object]_[verb]` naming: While I saw that this was a preference for rOpenSci packages, it does not seem natural for this set of functions. `[verb]_[object]` both categorizes the verbs better and matches {tidyverse} syntax (e.g., `bind_rows()`/`bind_cols()`) as well as the {qualtRics} syntax (e.g., `fetch_survey()`).
- (`duplicates.R`). Commit 3c88d484

Thanks everyone! It's been great to watch this review.
Dear reviewers @juliasilge and @jmobrien,
At this stage, we ask that you respond as to whether the changes sufficiently address any issues raised in your review. We encourage ongoing discussion between package authors and reviewers, and you may ask editors to clarify issues in the review thread as well. -- https://devguide.ropensci.org/reviewerguide.html#followupreviewer
Generally, I think this looks quite good. I just want to do a bit more due diligence on the recentering to `mark` before giving final approval, since that was the most major suggestion from me.
I did happen to look at the `*_ip()` family of functions about a week ago (following the reorganization, just before @JeffreyRStevens's most recent post). One thing I noticed was that the transition towards `mark` didn't end up streamlining the internals of data processing in quite the way I expected.

BUT--I then fiddled with it briefly myself, and started to think my expectations were mis-calibrated. In the `*_ip()` functions, for instance, the `iptools` functions being used have some unforgiving aspects that needed designing around. I didn't previously appreciate that issue, and it's changed my perspective quite a bit.
So, I still want to complete a full look-though, but I'm feeling pretty optimistic that everything should be fine.
I now have a survey that has some IP addresses and the iptools integration looks good. 👍
Using `mark_*()` as a function to build off of looks like it worked really well, but now there is a fair amount of repeated code that you might want to abstract out into functions. For example, for `exclude_*()`, wrap something like this in `if (identical(silent, FALSE))`:
```r
print_exclusion <- function(remaining_data, x, msg) {
  n_remaining <- nrow(remaining_data)
  n_exclusions <- nrow(x) - n_remaining
  message(
    n_exclusions, " out of ", nrow(x),
    msg, n_remaining, " rows."
  )
}
```
You might want to make something like that for all the invisible returning too.
- I found myself surprised when the `exclude_*()` functions did not return their results visibly (e.g., calling `exclude_duration(survey_raw, min_duration = 100)` and thinking "where is my data?"). This was confusing the first time I reviewed too, so I probably should have mentioned it then 🙈, but I wonder if this should be reevaluated. A switch in default to `print = TRUE`?
?@juliasilge, thank you for the follow up. That is a great idea to start building more utility functions to reduce repeated code. I'm now working on both the print_exclusion()
function and the invisible returning (and will look for other opportunities to replace repeated code with functions). As a side note, I started using the {cli}
package to retool the print exclusion messages, and I really like it. So I'm going to go ahead and work at revamping all of the messages.
Yes, I switched the default for the `exclude_*()` functions to be `print = FALSE` in a recent update. My thinking was that you probably do want to actually see the 'problematic' rows when using the `check_*()` functions because you are checking. You don't need to see the rows when using the `mark_*()` functions because they return the whole data set with columns added to the end (which would be difficult to see when printed to the console). When excluding, you likely don't need to see the remaining rows because (1) they are not problematic and (2) likely the vast majority of the original data will not be excluded, so there would be a lot of rows printed to the console. But if you think that it makes more sense to return to printing the output of the `exclude_*()` functions, I can do that---especially if you were expecting to see the output.
@JeffreyRStevens, have you seen these?

- Ellipses
- When to print the return value or return the first argument invisibly
@juliasilge, I have now created four new utility functions (in `utils.R`) that clean up code in the verb functions:

- `validate_columns()` is used in the `mark_*()` functions to check the number, names, and types of input columns before data processing.
- `mark_rows()` is used in the `mark_*()` functions to create the new column that marks rows that meet exclusion criteria.
- `print_exclusions()` is used in the `exclude_*()` functions to print a message to the console with the number of excluded rows.
- `print_data()` is used in the `check_*()` and `exclude_*()` functions to print the data to the console.

I have also switched all messages to the `cli::cli_alert_info()` syntax (though error messages continue to use `stop()`).
I have made `print = TRUE` the default for all `exclude_*()` functions.
These new utility functions look great! 🙌 There is a lot less repeated code and I think this will make the package much more maintainable and easy to contribute to.
The new cli use turned out slick; I have had such a good experience using that package, and I hope that it works out as well for you.
I still am not sure about how the functions do or do not return invisibly. It is confusing and unexpected to me that they have different behaviors by default. Those links that Mauro shared are super good and important here; these are not functions that are called largely for their side effects (like writing a file to disk), but rather we directly want the function output (the data) to use in a pipeline, to summarize, and generally to compute on. As I tried to use all three function types, I was automatically looking for how many rows there were, how many columns there were, what the new columns created by `mark_*()` were called, etc. I do think all the functions should have the same default, and my opinion is that the right call is not to return invisibly.
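The visibility distinction under discussion can be seen with a toy sketch (generic functions, not from the package):

```r
# Two functions that both return their input unchanged; only the
# visibility of the return value differs.
return_visibly   <- function(x) x
return_invisibly <- function(x) invisible(x)

# Both behave identically under assignment or in a pipeline...
identical(return_visibly(mtcars), return_invisibly(mtcars))
# ...but calling return_invisibly(mtcars) at the console prints
# nothing, which is surprising when the data itself is the product.
```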
@juliasilge, the `mark_*()` functions now return not invisibly, so all three verbs should return explicitly by default.

@jmobrien, will you be able to wrap up your review soon?
@maurolepore, what are the next steps now? FYI, I'm working on a paper for JOSS. I'm not sure if that should be part of the review here or just over at JOSS.
@JeffreyRStevens,
I'm watching your discussion with the reviewers. Do you feel the ball is no longer in your court? If so, I would ask the reviewers once again whether your changes sufficiently address the issues they raised in the review. Please confirm and I'll write a new comment with an @mention to them.
We no longer review submissions to JOSS, but do mention the review -- I think you might get fast-tracking.
@maurolepore, yes, I think that I've addressed the reviewers' concerns, and I'm waiting for their final approval. Thanks.
Dear reviewers,
@jmobrien, do you have an estimate about when you might be able to confirm whether the changes made are sufficient to approve this package?
@juliasilge, it seems your last concern (return visibly) has been addressed. Is there anything else you'd like to request or do you confirm the changes made are sufficient to approve this package?
Mauro and Jeffrey, apologies--I've had a remarkable variety of unexpected home and family issues come up over just this past week. I'm about to go handle what I hope is the last of them, but I'll make it a priority to finish everything tomorrow. Thanks for your patience.
For me, the changes have addressed the issues I was interested in, and I approve it moving forward. ✅
Date accepted: 2021-11-04
Submitting Author Name: Jeffrey Stevens
Submitting Author Github Handle: @JeffreyRStevens
Repository: https://github.com/JeffreyRStevens/excluder
Version submitted: 0.2.2
Submission type: Standard
Editor: @maurolepore
Reviewers: @juliasilge, @jmobrien
Due date for @juliasilge: 2021-09-20
Due date for @jmobrien: 2021-09-20
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The package falls under data munging because it processes data from Qualtrics or other online sources by checking for, marking, and excluding rows of data frames for common exclusion criteria (e.g., IP addresses outside of the United States or duplicate entries from the same location/IP address).
The target audience is data scientists using Qualtrics or other online systems to collect data from participants (e.g., Mechanical Turk workers). Ensuring good data quality from these participants can be tricky. For instance, while Mechanical Turk in theory screens workers based on location (e.g., if you want to restrict your participant pool to workers in the United States), this is not necessarily represented in the data. Finding the tools to screen for IP address location can be tricky, and this package simplifies checking for and excluding participants based on common data that Qualtrics reports such as geolocation, IP address, duplicate records from the same location, participant screen resolution, participant progress through the survey, and survey completion duration.
There are no similar packages to my knowledge. The {qualtRics} package at rOpenSci focuses on importing data from Qualtrics. The {MTurkR} package directly interfaces with the MTurk Requestor API, but the APIs have been deprecated and the package has been removed from CRAN.
Yes, it seems to comply with this guidance. Depending on the data that the user collects, there could be personally identifiable information accessed by this package. In particular, IP addresses that are recorded by Qualtrics can be processed with this package. Note that the package only works with personally identifiable information from data sets that already exist on the user's local file system, and the package does not collect or transmit data in any way. The package also includes a function `deidentify()` that the user can use to strip location, IP address, language, and even participant computer information (e.g., operating system, web browser, screen resolution) from the data frames to deidentify them.

https://github.com/ropensci/software-review/issues/454
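The idea behind `deidentify()` can be sketched in base R (the column names below are illustrative guesses at Qualtrics metadata fields, and the real function's signature may differ):

```r
# Illustrative sketch: drop potentially identifying metadata columns
# from a data frame (hypothetical column names).
strip_identifiers <- function(data,
                              cols = c("IPAddress", "LocationLatitude",
                                       "LocationLongitude", "UserLanguage",
                                       "Browser", "OperatingSystem",
                                       "Resolution")) {
  data[, setdiff(names(data), cols), drop = FALSE]
}
```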
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
[X] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)).
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution.*)

Code of conduct