Closed rgknight closed 7 years ago
An alternative would be to keep the get_
syntax and develop families of check_
, count_
and get_
functions.
For example, you could have:
check_dupes
as abovecheck_complete
that alerts if there is missing data in the specified vars (NA
, NULL
or ""
)check_bindable
that checks whether two dfs have the same vars with the same data types (#50) And so on. This may be a better fit for the package given the clean_names
syntax.
In any case I think it would be worth making a stand one way or another on whether it's verb_object or object_verb, then being very rigorous about consistent naming, e.g., if it's verb_object should ns_to_percents
be convert_n_percent
?
Let me know what you think @sfirke and I can submit a PR for either check_dupes
or dups_check
(if you think it would be helpful -- I'm finding it very useful in my work).
Thanks for thinking through this in such detail. It'll take me a little while to fully respond, as I'm in a crunch time right now, but I wanted to acknowledge that I saw and appreciate this. On the check_dupes
: function: have you seen the assertr package? It does in-line data quality checks for productionized analysis.
I understand, also in a crunch time here.
Re: naming conventions
Having played with this a bit, it now seems to me like [verb][object] is the more natural syntax, since it's a six to one, half a dozen to the other thing, and [verb][object] mostly matches the existing janitor conventions and also matches the conventions of packages like devtools (e.g., install_[location]).
It may still be worth thinking about what verbs you would want janitor to have and which of the proposed functions you would want to include.
Re: checking functions
I have taken a look at assertr
. It looks helpful. I have started using loggr in my work, which I like because you specify the level of the alerts and it works with errors generated by any R function.
In any case, per the mission of janitor, the goal of any check functions would not be to create new features unavailable elsewhere, but rather to wrap the more general purpose tools to make a few of the most common checking predicates highly accessible.
I agree that verb_object
is the way to go. I'm open to check_dupes()
, per your argument that janitor makes other package functionality highly accessible to beginners. Perhaps we even note in the documentation where they can find more check_
-type functionality in other packages. It still feels a little bit like mission creep to open the can of worms of the assert-type checks, but let's take a stab at it. I know I'd use it in production immediately, which is a good test (and you note you've also found it useful).
I'm making more notes over in the PR.
I see you've broken the dupe_count
column from get_dupes()
into the dups_count
function and then changed get_dupes()
into dups_get
. I think it's simpler with one function, rather than toggling between them (when usually I want how many dupes and the offending rows to examine). But your implementation above of dups_get
is much cleaner and likely faster than get_dupes
as long as we can add back in the dupe_count
column. Which we could rename as long as we have so many other changes... maybe "n_dups" or "count_of_dups"?
I will break the PR up into bit sized pieces -- that definitely makes more sense.
I hear you on the mission creep of the style of checking I proposed. I think that we should scale it back so that janitor includes only some standard predicates used in data cleaning / checking, not the assertion handling. So check_dupes
would become no_dupes
or is_id
(similar to the Stata is.id, which I used all the time back in the day) and would return only true/false, instead of all the level nonsense I proposed. This would let you pass it into whichever checking paradigm you want (stopifnot
or assertr
or assertthat
or testthat
or etc).
Sorry for the confusion on the get_dupes
name -- I thought that I had rolled back the renaming in a later commit. Would you be OK with adding the count as an option to get_dupes
? Or break get_dupes
into two functions, one that returns just the duplicates and one that returns duplicates adorned with the count? The simple version of get_dupes
is 30x faster (#67), so I think it's worth making it accessible both with and without the count.
On the naming topic, I think you have something great in tabyl
and crosstab
. It would be helpful to think more about whether you want to treat these like ggplot
that you can modify with geom_
or theme
elements, or whether you want to handle modifications through options to tabyl
and crosstab
. If you want to modify them using separate functions, those functions should all start with the same prefix, so that new users know that this is something that primarily modifies a tabyl / crosstab (even if it could be used on any df). The prefix could be add_
or it could be adorn_
or something else entirely, but it will be more accessible to new users if all things that modify a tabyl / crosstab have the same prefix.
Closing this based on our new approach to planning, which supplants the conversation here.
It appears that janitor is evolving to have a family of duplicate-handling functions (#72 , #18).
I would like to propose that we adopt a common naming convention of
dup_
for this family of functions.The core functions that we could identify right now would be:
dups_check
Check whether there are duplicates and handle. Should be pipe-able.dups_count
Return a count of duplicatesdups_get
A simpler and faster (#67) version of the currentget_dupes
that eliminates the counting, leaving it up to the user whether she wants to merge in dups_count or runcount()
dups_fuzzy
ordups_possible
Implementation of #72This does read a little more awkward to me than the reverse (
check_dups
,count_dups
, etc), but it would allow for users to see all available duplicate handling functions by typing "dup_".The workflow would be:
dups_check
assertions to make sure that no duplicates are being introduced.dups_check
fails usedups_count
and/ordups_get
interactively to problem solveA basic implementation could be:
Example workflow
Thoughts?