Open coolbutuseless opened 4 years ago
Like @milesmcbain 's {wisegroup}, this hypothetical {strict_dplyr} would probably make use of {conflicted} to control which versions you wanted to use.
Tag me twice and I'm in.
Another take on this I've been mulling over is: Would it be possible to detect some dangerous usages of dplyr
via static code analysis?
For example, all these are dangerous because they save an object that contains groups:
library(dplyr)
star_var <-
starwars %>%
group_by(homeworld, species, gender) %>%
summarise(n_chars = n())
star_var2 <-
starwars %>%
group_by(homeworld, species, gender) %>%
count()
star_var3 <-
star_wars %>%
group_by(homeworld, species) %>%
mutate(max_mass = max(mass))
Maybe there could be a pipe that detects if it contains a group_by
that is unmatched by an ungroup
?
I think having a dplyr dev that is open to discussing this topic on this thread would make it more inclusive and useful. I also remember that @krlmlr is interested in this type of topic.
An example of how case_when()
might bite you - the following code doesn't do what you hope it might i.e. the use of a TRUE
fall through in case_when
can be painfull!
animal <- c('cat', 'dog', 'eagle')
case_when(
animal == 'catt' ~ 'mammal',
animal == 'dog' ~ 'mammal',
TRUE ~ 'bird'
)
I wrote more about case_when
here: https://coolbutuseless.github.io/2018/09/06/strict-case_when/
It'd be interesting to use modern rlang
and tidyverse
idioms to re-write that strict_case_when
.
There are many other cases where dplyr has behaved in ways I did not want because of typos and NA:
filter
where the conditional evaluates to NA
where I didn't expect anymutate
statement that uses a global environment variable because I misspelled a column in the data.frameleft_join
where I had assumed perfect one-to-one matching between the rows of the datasets, but one of the data.frames had multiple rows for a particular value.I'd be interested to hear where other users might see pain-points with dplyr!
Maybe there could be a pipe that detects if it contains a
group_by
that is unmatched by anungroup
?
Getting people to use a different pipe operator isn’t realistic imo.
@yonicd I agree that getting people to use a totally new pipe is not realistic in general.
However, this hypothetical {strict_dplyr} would include a drop-in replacement for the pipe. If you loaded this package, then it would override the dplyr/magrittr one and behave exactly the same, except that it would be more pedantic about ungroup
.
If you didn't load {strict_dplyr} then your code would behave "normally".
That is, {strict_dplyr} would be a subset of {dplyr} behaviour. In your script, you could write dplyr code as normal, then afterwards only have to add library(strict_dplyr)
and the same could would run as long as some stricter assumptions are enforced.
wouldn't that cause a conflicting NS with scrict_dplyr and another NS that uses %>%
? If not then I am for such a solution. From my experience with bplyr
, I would be surprised if this isn't the case though.
I am totally unsure of how namespace conflicts might work here. I'm hoping that {conflicted} offers something to help - I need to investigate!
I've sketched out a few ideas/thoughts in a package shell here: https://github.com/coolbutuseless/strictlyr
The consensus was that this seemed too much like "work" for an unconf, and nothing was done on this.
Development will continue over at: https://github.com/coolbutuseless/strictlyr
While {dplyr} is amazingly useful, it also has some behaviour that I need to remember to workaround or prepare for. E.g.
ungroup()
after doing agroup_by()
case_when()
allows things like:filter()
being a bit 'loose' on what it does if the filter statement results in an NA.Idea: create a package which contains versions of these {dplyr} functions but it more strict in what it will accept.
I'll flesh this out with some more concrete examples in the next week.
This isn't useful to everyone, so I'm not advocating changes to {dplyr} itself.
Other packages which aim to replace/re-implement parts of {dplyr}