Stricter {dplyr} - Githubissues

coolbutuseless commented 4 years ago

While {dplyr} is amazingly useful, it also has some behaviour that I need to remember to workaround or prepare for. E.g.

Remembering to ungroup() after doing a group_by()
case_when() allows things like:
- result is NA if no rule matches
- user can specify 2 rules which match the same input value. Only the first is used.
filter() being a bit 'loose' on what it does if the filter statement results in an NA.

Idea: create a package which contains versions of these {dplyr} functions but it more strict in what it will accept.

I'll flesh this out with some more concrete examples in the next week.

This isn't useful to everyone, so I'm not advocating changes to {dplyr} itself.

Other packages which aim to replace/re-implement parts of {dplyr}

@MilesMcBain and a solution to having to remember to 'ungroup()' - https://github.com/MilesMcBain/wisegroup
Faster alternatives to dplyr functions written with a {data.table} backend: https://github.com/TysonStanley/tidyfast
{bplyr} (formerly {noplyr}) which is a base R implementation of {dplyr} - https://github.com/yonicd/bplyr by @yonicd
Another base R implementation of {tidyverse} https://github.com/hrbrmstr/freebase. by @hrbrmstr

coolbutuseless commented 4 years ago

Like @milesmcbain 's {wisegroup}, this hypothetical {strict_dplyr} would probably make use of {conflicted} to control which versions you wanted to use.

MilesMcBain commented 4 years ago

Tag me twice and I'm in.

Another take on this I've been mulling over is: Would it be possible to detect some dangerous usages of dplyr via static code analysis?

For example, all these are dangerous because they save an object that contains groups:

library(dplyr)

star_var <-
  starwars %>%
  group_by(homeworld, species, gender) %>%
  summarise(n_chars = n())

star_var2 <-
  starwars %>%
  group_by(homeworld, species, gender) %>%
  count()

star_var3 <-
  star_wars %>%
  group_by(homeworld, species) %>%
  mutate(max_mass = max(mass))

coolbutuseless commented 4 years ago

Maybe there could be a pipe that detects if it contains a group_by that is unmatched by an ungroup?

yonicd commented 4 years ago

I think having a dplyr dev that is open to discussing this topic on this thread would make it more inclusive and useful. I also remember that @krlmlr is interested in this type of topic.

coolbutuseless commented 4 years ago

An example of how case_when() might bite you - the following code doesn't do what you hope it might i.e. the use of a TRUE fall through in case_when can be painfull!

animal <- c('cat', 'dog', 'eagle')

case_when(
  animal == 'catt' ~ 'mammal',
  animal == 'dog'  ~ 'mammal',
  TRUE             ~ 'bird'
)

I wrote more about case_when here: https://coolbutuseless.github.io/2018/09/06/strict-case_when/

It'd be interesting to use modern rlang and tidyverse idioms to re-write that strict_case_when.

There are many other cases where dplyr has behaved in ways I did not want because of typos and NA:

filter where the conditional evaluates to NA where I didn't expect any
mutate statement that uses a global environment variable because I misspelled a column in the data.frame
a left_join where I had assumed perfect one-to-one matching between the rows of the datasets, but one of the data.frames had multiple rows for a particular value.

I'd be interested to hear where other users might see pain-points with dplyr!

yonicd commented 4 years ago

Maybe there could be a pipe that detects if it contains a group_by that is unmatched by an ungroup?

Getting people to use a different pipe operator isn’t realistic imo.

coolbutuseless commented 4 years ago

@yonicd I agree that getting people to use a totally new pipe is not realistic in general.

However, this hypothetical {strict_dplyr} would include a drop-in replacement for the pipe. If you loaded this package, then it would override the dplyr/magrittr one and behave exactly the same, except that it would be more pedantic about ungroup.

If you didn't load {strict_dplyr} then your code would behave "normally".

That is, {strict_dplyr} would be a subset of {dplyr} behaviour. In your script, you could write dplyr code as normal, then afterwards only have to add library(strict_dplyr) and the same could would run as long as some stricter assumptions are enforced.

yonicd commented 4 years ago

wouldn't that cause a conflicting NS with scrict_dplyr and another NS that uses %>%? If not then I am for such a solution. From my experience with bplyr, I would be surprised if this isn't the case though.

coolbutuseless commented 4 years ago

I am totally unsure of how namespace conflicts might work here. I'm hoping that {conflicted} offers something to help - I need to investigate!

coolbutuseless commented 4 years ago

I've sketched out a few ideas/thoughts in a package shell here: https://github.com/coolbutuseless/strictlyr

coolbutuseless commented 4 years ago

The consensus was that this seemed too much like "work" for an unconf, and nothing was done on this.

Development will continue over at: https://github.com/coolbutuseless/strictlyr

ropensci / ozunconf19

Stricter {dplyr} #17