tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

`filter(.missing = )` option to optionally retain missing values #6560

Open DavisVaughan opened 1 year ago

DavisVaughan commented 1 year ago

Currently, filter():

A number of requests have come up in the past desiring:

Here are a few:

This is most obviously annoying when you have multiple columns to filter by:

library(dplyr)

df <- tibble(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)
df
#> # A tibble: 5 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 FALSE TRUE  TRUE 
#> 3 NA    NA    TRUE 
#> 4 NA    NA    FALSE
#> 5 NA    NA    NA

filter(df, x, y, z)
#> # A tibble: 0 × 3
#> # … with 3 variables: x <lgl>, y <lgl>, z <lgl>

filter(df, x | is.na(x), y | is.na(y), z | is.na(z))
#> # A tibble: 3 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 NA    NA    TRUE 
#> 3 NA    NA    NA

I propose a .missing = c("drop", "keep", "error") argument to filter() that would allow you to optionally keep rows with NA.
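
To make the three modes concrete, here is a minimal base R sketch. `combine_conditions()` is a hypothetical helper invented for illustration, not part of dplyr, and the eventual argument could well behave differently. It only shows how each mode might resolve an `NA` in the combined condition, using the same `x`, `y`, and `z` as the example above.

```r
# Hypothetical helper (not part of dplyr): combine logical conditions with `&`
# and resolve any NA in the result according to `.missing`.
combine_conditions <- function(..., .missing = c("drop", "keep", "error")) {
  .missing <- match.arg(.missing)
  conditions <- list(...)
  stopifnot(
    length(conditions) > 0,
    all(vapply(conditions, is.logical, logical(1)))
  )

  # Same semantics as filter(df, x, y): conditions are combined with `&`
  keep <- Reduce(`&`, conditions)
  missing <- is.na(keep)

  if (.missing == "error" && any(missing)) {
    stop(
      "Conditions evaluated to NA at positions: ",
      paste(which(missing), collapse = ", ")
    )
  }

  # "drop" treats NA as FALSE (today's behaviour), "keep" treats NA as TRUE
  keep[missing] <- .missing == "keep"
  keep
}

x <- c(TRUE, FALSE, NA, NA, NA)
y <- c(NA, TRUE, NA, NA, NA)
z <- c(TRUE, TRUE, TRUE, FALSE, NA)

# Row positions that each mode would retain
which(combine_conditions(x, y, z, .missing = "drop"))
#> integer(0)
which(combine_conditions(x, y, z, .missing = "keep"))
#> [1] 1 3 5
```

The `"keep"` result selects the same rows (1, 3, and 5) as the `x | is.na(x)` workaround shown earlier.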

We'd have to carefully analyze the boolean algebra here to make sure we are being consistent. In particular, if we do this I think we want the following pairs to give the same result, and I believe they do:

# these should be the same
filter(df, x, y, .missing = "drop")
filter(df, x & y, .missing = "drop")

# these should be the same
filter(df, x, y, .missing = "keep")
filter(df, x & y, .missing = "keep")

The "drop" case is probably already consistent because that is what we do today, and the "keep" case probably behaves like this, which also seems consistent:

na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}

na_to_true(TRUE & NA)
#> [1] TRUE
na_to_true(TRUE) & na_to_true(NA)
#> [1] TRUE
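
The single case above can be extended to every pair of `TRUE`, `FALSE`, and `NA`. This is only a sketch in base R: `na_to_false()` is a hypothetical counterpart to `na_to_true()` that stands in for the "drop" behaviour, and the check covers `&` only, since that is how multiple conditions are combined.

```r
na_to_true <- function(x) {
  x[is.na(x)] <- TRUE
  x
}

# Hypothetical counterpart modelling "drop": NA is treated as FALSE
na_to_false <- function(x) {
  x[is.na(x)] <- FALSE
  x
}

# All 9 pairs of TRUE / FALSE / NA
grid <- expand.grid(x = c(TRUE, FALSE, NA), y = c(TRUE, FALSE, NA))

# "keep": resolving NA after `&` should match resolving NA before `&`
keep_consistent <- identical(
  na_to_true(grid$x & grid$y),
  na_to_true(grid$x) & na_to_true(grid$y)
)

# "drop": the same check with NA treated as FALSE
drop_consistent <- identical(
  na_to_false(grid$x & grid$y),
  na_to_false(grid$x) & na_to_false(grid$y)
)

keep_consistent
#> [1] TRUE
drop_consistent
#> [1] TRUE
```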

When we do this, we should also think about whether vec_pall() or vec_pany() could be used in filter() in any way, since they are heavily optimized for performance.

shannonpileggi commented 1 year ago

Just chiming in here to say:

  1. I love the .missing argument!
  2. I love the last request listed 😂
  3. One more use case / request + discussion from Twitter 😊

williamlai2 commented 1 year ago

Perhaps also a warning when there are NA rows that are to be dropped when .missing is not supplied.

DavisVaughan commented 1 year ago

@williamlai2 I think the current behavior is right for most people, so I doubt we'd want to add a warning by default for something that people typically expect to happen

williamlai2 commented 1 year ago

Is it expected though? Judging by the links in your first post I don't think it is. I'm a long time user and had this issue yesterday. I didn't realise it was happening until I noticed a lot fewer rows than expected and had to look up the solution.

hadley commented 1 year ago

@williamlai2 we’re not reconsidering the default behaviour at this time.

moodymudskipper commented 1 year ago

I had a similar issue in {powerjoin} and addressed it by defining new operators for extended equality and row-wise %in%, which makes it a bit more flexible since we don't have to commit to the behavior for all arguments. I do see the value of the .missing arg though.

library(dplyr, warn.conflicts = FALSE)
df <- tibble(
  x = c(TRUE, FALSE, NA, NA, NA),
  y = c(NA, TRUE, NA, NA, NA),
  z = c(TRUE, TRUE, TRUE, FALSE, NA)
)

# extended equality "bone operator"
`%==%` <- function(x, y) {
  is.na(x) & is.na(y) | !is.na(x) & !is.na(y) & x == y
}

filter(df, ! x %==% FALSE, ! y %==% FALSE, ! z %==% FALSE)
#> # A tibble: 3 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 NA    NA    TRUE 
#> 3 NA    NA    NA

# row-wise `%in%`
`%in.%` <- function(x, y) {
  conds <- lapply(y, function(yi) x %==% yi)
  Reduce(`|`, conds)
}

filter(df, ! FALSE %in.% list(x, y, z))
#> # A tibble: 3 × 3
#>   x     y     z    
#>   <lgl> <lgl> <lgl>
#> 1 TRUE  NA    TRUE 
#> 2 NA    NA    TRUE 
#> 3 NA    NA    NA

Created on 2023-04-29 with reprex v2.0.2