tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

Column-wise `replace_na` #359

Closed l-d-s closed 6 years ago

l-d-s commented 7 years ago

Suggestion: replace_na should be column-wise, have a column-wise mode, or exist in a column-wise version somewhere in the tidyverse.

cderv commented 7 years ago

Can you provide an example of what you expect? I am curious, and it isn't obvious to me what you have in mind. Thank you.

l-d-s commented 7 years ago

Sometimes one wants to replace NAs in a single variable.

df %>%
    mutate(y = replace_na(x, 0))

Maybe such a command would properly belong in dplyr.

Alternatively, using NSE one might have

df %>%
    replace_na(x = 0, y = "unknown")

or similar instead of

df %>%
    replace_na(list(x = 0, y = "unknown"))

though maybe there are tradeoffs here I'm not aware of.

yutannihilation commented 7 years ago

For a single variable, maybe dplyr::coalesce()?

dplyr::coalesce(c(1, 2, NA), 0)
#> [1] 1 2 0
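
For the single-variable case asked about earlier in the thread, the coalesce() suggestion can be used inside mutate(). This is only a minimal sketch: the data frame `df` and its columns `x` and `y` are made up for illustration, and each replacement is chosen to match its column's type.

```r
library(dplyr)

# Illustrative data (not from the thread): one numeric and one character column.
df <- tibble(x = c(1, NA, 3), y = c("a", NA, "c"))

# Fill NAs one column at a time; other columns are left untouched.
out <- df %>%
  mutate(
    x = coalesce(x, 0),
    y = coalesce(y, "unknown")
  )
out
```

Because coalesce() works on vectors, it covers the "replace NAs in a single variable" case without needing a list argument.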

cderv commented 7 years ago

We could come up with a new replace function that understands non-list arguments using tidyeval.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(A = c(1L, NA_integer_, 3L, NA_integer_, 5L), 
                 b = c("a", "b", NA_character_, "c", NA_character_),
                 c = c(NA, 3, 3.5, NA, 87)
)

tidyr::replace_na already has a ... argument available. It could be used for your use case.

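replace_na_new <- function(data, replace = list(), ...){
  stopifnot(rlang::is_list(replace))
  # merge the named arguments passed in ... into the replace list
  replace <- rlang::modify(replace, ...)
  for (var in names(replace)) {
    # fill the NA entries of each named column with its replacement value
    data[[var]][rlang::are_na(data[[var]])] <- replace[[var]]
  }
  data
}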

I use rlang::modify for testing but I think it is not a good choice.

Here are some examples:

# just replace for one column
df %>%
  replace_na_new(A = 99999L)
#> # A tibble: 5 x 3
#>       A     b     c
#>   <int> <chr> <dbl>
#> 1     1     a    NA
#> 2 99999     b   3.0
#> 3     3  <NA>   3.5
#> 4 99999     c    NA
#> 5     5  <NA>  87.0

# replace two columns
df %>%
  replace_na_new(A = 99999L, b = "unknown")
#> # A tibble: 5 x 3
#>       A       b     c
#>   <int>   <chr> <dbl>
#> 1     1       a    NA
#> 2 99999       b   3.0
#> 3     3 unknown   3.5
#> 4 99999       c    NA
#> 5     5 unknown  87.0

# replace three columns
df %>%
  replace_na_new(replace = list(A = 99999L), b = "unknown", c = 0)
#> # A tibble: 5 x 3
#>       A       b     c
#>   <int>   <chr> <dbl>
#> 1     1       a   0.0
#> 2 99999       b   3.0
#> 3     3 unknown   3.5
#> 4 99999       c   0.0
#> 5     5 unknown  87.0

The principle here: if an argument is given more than once, the later value replaces the earlier one.

df %>%
  replace_na_new(replace = list(A = 99999L), b = "unknown", A = 0L)
#> # A tibble: 5 x 3
#>       A       b     c
#>   <int>   <chr> <dbl>
#> 1     1       a    NA
#> 2     0       b   3.0
#> 3     3 unknown   3.5
#> 4     0       c    NA
#> 5     5 unknown  87.0

We should probably choose another approach with more checking. I used rlang::modify for simplicity in this example. Moreover, it no longer works with df %>% replace_na(df), so it is certainly not the definitive solution.
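
As a rough idea of what "more checking" could look like, here is a base R sketch. The name `replace_na_checked` and every check in it are assumptions made for illustration, not anything proposed by the tidyr team. It merges `replace` and `...` with utils::modifyList() instead of rlang::modify, so a value given in `...` still overrides the same name in `replace`. It does not check that a replacement has the same type as its column.

```r
# Hypothetical helper, not part of tidyr: fill NAs column by column,
# validating the replacements before touching the data.
replace_na_checked <- function(data, replace = list(), ...) {
  stopifnot(is.data.frame(data), is.list(replace))
  dots <- list(...)
  if (length(dots) > 0 && (is.null(names(dots)) || any(names(dots) == ""))) {
    stop("Every replacement in `...` must be named after a column.", call. = FALSE)
  }
  # Values passed through `...` override entries of `replace`.
  replace <- utils::modifyList(replace, dots)
  vars <- names(replace)
  if (length(replace) > 0 && (is.null(vars) || any(vars == ""))) {
    stop("Every element of `replace` must be named after a column.", call. = FALSE)
  }
  unknown <- setdiff(vars, names(data))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  for (var in vars) {
    value <- replace[[var]]
    if (length(value) != 1) {
      stop("Replacement for `", var, "` must have length 1.", call. = FALSE)
    }
    data[[var]][is.na(data[[var]])] <- value
  }
  data
}

# Same shape as the example data in this thread, built with base R.
df_base <- data.frame(
  A = c(1L, NA, 3L, NA, 5L),
  b = c("a", "b", NA, "c", NA),
  c = c(NA, 3, 3.5, NA, 87),
  stringsAsFactors = FALSE
)

out <- replace_na_checked(df_base, replace = list(A = 99999L), b = "unknown", A = 0L)
out
```

A misspelled column name, such as `replace_na_checked(df_base, z = 1)`, stops with an error rather than silently adding a new column.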

You could also include this type of function in your own script or package, using rlang on top of tidyr::replace_na. For example, accepting only non-list arguments:

replace_na_new <- function(data, ...){
  replace_dots <- rlang::dots_list(...)
  tidyr::replace_na(data, replace = replace_dots)
}

df %>%
  replace_na_new(A = 99999L, b = "unknown")
#> # A tibble: 5 x 3
#>       A       b     c
#>   <int>   <chr> <dbl>
#> 1     1       a    NA
#> 2 99999       b   3.0
#> 3     3 unknown   3.5
#> 4 99999       c    NA
#> 5     5 unknown  87.0
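
Since the wrapper collects its arguments with rlang::dots_list(), it should also accept a pre-built list spliced in with `!!!`. The sketch below repeats the wrapper so that it runs on its own; the `defaults` list and the small `df_small` tibble are made up for illustration.

```r
library(dplyr)

# Wrapper from the comment above, repeated so this block is self-contained.
replace_na_new <- function(data, ...) {
  replace_dots <- rlang::dots_list(...)
  tidyr::replace_na(data, replace = replace_dots)
}

# Illustrative data and defaults (not from the thread).
df_small <- tibble(A = c(1L, NA, 3L), b = c("a", NA, "c"))
defaults <- list(A = 0L, b = "unknown")

# Splice the stored defaults into the call instead of typing them out.
out <- df_small %>%
  replace_na_new(!!!defaults)
out
```

This keeps the named-argument interface for interactive use while still allowing replacements to be stored in a list and reused.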

I could try working on a PR for a more durable solution based on this idea, but I don't know the team's plans for this one, and rlang is not easy to dig into. We'll see what the tidyverse team has to say about this idea.

jennybc commented 7 years ago

I'm not sure what will become of the vctrs package/repo, but its issues are currently a parking place for many to-dos, one of which is basically this issue:

https://github.com/hadley/vctrs/issues/19

l-d-s commented 7 years ago

@jennybc Great. Makes sense that it be dealt with in vctrs then. Thanks!