tidyverse / funs

Collection of low-level functions for working with vctrs

Revisit `dplyr::coalesce` with `across` #54

Open tmastny opened 4 years ago

tmastny commented 4 years ago

With dplyr 1.0.0 introducing c_across and across I was wondering if it was possible to revisit tidyverse/dplyr#3548, by allowing dplyr::coalesce to work more naturally with the new across or c_across functions.

After reading the row-wise article, I expected dplyr::coalesce to work like rowSums since it naturally works across rows, or at worst it would work like rowwise => sum.

However, coalesce doesn't seem to work with the across family at all, as you can see in the code below.

Would it be possible to make coalesce compatible with the new across workflow?

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

## Does coalesce work like rowSums, because
## they both naturally work across rows?
df %>%
  mutate(a = rowSums(across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work like rowSums
df %>%
  mutate(a = coalesce(across(-id)))
#> # A tibble: 5 x 6
#>      id     w     x     y     z   a$w    $x    $y    $z
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10    NA    NA    NA
#> 2     2    NA    21    NA    NA    NA    21    NA    NA
#> 3     3    NA    22    32    NA    NA    22    32    NA
#> 4     4    NA    23    33    43    NA    23    33    43
#> 5     5    14    NA    NA    44    14    NA    NA    44

## Maybe it works like sum, since coalesce's argument is `...`
df %>%
  rowwise() %>%
  mutate(a = sum(c_across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#> # Rowwise: 
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work with rowwise
df %>%
  rowwise() %>%
  mutate(a = coalesce(c_across(-id)))
#> Error: `mutate()` argument `a` must be recyclable.
#> ℹ `a` is `coalesce(c_across(-id))`.
#> ℹ The error occured in row 1.
#> x `a` can't be recycled to size 1.
#> ℹ `a` must be size 1, not 4.
#> ℹ Did you mean: `a = list(coalesce(c_across(-id)))` ?

## coalesce works if you write out each by hand,
## but that goes against the spirit of the new `across` family
df %>%
  mutate(a = coalesce(w, x, y, z))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

# There is a workaround suggested in tidyverse/dplyr#3548, but it's not very user-friendly
# and requires a different package
library(tidyselect)
df %>%
  mutate(a = coalesce(!!!syms(vars_select(names(.), -id))))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

Created on 2020-04-14 by the reprex package (v0.3.0)

hadley commented 4 years ago

This should work, but I can't immediately understand why it doesn't:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

df %>%
  mutate(a = coalesce(!!!across(-id)))
#> Error in .subset2(chunks, self$get_current_group()): attempt to select less than one element in integerOneIndex

Created on 2020-04-14 by the reprex package (v0.3.0)

romainfrancois commented 4 years ago

splicing happens "too early", but this works:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

coacross <- function(...) {
  coalesce(!!!across(...))
}

df %>%
  mutate(a = coacross(-id))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

Created on 2020-04-15 by the reprex package (v0.3.0)
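One way to read "too early", offered as a sketch and not as a statement about dplyr internals: `!!!` is processed when the surrounding call is captured, not when it is later evaluated. So in `mutate(a = coalesce(!!!across(-id)))` the `across()` call would run while `mutate()` is still capturing its arguments, before any columns are available. Inside the `coacross()` wrapper, the splice is only processed once `coalesce()` itself is called during evaluation. The example below uses `rlang::expr()` as a stand-in for the capture step; the symbols `w` and `x` are illustrative and are never evaluated.

```r
library(rlang)

# Illustrative column symbols; nothing named `w` or `x` needs to exist.
cols <- syms(c("w", "x"))

# `!!!` is processed while expr() captures the call, before any evaluation.
captured <- expr(coalesce(!!!cols))
print(captured)
#> coalesce(w, x)
```

The captured call already contains the spliced arguments, which is why whatever sits to the right of `!!!` has to be computable at capture time.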

romainfrancois commented 3 years ago

Feature request: coalesce working backwards, i.e. returning the last non-missing column. coalesce() returns the first non-missing value among the columns/vectors passed to it. However, there are use cases where the opposite would be helpful, i.e. returning the last non-missing value from several columns/vectors.

eutwt commented 3 years ago

In case anyone comes across this issue after googling, another workaround is to use do.call(coalesce, across(-id)), which is a little less typing than coalesce(!!!syms(vars_select(names(.), -id))) and needs no extra package.

If you want to do it in reverse you could just rev the input to coalesce, although that's probably inefficient.

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5,
  w = c(10, NA, NA, NA, 14),
  x = c(NA, 21, 22, 23, NA),
  y = c(NA, NA, 32, 33, NA),
  z = c(NA, NA, NA, 43, 44)
)

df %>%
  mutate(a = do.call(coalesce, across(-id)))
#> # A tibble: 5 × 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

df %>%
  mutate(a = do.call(coalesce, rev(across(-id))))
#> # A tibble: 5 × 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    32
#> 4     4    NA    23    33    43    43
#> 5     5    14    NA    NA    44    44

Created on 2021-08-04 by the reprex package (v2.0.0)

ericemc3 commented 2 years ago

What about:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

df %>%
  mutate(a = coalesce(!!!select(., -id)))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

moodymudskipper commented 2 years ago

Since we're revisiting coalesce() and I see some feature requests gathered here, what about overriding values other than NA?

The use case is data where missing or special values are encoded as 0, -1, Inf, NaN, "non available" etc.

We have na_if() but we need to use it on all coalesced columns, and might need to turn the NAs back to their special values afterwards. It would be handy if coalesce() handled it.
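A minimal sketch of what this could look like today, without any change to coalesce(). The helper name `coalesce_skipping()` and its `missing` argument are hypothetical and not part of dplyr. It uses base R's `%in%` in place of repeated na_if() calls so that several sentinel values can be handled in one pass. It does not turn the NAs back into their special values afterwards.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not in dplyr): treat every value listed in `missing`
# as NA, then coalesce the cleaned vectors from left to right.
coalesce_skipping <- function(..., missing) {
  cols <- list(...)
  cleaned <- lapply(cols, function(col) {
    col[col %in% missing] <- NA  # base R replacement instead of na_if()
    col
  })
  do.call(coalesce, cleaned)
}

# Here 0 and -1 are sentinel codes for "no value".
w <- c(0, 5, -1)
x <- c(3, -1, -1)
y <- c(9, 9, 9)

result <- coalesce_skipping(w, x, y, missing = c(0, -1))
print(result)
#> [1] 3 5 9
```

Because the sentinels are only replaced inside the helper, the original vectors keep their special values.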

jdonland commented 10 months ago

Wailing, gnashing my teeth, rending my clothing in the streets because coalesce(across(...)) still doesn't work.