tidyverse / funs

Collection of low-level functions for working with vctrs

Option to coalesce by column with data frames? #48

Status: Closed (DavisVaughan closed this 2 years ago)

DavisVaughan commented 4 years ago

vctrs treats a data frame row as missing only when every column in that row is missing, so coalesce() on data frames might not do what you expect. In the example below, only the row where all values are missing is updated. It might be nice to have a way to update each column separately.

You could map2() over the columns of the data frames, but that requires that you've already cast them to the same data frame type, and I don't think it generalizes nicely to more than 2 data frames.

We may need separate notions of vec_coalesce() and df_coalesce() to cover this new case.

# devtools::install_github("r-lib/funs")
library(funs)

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

df1
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3 NA NA

coalesce(df1, df2)
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3  2  2

Created on 2020-04-24 by the reprex package (v0.3.0)
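
The map2() idea mentioned above can be extended to more than two data frames by folding a binary, column-wise step over the inputs. The following is only a rough base R sketch: `df_coalesce_cols()` is a hypothetical name, not a function in funs, vctrs, or dplyr, and it assumes all inputs already share the same column names, column types, and number of rows (it does none of the common-type casting that vctrs would provide).

```r
# Hypothetical helper (not part of funs/vctrs/dplyr): coalesce column by column
# across any number of data frames with identical columns and row counts.
df_coalesce_cols <- function(...) {
  dfs <- list(...)
  stopifnot(length(dfs) >= 1)

  # Fill the missing elements of vector `a` from the same positions in `b`.
  fill <- function(a, b) {
    miss <- is.na(a)
    a[miss] <- b[miss]
    a
  }

  # Binary step: apply `fill()` to each pair of matching columns.
  coalesce2 <- function(x, y) {
    stopifnot(identical(names(x), names(y)), nrow(x) == nrow(y))
    x[] <- Map(fill, x, y)
    x
  }

  # Fold the binary step over all inputs, left to right.
  Reduce(coalesce2, dfs)
}

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

res <- df_coalesce_cols(df1, df2)
res
```

With these inputs every NA in df1 is filled from df2, column by column, rather than only the fully missing third row.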

Inspired by https://github.com/tidyverse/dplyr/pull/5142/files#diff-3680f0191de36a0e61d4b24cdb1ab150R149

rows_patch.data.frame <- function(x, y, by = NULL, ..., copy = FALSE, inplace = NULL) {
  y <- auto_copy(x, y, copy = copy)
  y_key <- df_key(y, by)
  x_key <- df_key(x, names(y_key))

  idx <- vctrs::vec_match(y[y_key], x[x_key])
  # FIXME: Check key in x? https://github.com/r-lib/vctrs/issues/1032

  # FIXME: Do we need vec_coalesce()
  new_data <- map2(x[idx, names(y)], y, coalesce)

  x[idx, names(y)] <- new_data
  x
}

lionel- commented 4 years ago

Also tackled in https://github.com/tidyverse/dplyr/pull/5334

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

dplyr::coalesce(df1, df2)
#>   x y
#> 1 2 1
#> 2 1 2
#> 3 2 2

funs::coalesce(df1, df2)
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3  2  2
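
The difference between the two results above comes down to what counts as "missing". As a rough illustration in base R (an emulation of the row-wise behaviour, not how funs implements it; `coalesce_rows()` is a hypothetical name), a row is replaced only when every column in it is NA:

```r
# Hypothetical base R emulation of row-wise coalescing, for illustration only.
# A row of `x` is replaced by the matching row of `y` only when every column
# in that row of `x` is NA. Assumes `x` and `y` have the same columns and rows.
coalesce_rows <- function(x, y) {
  stopifnot(identical(names(x), names(y)), nrow(x) == nrow(y))
  all_missing <- rowSums(is.na(x)) == ncol(x)
  if (any(all_missing)) {
    x[all_missing, ] <- y[all_missing, ]
  }
  x
}

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

res <- coalesce_rows(df1, df2)
res
```

Only the third row of df1 is fully missing, so it is the only one taken from df2; the partially missing first and second rows are left untouched.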

I'm tempted to offer a direction argument in general, whenever the semantics are useful both across rows and across columns. But in this case, a potentially better way to tackle this is the "complete-cases" viewpoint, which might be more consistent. Currently the row-coalescence behaviour is a bit off because the target row must be completely missing, but the source row may itself be partially missing:

df1 <- data.frame(x = c(NA, 1, NA), y = c(NA, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))
df3 <- data.frame(x = c(NA, 3, 3), y = c(3, 3, 3))

# Only fully missing rows are coalesced
funs::coalesce(df1, df2)
#>   x  y
#> 1 2  2
#> 2 1 NA
#> 3 2  2

# But we allow partially missing coalescence
funs::coalesce(df1, df3)
#>    x  y
#> 1 NA  3
#> 2  1 NA
#> 3  3  3

# Once partially filled out, no more coalescence is possible
funs::coalesce(df1, df3, df2)
#>    x  y
#> 1 NA  3
#> 2  1 NA
#> 3  3  3

Davis will add a complete-cases predicate to vctrs, but how do we slice-coalesce the values? Maybe we need a binary vec_coalesce() operation?
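
One possible reading of the "complete-cases" viewpoint, sketched in base R with `complete.cases()` standing in for the planned vctrs predicate: a target row is replaced whenever it is incomplete, and a source row is used only when it is complete. Both the interpretation and the name `coalesce_complete()` are assumptions made for illustration, not a settled design; note that under this reading any partial values in an incomplete target row are discarded.

```r
# Hypothetical sketch of a "complete-cases" row coalesce (an assumption, not
# the funs or vctrs implementation). The same predicate is applied to both
# sides: incomplete target rows are replaced, but only by complete source rows.
coalesce_complete <- function(...) {
  dfs <- list(...)
  stopifnot(length(dfs) >= 1)

  # Binary step, in the spirit of a two-argument vec_coalesce().
  step <- function(x, y) {
    stopifnot(identical(names(x), names(y)), nrow(x) == nrow(y))
    take <- !complete.cases(x) & complete.cases(y)
    if (any(take)) {
      x[take, ] <- y[take, ]
    }
    x
  }

  Reduce(step, dfs)
}

df1 <- data.frame(x = c(NA, 1, NA), y = c(NA, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))
df3 <- data.frame(x = c(NA, 3, 3), y = c(3, 3, 3))

res <- coalesce_complete(df1, df3, df2)
res
```

Here the incomplete first row of df3 is skipped, so the first row of the result comes from df2, while rows two and three come from df3. The partially filled second row of df1 is overwritten as a whole, which is exactly the trade-off the open question about slice-coalescing the values is pointing at.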