tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/

Vectorized operators with na.rm argument (psum(), pmult(), etc.) #968

Closed by garrettgman 7 years ago

garrettgman commented 9 years ago

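sum() and + leave a gap when working with NAs: + has no na.rm option, and sum() collapses all of its inputs into a single value instead of working row by row: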

library(dplyr)

a <- c(NA, 2, 3, NA)
b <- c(1, NA, 3, NA)
c <- c(1, 2, NA, NA)
nas <- tibble(a, b, c)  # data_frame() is deprecated in favour of tibble()
nas
##  a  b  c
## NA  1  1
##  2 NA  2
##  3  3 NA
## NA NA NA

# no na.rm option
nas %>% mutate(d = a + b + c)
##    a  b  c  d
## 1 NA  1  1 NA
## 2  2 NA  2 NA
## 3  3  3 NA NA
## 4 NA NA NA NA

# not what we want
nas %>% mutate(d = sum(a, b, c, na.rm = TRUE))
##    a  b  c  d
## 1 NA  1  1 12
## 2  2 NA  2 12
## 3  3  3 NA 12
## 4 NA NA NA 12

Something that works like this would be useful (I'm sure there's a better way to implement it):

psum <- function(..., na.rm = TRUE) {
  rowSums(as.data.frame(list(...)), na.rm = na.rm)
}
nas %>% mutate(d = psum(a, b, c))
##    a  b  c  d
## 1 NA  1  1  2
## 2  2 NA  2  4
## 3  3  3 NA  6
## 4 NA NA NA  0

Other useful functions alongside psum would be pprod, pmean, psd, pall, and pany.

# e.g., with a pall() that does not exist yet
nas %>% filter(pall(is.na(a), is.na(b), is.na(c)))
##    a  b  c
## 4 NA NA NA
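
The proposed family can be sketched in base R with one small function factory. This is only an illustration of the idea, not an existing dplyr or base R API: prow() and the helpers built from it are hypothetical names, and apply() over rows is chosen for brevity rather than speed.

```r
# Hypothetical factory: turn a summary function into a "parallel" one.
# The inputs are bound column-wise, then f is applied across each row.
prow <- function(f) {
  function(..., na.rm = TRUE) {
    m <- cbind(...)
    apply(m, 1, f, na.rm = na.rm)
  }
}

pprod <- prow(prod)
pmean <- prow(mean)
psd   <- prow(sd)
pall  <- prow(all)
pany  <- prow(any)

a <- c(NA, 2, 3, NA)
b <- c(1, NA, 3, NA)
c <- c(1, 2, NA, NA)

pprod(a, b, c)
#> [1] 1 4 9 1
pall(is.na(a), is.na(b), is.na(c))
#> [1] FALSE FALSE FALSE  TRUE
```

Note that a row with no non-missing values falls back to whatever the summary returns for empty input: 1 for prod(), NaN for mean().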

wdkrnls commented 9 years ago

Adding a pick (or perhaps more intelligibly pswitch) window function would also be helpful. Another feature that would make these window functions more useful is the ability to specify a range of columns in mutate, just as with select. In my chemistry-related data, people love to store compositional information by physical element, meaning that to get the total weight I need to sum 70-100 columns.
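
For reference, summing a range of columns can be sketched as follows, assuming dplyr 1.0.0 or later, where across() accepts select-style ranges inside mutate(). The comp data frame and its element columns are made up for illustration.

```r
library(dplyr)

# Hypothetical compositional data: one column per element.
comp <- tibble(
  sample = c("s1", "s2"),
  Fe = c(10, NA),
  Ni = c(2, 3),
  Cr = c(NA, 5)
)

# across(Fe:Cr) selects the column range; rowSums() adds it up row by row.
res <- comp %>%
  mutate(total = rowSums(across(Fe:Cr), na.rm = TRUE))

res$total
#> [1] 12  8
```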

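hadley commented 8 years ago

| Vector | Summary | Cumulative | Parallel | Matrix   |
|--------|---------|------------|----------|----------|
| +      | sum     | cumsum     |          | rowSums  |
| *      | prod    | cumprod    |          |          |
|        | min     | cummin     | pmin     |          |
|        | max     | cummax     | pmax     |          |
|        | mean    | cummean    |          | rowMeans |
| &      | all     | cumall     |          |          |
| \|     | any     | cumany     |          |          |

hadley commented 8 years ago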

See also http://adv-r.had.co.nz/Functionals.html#function-family
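
To make the function families above concrete, here is one member of each for addition, followed by a stand-in for the missing "parallel" variant. psum_reduce() is a hypothetical name, and unlike the psum() proposed at the top of the thread it has no na.rm handling.

```r
x <- c(3, 1, 2)
y <- c(2, 5, 1)

x + y                  # vector:     5 6 3
sum(x)                 # summary:    6
cumsum(x)              # cumulative: 3 4 6
pmin(x, y)             # parallel (for min): 2 1 1
rowSums(cbind(x, y))   # matrix:     5 6 3

# The parallel slot for `+` is empty in base R; folding `+` over the
# inputs is one way to fill it (NAs propagate as with plain `+`).
psum_reduce <- function(...) Reduce(`+`, list(...))
psum_reduce(x, y, x)
#> [1] 8 7 5
```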

lionel- commented 8 years ago

Could be a job for purrr:

library(purrr)

# A function operator to parallelise summary functions:
parallelise <- function(.f, ...) {
  f <- partial(.f, ...)

  function(.x, .type = NULL) {
    res <- pmap(.x, f)
    as_vector(res, .type)
  }
}

Example with NA detection:

# Creating summary function for is.na()
any_na <- partial(some, .p = is.na) %>% lift_ld()

# Parallelising any_na()
p_any_na <- parallelise(any_na)

Which gives:

any_na(1:3, 3, 5, NA)
#> TRUE

# Row-wise construction needs tribble() (from tibble, re-exported by dplyr)
df_na <- tribble(
  ~x, ~y,
  NA, 1,
  2,  NA,
  3,  3
)
p_any_na(df_na)
#> [1]  TRUE  TRUE FALSE
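
The same function operator can be sketched without purrr, which may help when reading the idea apart from the library helpers. This is an approximation under stated assumptions, not the code proposed above: parallelise_base() is a hypothetical name, Map() stands in for pmap(), and unlist() stands in for as_vector().

```r
# Hypothetical base R counterpart of parallelise().
parallelise_base <- function(.f, ...) {
  fixed <- list(...)  # arguments fixed up front, e.g. na.rm = TRUE

  function(.x) {
    cols <- unname(as.list(.x))
    # Called once per "row": one element from each column, plus the
    # fixed arguments.
    per_row <- function(...) do.call(.f, c(list(...), fixed))
    res <- do.call(Map, c(list(f = per_row), cols))
    unlist(res, use.names = FALSE)
  }
}

p_any_na_base <- parallelise_base(function(...) anyNA(c(...)))
p_any_na_base(data.frame(x = c(NA, 2, 3), y = c(1, NA, 3)))
#> [1]  TRUE  TRUE FALSE

psum_base <- parallelise_base(sum, na.rm = TRUE)
psum_base(list(c(NA, 2, 3, NA), c(1, NA, 3, NA), c(1, 2, NA, NA)))
#> [1] 2 4 6 0
```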

Or should the parallelised functions take dots instead of lists?

parallelise_d <- function(.f, ...) {
  f <- partial(.f, ...)

  function(..., .type = NULL) {
    res <- pmap(list(...), f)
    as_vector(res, .type)
  }
}

psum <- parallelise_d(sum, na.rm = TRUE)
nas %>% mutate(d = psum(a, b, c))
#> Source: local data frame [4 x 4]
#>
#>       a     b     c     d
#>   (dbl) (dbl) (dbl) (dbl)
#> 1    NA     1     1     2
#> 2     2    NA     2     4
#> 3     3     3    NA     6
#> 4    NA    NA    NA     0

The list versions would work with the cols() helper mentioned in #1367

lionel- commented 8 years ago

It may also need to handle empty inputs:

parallelise <- function(.f, ..., .empty = NULL) {
  f <- partial(.f, ...)
  force(.empty)

  function(.x, .type = NULL) {
    if (length(.x) == 0) {
      .empty
    } else {
      res <- pmap(.x, f)
      as_vector(res, .type)
    }
  }
}

p_any_na <- parallelise(any_na, .empty = FALSE)
psum <- parallelise(sum, na.rm = TRUE, .empty = 0)

And then scalar recycling etc.
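
Scalar recycling could look roughly like this: length-1 inputs are stretched to the common length, and anything else that does not match is rejected. Both function names are hypothetical, and the strict "length 1 or n" rule is an assumption rather than something settled in this thread.

```r
# Hypothetical helper: recycle length-1 inputs to the common length n.
recycle_scalars <- function(xs) {
  lens <- lengths(xs)
  n <- max(lens)
  bad <- lens != 1L & lens != n
  if (any(bad)) {
    stop("Inputs must have length 1 or ", n, call. = FALSE)
  }
  lapply(xs, rep_len, length.out = n)
}

psum_recycled <- function(..., na.rm = TRUE) {
  xs <- list(...)
  if (length(xs) == 0) return(0)  # mirrors .empty = 0 above
  cols <- recycle_scalars(xs)
  rowSums(do.call(cbind, cols), na.rm = na.rm)
}

psum_recycled(c(1, NA, 3), 10)
#> [1] 11 10 13
```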

hadley commented 8 years ago

@lionel- I was thinking these might need to be individually written in C++ for performance (but a standard matrix vectoriser would still be nice)
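
One possible reading of a "standard matrix vectoriser" is a wrapper that binds its arguments into a matrix and hands them to an existing row-wise function such as rowSums() or rowMeans(), which are already implemented in compiled code. The sketch below follows that interpretation only; vectorise_matrix() and the two helpers are hypothetical names.

```r
# Hypothetical wrapper: dots in, matrix function underneath.
vectorise_matrix <- function(.f) {
  function(..., na.rm = TRUE) {
    .f(cbind(...), na.rm = na.rm)
  }
}

psum_m  <- vectorise_matrix(rowSums)
pmean_m <- vectorise_matrix(rowMeans)

psum_m(c(NA, 2, 3, NA), c(1, NA, 3, NA), c(1, 2, NA, NA))
#> [1] 2 4 6 0
pmean_m(c(NA, 2, 3, NA), c(1, NA, 3, NA), c(1, 2, NA, NA))
#> [1]   1   2   3 NaN
```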

hadley commented 7 years ago

Moved to vctrs