markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

vctrs::vec_fill_missing is faster/more performant than data.table::nafill for numeric #645

Closed tanho63 closed 2 years ago

tanho63 commented 2 years ago

For fill(): https://github.com/markfairbanks/tidytable/blob/91d2b26e6d11d7c70eefefe3f206ebfc77b12c04/R/fill.R#L65-L85

I think you could see a nice speed increase by simplifying to vec_fill_missing rather than the wrapper on data.table::nafill:

library(vctrs)
library(data.table)
fill_na <- function(x, direction) {
  if (is.numeric(x)) {
    if (direction %in% c("down", "up")) {
      type <- switch(direction, "down" = "locf", "up" = "nocb")

      nafill(x, type = type)
    } else {
      if (direction == "downup") {
        type1 <- "locf"
        type2 <- "nocb"
      } else {
        type1 <- "nocb"
        type2 <- "locf"
      }

      nafill(nafill(x, type = type1), type = type2)
    }
  } else {
    vec_fill_missing(x, direction = direction)
  }
}

bench::mark(
  fill_na = fill_na(c(NA,2,3,NA,5),direction = "downup"),
  nafill = nafill(c(NA,2,3,NA,5), type = "locf") |> nafill(type = "nocb"),   # test just the data.table part
  vctrs = vec_fill_missing(c(NA,2,3,NA,5), direction = "downup")
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fill_na      31.4µs   41.8µs    22787.   49.37KB     13.7
#> 2 nafill       29.1µs   39.2µs    24574.   32.17KB     14.8
#> 3 vctrs         1.4µs    1.6µs   530898.    2.78KB      0
markfairbanks commented 2 years ago

Hmm it looks like you're right.

There were two things that I thought might cause this. 1) Your data size was small. But on a larger dataset the performance is still better. 2) direction = "downup" was pulling the data back into R an extra time compared to the vctrs code which handled it internally. But even just with direction = "down" the performance of vec_fill_missing() is better. (And it's much faster with type = "downup")

library(vctrs)
fill_na <- tidytable:::fill_na

data_size <- 10000000
vec <- sample(c(NA,2,3,NA,5), data_size, TRUE)

bench::mark(
  fill_na = fill_na(vec, direction = "downup"),
  vctrs = vec_fill_missing(vec, direction = "downup")
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fill_na      67.1ms   67.1ms      14.9     153MB     89.4
#> 2 vctrs        25.9ms   27.1ms      36.2     153MB     42.2

bench::mark(
  fill_na = fill_na(vec, direction = "down"),
  vctrs = vec_fill_missing(vec, direction = "down")
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 fill_na      26.4ms   29.5ms      31.2    76.3MB     27.3
#> 2 vctrs        25.8ms     28ms      35.7   152.6MB     61.7
markfairbanks commented 2 years ago

Updated - thanks for catching this 😄

tanho63 commented 2 years ago

No problemo! I was trying to optimize something with data.table and was very confused about why nafill only handles numerics...so then I got into testing and comparing against your solution and was like, wait,,,,,,,