markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

Export `%f_in%` (maybe just as `%in%`) #557

Closed (markfairbanks closed this issue 2 years ago)

markfairbanks commented 2 years ago

Character and numeric benchmarks:

pacman::p_load(tidytable)

# `%f_in%` is internal (not exported), so access it with `:::`
`%f_in%` <- tidytable:::`%f_in%`

# Character ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- stringi::stri_rand_strings(1000, 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)

bench::mark(
  f_in = lhs %f_in% rhs,
  base_in = lhs %in% rhs
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_in         4.72ms   5.52ms     162.     6.17KB       0 
#> 2 base_in     11.11ms  11.99ms      77.8   15.64MB     141.

# Numeric ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- round(runif(1000), 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)

bench::mark(
  f_in = lhs %f_in% rhs,
  base_in = lhs %in% rhs
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f_in         8.99ms   9.57ms     101.     11.8MB     68.7
#> 2 base_in     13.22ms  13.66ms      72.8    15.6MB    153.

marianschmidt commented 2 years ago

I would greatly appreciate this performance improvement, and I currently don't see any disadvantages to replacing %in%. Do you?

markfairbanks commented 2 years ago

I can't think of any disadvantages either. The only (small) thing is that #556 needs to be fixed, as that issue currently works with base::`%in%`.

markfairbanks commented 2 years ago

Actually - I forgot that data.table overrides the %in% operator when used inside their filter. So you can't actually use a custom %in% operator with data.table 🤦‍♂️

Notice how the message only prints when you change the name to %f_in%.

pacman::p_load(data.table, vctrs)

# Shadow base `%in%` with a custom version that prints a message when called
`%in%` <- function(x, y) {
  print("Using custom %in%")
  if (is.character(x) && is.character(y)) {
    x %chin% y
  } else {
    vec_in(x, y)
  }
}

df <- data.table(x = 1:3)

# No message is printed here: data.table handles `%in%` itself inside the filter
df[x %in% c(1, 2)]
#>        x
#>    <int>
#> 1:     1
#> 2:     2

# Same function under a different name, which data.table does not intercept
`%f_in%` <- function(x, y) {
  print("Using custom %in%")
  if (is.character(x) && is.character(y)) {
    x %chin% y
  } else {
    vec_in(x, y)
  }
}

df[x %f_in% c(1, 2)]
#> [1] "Using custom %in%"
#>        x
#>    <int>
#> 1:     1
#> 2:     2

I'm not 100% sure this is worth exporting since data.table already does %in% optimization in the background. (Even though it'd be nice to use vec_in() on numerics.)

markfairbanks commented 2 years ago

Another thought - it might still be worth exporting for use inside ifelse.() or case_when.() inside of mutate.()
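
For illustration, here is a minimal sketch of that use case. It assumes a recent tidytable version in which the verbs are named without the trailing dot (`mutate()`, `case_when()`) and in which `%in%` is exported, so attaching the package masks base `%in%`. The data and column names are made up for the example.

```r
library(tidytable)

# Hypothetical example data
df <- tidytable(x = c("a", "b", "c", "d"))

# With tidytable attached, `%in%` here resolves to tidytable::`%in%`
# (assumption: the operator is exported by the installed version)
result <- df %>%
  mutate(
    group = case_when(
      x %in% c("a", "b") ~ "first",
      TRUE ~ "other"
    )
  )

print(result)
```

On older versions, the same sketch would use `mutate.()` and `case_when.()` instead.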

marianschmidt commented 2 years ago

Yes, that would have been my primary use case.

markfairbanks commented 2 years ago

All set.

pacman::p_load(tidytable)

`%tidytable_in%` <- tidytable::`%in%`
`%base_in%` <- base::`%in%`

# Character ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- stringi::stri_rand_strings(1000, 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)

bench::mark(
  tidytable_in = lhs %tidytable_in% rhs,
  base_in = lhs %base_in% rhs
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidytable_in   4.93ms   5.25ms     186.     6.17KB       0 
#> 2 base_in       10.94ms  12.04ms      78.5   15.64MB     150.

# Numeric ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- round(runif(1000), 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)

bench::mark(
  tidytable_in = lhs %tidytable_in% rhs,
  base_in = lhs %base_in% rhs
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidytable_in   8.78ms   9.37ms     106.     11.8MB     62.0
#> 2 base_in       12.87ms  13.81ms      71.7    15.6MB     71.7