markfairbanks closed this issue 2 years ago
Would greatly appreciate this performance improvement and currently don't see any disadvantages in replacing %in%. Do you?
I can't think of any disadvantages either. The only (small) thing is #556 needs to be fixed as that issue currently works with base::'%in%'.
Actually - I forgot that data.table overrides the %in% operator when used inside their filter. So you can't actually use a custom %in% operator with data.table 🤦♂️
Notice how the message only prints when you change the name to %f_in%.
pacman::p_load(data.table, vctrs)
'%in%' <- function(x, y) {
  print("Using custom %in%")
  if (is.character(x) && is.character(y)) {
    x %chin% y
  } else {
    vec_in(x, y)
  }
}
df <- data.table(x = 1:3)
df[x %in% c(1, 2)]
#>        x
#>    <int>
#> 1:     1
#> 2:     2
'%f_in%' <- function(x, y) {
  print("Using custom %in%")
  if (is.character(x) && is.character(y)) {
    x %chin% y
  } else {
    vec_in(x, y)
  }
}
df[x %f_in% c(1, 2)]
#> [1] "Using custom %in%"
#>        x
#>    <int>
#> 1:     1
#> 2:     2
I'm not 100% sure this is worth exporting since data.table already does %in% optimization in the background. (Even though it'd be nice to use vec_in() on numerics.)
Another thought - it might still be worth exporting for use inside ifelse.() or case_when.() inside of mutate.().
Yes, that would have been my primary use case.
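A minimal sketch of that use case, under some assumptions: it reuses the %f_in% helper from the earlier comment (without the print call) and substitutes data.table's := and fifelse() for mutate.() and ifelse.(). It illustrates the idea of calling the custom operator outside the filter position and is not tidytable's exported operator. The column names and values are made up for illustration.

```r
library(data.table)
library(vctrs)

# Hypothetical helper, mirroring the %f_in% defined earlier in the thread
`%f_in%` <- function(x, y) {
  if (is.character(x) && is.character(y)) {
    x %chin% y      # data.table's fast character matching
  } else {
    vec_in(x, y)    # vctrs matching for other types
  }
}

# Illustrative data; not from the thread
df <- data.table(x = 1:3, y = c("a", "b", "c"))

# The operator is called inside a conditional in j, not in the filter (i)
df[, flag := fifelse(y %f_in% c("a", "c"), "keep", "drop")]
df[, small := x %f_in% c(1L, 2L)]
```

Because the operator has a custom name, the helper itself is what gets called here, which is the case where the character and numeric fast paths would matter.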
All set.
pacman::p_load(tidytable)
`%tidytable_in%` <- tidytable::`%in%`
`%base_in%` <- base::`%in%`
# Character ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- stringi::stri_rand_strings(1000, 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)
bench::mark(
  tidytable_in = lhs %tidytable_in% rhs,
  base_in = lhs %base_in% rhs
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidytable_in   4.93ms   5.25ms     186.     6.17KB       0
#> 2 base_in       10.94ms  12.04ms      78.5   15.64MB     150.
# Numeric ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
random <- round(runif(1000), 4)
lhs <- sample(random)
rhs <- sample(random, 1000000, TRUE)
bench::mark(
  tidytable_in = lhs %tidytable_in% rhs,
  base_in = lhs %base_in% rhs
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidytable_in   8.78ms   9.37ms     106.     11.8MB     62.0
#> 2 base_in       12.87ms  13.81ms      71.7    15.6MB     71.7
Character and numeric benchmarks: