If we are going to create a fresh data mask at each group evaluation, then mask creation needs to be very fast.
This PR helps in two ways:
Reduce the initial data mask size from 100 down to 10. I don't think we need such a large initial size, since most of the time we don't use <- in eval_tidy() calls to poke bindings into the mask. The expressions we pass to eval_tidy() are usually function calls like eval_tidy(mean(x), mask) or something similar, so the mask env doesn't need to be this big.
Optimize r_alloc_environment() to use R_NewEnv() on R >= 4.1.0, which is a C-level version of new.env().
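As a rough illustration of why the smaller initial size should be safe, here is a minimal sketch using base R's new.env() rather than rlang's internal r_alloc_environment(). It assumes the size passed at creation is only an initial sizing hint for the environment, not a cap on how many bindings it can hold. The names small and n_bindings are made up for the example.

```r
# Create an environment with a small initial size, similar in spirit to
# shrinking the data mask from 100 to 10.
small <- new.env(size = 10L)

# Add more bindings than the initial size; this still works.
for (i in seq_len(50)) {
  assign(paste0("v", i), i, envir = small)
}

# Count the bindings that ended up in the environment.
n_bindings <- length(ls(small))
print(n_bindings)
```

So a mask that occasionally receives more than 10 assignments should still work; the smaller size only changes what is allocated up front.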
library(rlang)
env <- new_environment()
# CRAN rlang
bench::mark(new_data_mask(env), iterations = 1000000)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new_data_mask(env) 2.08µs 2.94µs 299002. 3.48KB 18.2
# Just 100->10
bench::mark(new_data_mask(env), iterations = 1000000)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new_data_mask(env) 2µs 2.75µs 337047. 4.3KB 20.2
# Both 100->10, and `R_NewEnv()`
bench::mark(new_data_mask(env), iterations = 1000000)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new_data_mask(env) 980ns 1.39µs 646986. 4.3KB 14.9
I am hopeful that this means C calls to new_data_mask() in dplyr will be in the nanosecond range.
We can probably also use this in vctrs if we wanted to.
This PR was motivated by this dplyr issue: https://github.com/tidyverse/dplyr/issues/6666
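To make the "fresh data mask at each group evaluation" pattern concrete, here is a minimal R-level sketch. It is not how dplyr does it (dplyr would do this from C); the groups list, the expr expression, and the results vector are hypothetical and exist only for illustration.

```r
library(rlang)

# Hypothetical grouped data: one list of columns per group.
groups <- list(
  a = list(x = c(1, 2, 3)),
  b = list(x = c(10, 20))
)

# The expression to evaluate in each group.
expr <- quote(mean(x))

# Build a fresh mask per group, so the cost of new_data_mask()
# is paid once for every group.
results <- vapply(groups, function(cols) {
  bottom <- new_environment(cols)  # holds this group's columns
  mask <- new_data_mask(bottom)    # fresh mask for this group
  eval_tidy(expr, mask)
}, numeric(1))

print(results)
```

With many small groups, the per-group new_data_mask() call becomes a significant part of the total cost, which is why shaving it down matters here.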