This PR aims to start the conversation on creating a function that takes a factor and returns the corresponding integer indicator matrix. I find that I want to do this in many places in tidymodels, without a good way to do it.
This proposed function:
returns integers
doesn't rely on global options
returns sensible column names
works for factors that are too big for model.matrix()
doesn't work with NAs
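To make the properties listed above concrete, here is a minimal base R sketch of one way such an indicator matrix could be built. It is not the implementation proposed in this PR: `indicator_sketch()` is a hypothetical name used only for illustration, and erroring on missing values is one possible reading of "doesn't work with NAs".

```r
# Hypothetical helper for illustration only; not the function proposed in this PR.
indicator_sketch <- function(x) {
  stopifnot(is.factor(x))
  if (anyNA(x)) {
    stop("`x` must not contain missing values.")
  }
  lvls <- levels(x)
  n <- length(x)
  # Start from an all-zero integer matrix with one column per level,
  # named after the levels themselves.
  out <- matrix(0L, nrow = n, ncol = length(lvls), dimnames = list(NULL, lvls))
  # Matrix indexing: one (row, column) pair per observation.
  out[cbind(seq_len(n), as.integer(x))] <- 1L
  out
}

x <- factor(c("a", "c", "a"), levels = c("a", "b", "c"))
indicator_sketch(x)
```

The result is an integer matrix with exactly one 1 per row, and unused levels such as "b" still get their own all-zero column.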
Benchmarking
This new function outperforms model.matrix() for all sizes of factors that I have tested:
library(hardhat)

# Baseline: build the indicator matrix with model.matrix() and drop its extra attributes
old_fun <- function(x) {
  res <- stats::model.matrix(~ y - 1, data = data.frame(y = x))
  attr(res, "assign") <- NULL
  attr(res, "contrasts") <- NULL
  res
}

# Random factor of length n with n_levels levels
create_factor <- function(n, n_levels) {
  factor(sample(seq_len(n_levels), n, replace = TRUE), levels = seq_len(n_levels))
}

# Benchmark both approaches over a grid of lengths and level counts
res <- bench::press(
  n = 10 ^ seq_len(5),
  n_levels = 10 ^ seq_len(3),
  {
    fff <- create_factor(n, n_levels)
    bench::mark(
      new = unname(get_indicators(fff)),
      old = unname(old_fun(fff))
    )
  }
)

ggplot2::autoplot(res)
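As context for the comparison above, this small sketch shows what the model.matrix() baseline returns for a tiny factor. It uses only base R and the stats package; the toy data is made up for illustration. The prefixed column names are presumably why the benchmark calls unname() on both results before comparing them.

```r
x <- factor(c("a", "c", "a"), levels = c("a", "b", "c"))
mm <- stats::model.matrix(~ y - 1, data = data.frame(y = x))

# model.matrix() returns a double matrix, not an integer one
typeof(mm)

# Column names are the variable name pasted onto each level
colnames(mm)

# With a missing value, the number of rows returned depends on the
# global `na.action` option (rows with NA are dropped under the
# default "na.omit" setting)
x_na <- factor(c("a", NA, "c"), levels = c("a", "b", "c"))
nrow(stats::model.matrix(~ y - 1, data = data.frame(y = x_na)))
```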
model.matrix() can't handle factors with many levels.
model.matrix() also couldn't process a vector with 10^8 elements and 20 levels.
Created on 2022-10-19 with reprex v2.0.2