tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org
Other
101 stars 15 forks source link

Implement `fct_encode_one_hot()` #215

Closed EmilHvitfeldt closed 1 year ago

EmilHvitfeldt commented 1 year ago

This PR aim to start the conversation on creating a function that takes a factor and return the corresponding integer indicator matrix. I find that I want to do this many places in tidymodels without a good way to do it.

This proposed function

Benchmarking

This new function outperforms model.matrix() for all sizes of factors that I have tested

library(hardhat)

old_fun <- function(x) {
  res <- stats::model.matrix(~y-1, data = data.frame(y = x))
  attr(res, "assign") <- NULL
  attr(res, "contrasts") <- NULL
  res
}

create_factor <- function(n, n_levels) {
  factor(sample(seq_len(n_levels), n, replace = TRUE), levels = seq_len(n_levels))
}

res <- bench::press(
  n = 10 ^ seq_len(5),
  n_levels = 10 ^ seq_len(3),
  {
    fff <- create_factor(n, n_levels)
    bench::mark(
      new = unname(get_indicators(fff)),
      old = unname(old_fun(fff))
    )
  }
)

ggplot2::autoplot(res)

model.matrix() can't do factors with many levels


get_indicators(create_factor(10, 100000)) |> str()
#>  int [1:10, 1:100000] 0 0 0 0 0 0 0 0 0 0 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:100000] "1" "2" "3" "4" ...
old_fun(create_factor(10, 100000))
#> Error: vector memory exhausted (limit reached?)

model.matrix() also couldn't calculate a vector with `10^8 elements and 20 levels.

res <- bench::press(
  n = 10 ^ seq_len(7),
  n_levels = 20,
  {
    fff <- create_factor(n, n_levels)
    bench::mark(
      min_iterations = 10,
      new = unname(get_indicators(fff)),
      old = unname(old_fun(fff))
    )
  }
)

ggplot2::autoplot(res)

Created on 2022-10-19 with reprex v2.0.2

github-actions[bot] commented 1 year ago

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.