tidymodels / butcher

Reduce the size of model objects saved to disk
https://butcher.tidymodels.org/

Implement `axe_fitted.recipe` #207

Closed AshesITR closed 2 years ago

AshesITR commented 2 years ago

The method should remove x$template, which holds the prepped data of the training set.

reprex stolen and adapted from tidymodels/recipes#859

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(butcher)

rec <- recipe(data = dplyr::bind_rows(rep(list(iris), 10)), formula = Species ~ .) %>%
  step_normalize(starts_with("Petal.")) %>%
  step_BoxCox(starts_with("Sepal."))

rec_prepped <- prep(rec)

lobstr::obj_size(rec_prepped)
#> 128,656 B
lobstr::obj_size(butcher(rec_prepped))
#> 70,320 B
lobstr::obj_size(butcher(prep(rec_prepped, retain = FALSE)))
#> 15,936 B

The proposed implementation is quite simple, if I'm not missing anything:

axe_fitted.recipe <- function(x, verbose = FALSE, ...) {
  old <- x
  x$template <- x$template[integer(), ]

  add_butcher_attributes(
    x,
    old,
    verbose = verbose
  )
}
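
To illustrate what the zero-row slice keeps and what it drops, here is a minimal base R sketch. It does not use recipes or butcher: `mock` is a hypothetical stand-in (a plain list with a `template` data frame), not a real prepped recipe, so the sizes it reports are not comparable to the reprex above.

```r
# Hypothetical stand-in for a prepped recipe: a list that carries a
# `template` data frame, similar in spirit to x$template.
mock <- list(
  template = do.call(rbind, rep(list(iris), 10)),
  steps = list()
)

# Same slicing idea as in the proposed method: keep columns, drop all rows.
trimmed <- mock
trimmed$template <- trimmed$template[integer(), ]

nrow(trimmed$template)                                   # 0
identical(names(trimmed$template), names(mock$template)) # TRUE
identical(levels(trimmed$template$Species),
          levels(mock$template$Species))                 # TRUE

# The trimmed object is smaller because the rows are gone.
object.size(trimmed) < object.size(mock)                 # TRUE
```

Column names, column types, and factor levels survive the slice; only the rows are discarded.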

juliasilge commented 2 years ago

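This seems like a really good idea for butchering a recipe; it is similar to replacing the template with vctrs::vec_ptype(template). With the template emptied, juice() returns zero rows but bake() still works on new data: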

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(concrete)

concrete <- 
  concrete %>% 
  group_by(across(-compressive_strength)) %>% 
  summarize(compressive_strength = mean(compressive_strength),
            .groups = "drop")

set.seed(1501)
concrete_split <- initial_split(concrete, strata = compressive_strength)
concrete_train <- training(concrete_split)
concrete_test  <- testing(concrete_split)

rec <- recipe(compressive_strength ~ ., data = concrete_train) %>%
  step_normalize(all_numeric_predictors()) %>% 
  step_poly(all_predictors()) %>% 
  step_interact(~ all_predictors():all_predictors())

prepped <- prep(rec)
bake(prepped, new_data = concrete_test)
#> # A tibble: 249 × 137
#>    compressive_strength cement_poly_1 cement_poly_2 blast_furnace_slag_poly_1
#>                   <dbl>         <dbl>         <dbl>                     <dbl>
#>  1                 4.57       -0.0632        0.0967                    0.0354
#>  2                 7.68       -0.0632        0.0967                    0.0354
#>  3                 7.72       -0.0609        0.0887                    0.0395
#>  4                20.6        -0.0609        0.0887                    0.0395
#>  5                 6.28       -0.0581        0.0794                    0.0440
#>  6                31.0        -0.0581        0.0794                    0.0440
#>  7                10.4        -0.0558        0.0716                    0.0487
#>  8                33.3        -0.0524        0.0611                    0.0584
#>  9                13.7        -0.0521        0.0600                    0.0556
#> 10                 7.51       -0.0511        0.0571                    0.0571
#> # … with 239 more rows, and 133 more variables:
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, age_poly_1 <dbl>,
#> #   age_poly_2 <dbl>, cement_poly_1_x_cement_poly_2 <dbl>, …

prepped$template <- prepped$template[integer(), ]
juice(prepped)
#> # A tibble: 0 × 137
#> # … with 137 variables: compressive_strength <dbl>, cement_poly_1 <dbl>,
#> #   cement_poly_2 <dbl>, blast_furnace_slag_poly_1 <dbl>,
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, …
bake(prepped, new_data = concrete_test)
#> # A tibble: 249 × 137
#>    compressive_strength cement_poly_1 cement_poly_2 blast_furnace_slag_poly_1
#>                   <dbl>         <dbl>         <dbl>                     <dbl>
#>  1                 4.57       -0.0632        0.0967                    0.0354
#>  2                 7.68       -0.0632        0.0967                    0.0354
#>  3                 7.72       -0.0609        0.0887                    0.0395
#>  4                20.6        -0.0609        0.0887                    0.0395
#>  5                 6.28       -0.0581        0.0794                    0.0440
#>  6                31.0        -0.0581        0.0794                    0.0440
#>  7                10.4        -0.0558        0.0716                    0.0487
#>  8                33.3        -0.0524        0.0611                    0.0584
#>  9                13.7        -0.0521        0.0600                    0.0556
#> 10                 7.51       -0.0511        0.0571                    0.0571
#> # … with 239 more rows, and 133 more variables:
#> #   blast_furnace_slag_poly_2 <dbl>, fly_ash_poly_1 <dbl>,
#> #   fly_ash_poly_2 <dbl>, water_poly_1 <dbl>, water_poly_2 <dbl>,
#> #   superplasticizer_poly_1 <dbl>, superplasticizer_poly_2 <dbl>,
#> #   coarse_aggregate_poly_1 <dbl>, coarse_aggregate_poly_2 <dbl>,
#> #   fine_aggregate_poly_1 <dbl>, fine_aggregate_poly_2 <dbl>, age_poly_1 <dbl>,
#> #   age_poly_2 <dbl>, cement_poly_1_x_cement_poly_2 <dbl>, …

Created on 2021-11-29 by the reprex package (v2.0.1)

juliasilge commented 2 years ago

@AshesITR would you be interested in contributing a PR to butcher to implement this method for a recipe? We have an article here with some advice on contributing to butcher, but as you have probably already discovered, the method would go in this file.

AshesITR commented 2 years ago

Sure, I'll make this a PR. Regarding df[integer(), ] vs. vctrs::vec_ptype(df): do you have an opinion on either of these alternatives?
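
For context on the two alternatives, here is a small sketch of one behavioural difference that is easy to overlook. It assumes nothing about how butcher ends up implementing the method. The vctrs comparison only runs if vctrs happens to be installed. A recipe's template is a tibble (see the `# A tibble` output above), and `[` on a tibble never drops to a vector, so the base data frame caveat below is shown only for completeness.

```r
# Base data frame with a single column: the default `drop` behaviour
# collapses the zero-row slice to a bare vector.
one_col <- data.frame(a = 1:3)
dropped <- one_col[integer(), ]                # integer(0), not a data frame
kept    <- one_col[integer(), , drop = FALSE]  # zero-row data frame

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE

# With more than one column, the plain slice stays a data frame and
# keeps column metadata such as factor levels.
sliced <- iris[integer(), ]
is.data.frame(sliced)                                    # TRUE
identical(levels(sliced$Species), levels(iris$Species))  # TRUE

# Optional comparison against the vctrs prototype, if vctrs is available.
if (requireNamespace("vctrs", quietly = TRUE)) {
  ptype <- vctrs::vec_ptype(iris)
  print(nrow(ptype))
  print(identical(names(ptype), names(sliced)))
}
```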

juliasilge commented 2 years ago

Do you have an opinion on that, @DavisVaughan? butcher doesn't currently import vctrs but does import tibble, which imports vctrs.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.