tidymodels / hardhat

Construct Modeling Packages
https://hardhat.tidymodels.org

mold.recipe() turns numeric predictor variables of a tibble into integer in $blueprint$ptypes #171

Closed cregouby closed 2 years ago

cregouby commented 2 years ago

The problem

I'm having trouble with the mold.recipe() output: $blueprint$ptypes$predictors records the numeric (double) columns of a tibble as integer. Note that for the same dataset, the mold.data.frame() output $blueprint$ptypes$predictors correctly records them as double.

Impact

Using those ptypes to validate the dataset between model pretraining and model training makes the validation fail, as in https://github.com/mlverse/tabnet/issues/66.

Reproducible example

suppressPackageStartupMessages(library(tidymodels))
data("lending_club", package = "modeldata")

rec_unsup <- recipe(Class ~ ., lending_club) %>% step_normalize(all_numeric())
prep_unsup <- rec_unsup %>% prep()

# data.frame gives double class for numeric vectors
unsupervised_baked_df <- prep_unsup %>% bake(new_data=NULL) 
processed.data.frame <- hardhat::mold(x=unsupervised_baked_df[,-23], y=unsupervised_baked_df[,23])
processed.data.frame$blueprint$ptypes
#> $predictors
#> # A tibble: 0 × 22
#> # … with 22 variables: funded_amnt <dbl>, term <fct>, int_rate <dbl>,
#> #   sub_grade <fct>, addr_state <fct>, verification_status <fct>,
#> #   annual_inc <dbl>, emp_length <fct>, delinq_2yrs <dbl>,
#> #   inq_last_6mths <dbl>, revol_util <dbl>, acc_now_delinq <dbl>,
#> #   open_il_6m <dbl>, open_il_12m <dbl>, open_il_24m <dbl>, total_bal_il <dbl>,
#> #   all_util <dbl>, inq_fi <dbl>, inq_last_12m <dbl>, delinq_amnt <dbl>,
#> #   num_il_tl <dbl>, total_il_high_credit_limit <dbl>
#> 
#> $outcomes
#> # A tibble: 0 × 1
#> # … with 1 variable: Class <fct>

# unprepared recipe gives integer class for numeric vectors
unprepared.recipe <- hardhat::mold(rec_unsup, lending_club)
unprepared.recipe$blueprint$ptypes
#> $predictors
#> # A tibble: 0 × 22
#> # … with 22 variables: funded_amnt <int>, term <fct>, int_rate <dbl>,
#> #   sub_grade <fct>, addr_state <fct>, verification_status <fct>,
#> #   annual_inc <dbl>, emp_length <fct>, delinq_2yrs <int>,
#> #   inq_last_6mths <int>, revol_util <dbl>, acc_now_delinq <int>,
#> #   open_il_6m <int>, open_il_12m <int>, open_il_24m <int>, total_bal_il <int>,
#> #   all_util <int>, inq_fi <int>, inq_last_12m <int>, delinq_amnt <int>,
#> #   num_il_tl <int>, total_il_high_credit_limit <int>
#> 
#> $outcomes
#> # A tibble: 0 × 1
#> # … with 1 variable: Class <fct>
waldo::compare(unprepared.recipe$blueprint$ptypes, processed.data.frame$blueprint$ptypes)
#> `old$predictors$funded_amnt` is an integer vector ()
#> `new$predictors$funded_amnt` is a double vector ()
#> 
#> `old$predictors$delinq_2yrs` is an integer vector ()
#> `new$predictors$delinq_2yrs` is a double vector ()
#> 
#> `old$predictors$inq_last_6mths` is an integer vector ()
#> `new$predictors$inq_last_6mths` is a double vector ()
#> 
#> `old$predictors$acc_now_delinq` is an integer vector ()
#> `new$predictors$acc_now_delinq` is a double vector ()
#> 
#> `old$predictors$open_il_6m` is an integer vector ()
#> `new$predictors$open_il_6m` is a double vector ()
#> 
#> `old$predictors$open_il_12m` is an integer vector ()
#> `new$predictors$open_il_12m` is a double vector ()
#> 
#> `old$predictors$open_il_24m` is an integer vector ()
#> `new$predictors$open_il_24m` is a double vector ()
#> 
#> `old$predictors$total_bal_il` is an integer vector ()
#> `new$predictors$total_bal_il` is a double vector ()
#> 
#> `old$predictors$all_util` is an integer vector ()
#> `new$predictors$all_util` is a double vector ()
#> 
#> `old$predictors$inq_fi` is an integer vector ()
#> `new$predictors$inq_fi` is a double vector ()
#> 
#> And 4 more differences ...

# prepared recipe also gives integer class
prepared.recipe <- hardhat::mold(prep_unsup, lending_club)
prepared.recipe$blueprint$ptypes
#> $predictors
#> # A tibble: 0 × 22
#> # … with 22 variables: funded_amnt <int>, term <fct>, int_rate <dbl>,
#> #   sub_grade <fct>, addr_state <fct>, verification_status <fct>,
#> #   annual_inc <dbl>, emp_length <fct>, delinq_2yrs <int>,
#> #   inq_last_6mths <int>, revol_util <dbl>, acc_now_delinq <int>,
#> #   open_il_6m <int>, open_il_12m <int>, open_il_24m <int>, total_bal_il <int>,
#> #   all_util <int>, inq_fi <int>, inq_last_12m <int>, delinq_amnt <int>,
#> #   num_il_tl <int>, total_il_high_credit_limit <int>
#> 
#> $outcomes
#> # A tibble: 0 × 1
#> # … with 1 variable: Class <fct>

waldo::compare(prepared.recipe$blueprint$ptypes, processed.data.frame$blueprint$ptypes)
#> `old$predictors$funded_amnt` is an integer vector ()
#> `new$predictors$funded_amnt` is a double vector ()
#> 
#> `old$predictors$delinq_2yrs` is an integer vector ()
#> `new$predictors$delinq_2yrs` is a double vector ()
#> 
#> `old$predictors$inq_last_6mths` is an integer vector ()
#> `new$predictors$inq_last_6mths` is a double vector ()
#> 
#> `old$predictors$acc_now_delinq` is an integer vector ()
#> `new$predictors$acc_now_delinq` is a double vector ()
#> 
#> `old$predictors$open_il_6m` is an integer vector ()
#> `new$predictors$open_il_6m` is a double vector ()
#> 
#> `old$predictors$open_il_12m` is an integer vector ()
#> `new$predictors$open_il_12m` is a double vector ()
#> 
#> `old$predictors$open_il_24m` is an integer vector ()
#> `new$predictors$open_il_24m` is a double vector ()
#> 
#> `old$predictors$total_bal_il` is an integer vector ()
#> `new$predictors$total_bal_il` is a double vector ()
#> 
#> `old$predictors$all_util` is an integer vector ()
#> `new$predictors$all_util` is a double vector ()
#> 
#> `old$predictors$inq_fi` is an integer vector ()
#> `new$predictors$inq_fi` is a double vector ()
#> 
#> And 4 more differences ...

Created on 2021-10-22 by the reprex package (v2.0.1)

DavisVaughan commented 2 years ago

I think mold() is doing the right thing. The ptypes argument in question must correspond to the ptype of the original data supplied to mold(). It is doing that for both cases.

suppressPackageStartupMessages(library(tidymodels))
data("lending_club", package = "modeldata")

rec_unsup <- recipe(Class ~ ., lending_club) %>% step_normalize(all_numeric()) 
prep_unsup <- rec_unsup %>% prep()

#### mold.data.frame() method

unsupervised_baked_df <- prep_unsup %>% bake(new_data=NULL) 

# original data is numeric because of step_normalize() being applied already
class(unsupervised_baked_df$open_il_6m)
#> [1] "numeric"

processed.data.frame <- hardhat::mold(x=unsupervised_baked_df[,-23], y=unsupervised_baked_df[,23])

# ptype is consistent with that:
class(processed.data.frame$blueprint$ptypes$predictors$open_il_6m)
#> [1] "numeric"

#### mold.recipe() method

# original data is integer
class(lending_club$open_il_6m)
#> [1] "integer"

unprepared.recipe <- hardhat::mold(rec_unsup, lending_club)

# ptype is consistent with that:
class(unprepared.recipe$blueprint$ptypes$predictors$open_il_6m)
#> [1] "integer"

Created on 2021-10-25 by the reprex package (v2.0.1)
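The point above can be reduced to a toy case. The following is a minimal sketch, not part of the original reprex: the data frame `toy` and its columns are made up for illustration, and it assumes recent versions of recipes and hardhat are installed. It compares the class of the raw column, the class recorded in the blueprint ptype, and the class of the processed predictor.

```r
library(recipes)
library(hardhat)

# Hypothetical toy data: one integer predictor, one double outcome
toy <- data.frame(
  y = c(1.5, 2.5, 3.5, 4.5),
  x = c(1L, 2L, 3L, 4L)
)

rec <- recipe(y ~ x, data = toy) %>%
  step_normalize(all_numeric_predictors())

molded <- mold(rec, toy)

# Class of the column as supplied to mold()
class(toy$x)

# Class recorded in the blueprint ptype; per the explanation above,
# it describes the original data, not the processed data
class(molded$blueprint$ptypes$predictors$x)

# Class of the processed predictor after normalization
class(molded$predictors$x)
```

If the explanation above holds, the first two calls agree with each other (integer), and only the processed predictor is double.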

DavisVaughan commented 2 years ago

I think the problem in your original issue is that you are mixing the XY method of tabnet_pretrain() with the recipe method of tabnet_fit().

You should use either the recipe interface for both or the XY interface for both, but not a mix of the two.
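A sketch of the consistent recipe-interface route might look like the following. It is only an illustration under assumptions: it reuses the recipe from the reprex, it assumes tabnet and a working torch installation, and `epochs = 1` is an arbitrary value chosen to keep the run short, not a recommendation.

```r
library(tabnet)
library(recipes)
data("lending_club", package = "modeldata")

rec_unsup <- recipe(Class ~ ., lending_club) %>%
  step_normalize(all_numeric())

# Recipe interface for pretraining: the raw lending_club data is molded
# through the recipe, so the stored ptypes describe the raw data
pretrained <- tabnet_pretrain(rec_unsup, lending_club, epochs = 1)

# Recipe interface again for fitting, starting from the pretrained model;
# the same raw data goes through the same recipe, so the ptypes should match
fitted <- tabnet_fit(
  rec_unsup,
  lending_club,
  tabnet_model = pretrained,
  epochs = 1
)
```

The XY route would instead pass the already-baked predictors and outcome to both calls, so that both steps see the normalized (double) columns.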

cregouby commented 2 years ago

Hello @DavisVaughan. You are perfectly right. I had forgotten the obvious fact that normalization turns integers into doubles. Sorry for that.

And thanks for the hint on my own issue!
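For readers landing here later, the type change can be seen in base R alone. This is a minimal sketch with a made-up vector `x`; it centers and scales by hand, which is assumed to approximate what step_normalize() does, and does not call recipes at all.

```r
# Hypothetical integer vector
x <- c(1L, 2L, 3L, 4L)
typeof(x)
#> [1] "integer"

# Centering and scaling by hand: mean() and sd() return doubles,
# so the arithmetic promotes the result to double
x_norm <- (x - mean(x)) / sd(x)
typeof(x_norm)
#> [1] "double"
```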
