tidymodels / censored

Parsnip wrappers for survival models
https://censored.tidymodels.org/
Other
123 stars 12 forks source link

Does normalization of predictors work in glmnet with censored? #333

Closed jesusherranz closed 3 days ago

jesusherranz commented 1 month ago

Hello I have doubts that the normalization of the predictors is working with glmnet in the censored package. I have prepared 2 identical examples with different recipes, tuning the 2 parameters of an elastic net model. The first one, with the normalized variables, and the second one, without normalizing. Both cases, I get the same results from the optimization of the parameters, which seems strange to me.

library(tidymodels)
library(censored)
library(smoothHR)

whas500 <- whas500 %>%
  select(age, gender, hr, sysbp, diasbp, bmi, cvd, afb,
   sho, chf, av3, miord, mitype, lenfol, fstat) %>%
  mutate(surv_var = Surv(lenfol, fstat), .keep = "unused")

## resampling
set.seed(253)
cv_split <- vfold_cv(whas500, v = 10 )

## elastic net
enet_spec <- 
    proportional_hazards(penalty = tune(), mixture = tune()) %>%
    set_engine("glmnet") %>% 
    set_mode("censored regression") 
enet_spec

## grid
enet_grid <- grid_regular(penalty(), mixture(), 
                          levels = list(penalty = 20, mixture = 6) )  

## 1.- glmnet with Normalization
obj_rec_1 <-
  recipe( surv_var ~ . , data=whas500 ) %>%
  step_normalize(all_numeric_predictors())
obj_rec_1

wflow_1 <- workflow() %>%
  add_model(enet_spec) %>% 
  add_recipe(obj_rec_1)

tune_result_1 <- wflow_1 %>% 
  tune_grid( resamples = cv_split, 
             grid = enet_grid, 
             metrics = metric_set(concordance_survival)) 
show_best(tune_result_1, metric = "concordance_survival")

## 2.- glmnet without Normalization
obj_rec_2 <-
  recipe( surv_var ~ . , data=whas500 ) 
obj_rec_2

wflow_2 <- workflow() %>%
  add_model(enet_spec) %>% 
  add_recipe(obj_rec_2)

tune_result_2 <- wflow_2 %>% 
  tune_grid( resamples = cv_split, 
             grid = enet_grid, 
             metrics = metric_set(concordance_survival)) 
show_best(tune_result_2, metric = "concordance_survival")
EmilHvitfeldt commented 1 month ago

It appears to be working for me. Below is the result I get from fitting each of the models and using tidy() to pull out the estimates. They differ because one has applied normalization to the data before the model and the other hasn't.

it just appears that for this data set, the model performs close to identical regardless of whether normalization was applied or not

finalize_workflow(wflow_1, select_best(tune_result_1, metric = "concordance_survival")) |>
  fit(whas500) |>
  tidy()
#> # A tibble: 13 × 3
#>    term   estimate penalty
#>    <chr>     <dbl>   <dbl>
#>  1 age      0.681  0.00785
#>  2 gender  -0.111  0.00785
#>  3 hr       0.254  0.00785
#>  4 sysbp    0      0.00785
#>  5 diasbp  -0.216  0.00785
#>  6 bmi     -0.233  0.00785
#>  7 cvd      0      0.00785
#>  8 afb      0      0.00785
#>  9 sho      0.226  0.00785
#> 10 chf      0.330  0.00785
#> 11 av3      0.0287 0.00785
#> 12 miord    0.0252 0.00785
#> 13 mitype  -0.102  0.00785

finalize_workflow(wflow_2, select_best(tune_result_2, metric = "concordance_survival")) |>
  fit(whas500) |>
  tidy()
#> # A tibble: 13 × 3
#>    term   estimate penalty
#>    <chr>     <dbl>   <dbl>
#>  1 age      0.0470 0.00785
#>  2 gender  -0.226  0.00785
#>  3 hr       0.0108 0.00785
#>  4 sysbp    0      0.00785
#>  5 diasbp  -0.0100 0.00785
#>  6 bmi     -0.0430 0.00785
#>  7 cvd      0      0.00785
#>  8 afb      0      0.00785
#>  9 sho      1.10   0.00785
#> 10 chf      0.712  0.00785
#> 11 av3      0.195  0.00785
#> 12 miord    0.0531 0.00785
#> 13 mitype  -0.222  0.00785

Created on 2024-10-01 with reprex v2.1.0

jesusherranz commented 1 month ago

Hi Thank you very much for the quick response. Yes, it is true that it seems that at the end of the workflow and the model is adjusted with the optimal parameters, the models are different. However, I have re-executed my script with another well-known survival dataset, lung, and again I have the result returned by tune_grid() being the same whether normalized or not. Could it be that fit() is taking normalization into account and tune_grid() is not? Thank you very much

library(tidymodels)
library(censored)
library(smoothHR)

lung <- lung %>% drop_na() %>% 
    mutate(surv_var = Surv(time, status), .keep = "unused")

## resampling
set.seed(253)
cv_split <- vfold_cv(lung, v = 10 )

## elastic net
enet_spec <- 
    proportional_hazards(penalty = tune(), mixture = tune()) %>%
    set_engine("glmnet") %>% 
    set_mode("censored regression") 
enet_spec

## grid
enet_grid <- grid_regular(penalty(), mixture(), 
                          levels = list(penalty = 20, mixture = 6) )  

## 1.- glmnet with Normalization
obj_rec_1 <-
  recipe( surv_var ~ . , data=lung ) %>%
  step_normalize(all_numeric_predictors())
obj_rec_1

wflow_1 <- workflow() %>%
  add_model(enet_spec) %>% 
  add_recipe(obj_rec_1)

tune_result_1 <- wflow_1 %>% 
  tune_grid( resamples = cv_split, 
             grid = enet_grid, 
             metrics = metric_set(concordance_survival)) 
show_best(tune_result_1, metric = "concordance_survival")

## 2.- glmnet without Normalization
obj_rec_2 <-
  recipe( surv_var ~ . , data=lung ) 
obj_rec_2

wflow_2 <- workflow() %>%
  add_model(enet_spec) %>% 
  add_recipe(obj_rec_2)

tune_result_2 <- wflow_2 %>% 
  tune_grid( resamples = cv_split, 
             grid = enet_grid, 
             metrics = metric_set(concordance_survival)) 
show_best(tune_result_2, metric = "concordance_survival")
jesusherranz commented 1 month ago

Hi @EmilHvitfeldt I don't know if you've finally seen something about this problem. I think it's a bug in the normalization in the recipe within the process of tuning parameters with glmnet

jesusherranz commented 1 month ago

Could it be that tidymodels is always using the glmnet parameter standardize=TRUE, which is the default? If this were the case, there would be no difference in standardizing before the variables or not. It would also be necessary to see if this is a problem with censored data only, or also for classification and regression data.

hfrick commented 1 month ago

@jesusherranz Thanks for peeling back some layers here, going from tuning to fitting a single model. Unless documented otherwise, parsnip and censored are using glmnet defaults, including standardize = TRUE. If you want to change that, you can set glmnet arguments in set_engine() like this:

library(tidymodels)
library(censored)
#> Loading required package: survival

lung <- na.omit(lung) %>% 
  mutate(surv = Surv(time, status), .keep = "unused")

spec <- proportional_hazards(penalty = 0.1) %>% 
  set_engine("glmnet")
spec_no_standardize <- proportional_hazards(penalty = 0.1) %>% 
  set_engine("glmnet", standardize = FALSE)

set.seed(1)
fit <- spec %>% fit(surv ~ ., data = lung)
set.seed(1)
fit_no_standardize <- spec_no_standardize %>% fit(surv ~ ., data = lung)

tidy(fit)
#> [...]
#> # A tibble: 8 × 3
#>   term      estimate penalty
#>   <chr>        <dbl>   <dbl>
#> 1 inst       0           0.1
#> 2 age        0           0.1
#> 3 sex       -0.192       0.1
#> 4 ph.ecog    0.235       0.1
#> 5 ph.karno   0           0.1
#> 6 pat.karno -0.00212     0.1
#> 7 meal.cal   0           0.1
#> 8 wt.loss    0           0.1
tidy(fit_no_standardize)
#> # A tibble: 8 × 3
#>   term        estimate penalty
#>   <chr>          <dbl>   <dbl>
#> 1 inst      -0.0133        0.1
#> 2 age        0.0109        0.1
#> 3 sex        0             0.1
#> 4 ph.ecog    0             0.1
#> 5 ph.karno  -0.00154       0.1
#> 6 pat.karno -0.0185        0.1
#> 7 meal.cal   0.0000621     0.1
#> 8 wt.loss   -0.00624       0.1

Created on 2024-10-23 with reprex v2.1.0

jesusherranz commented 1 month ago

Hi, Hannah @hfrick Everything is clearer now Thank you very much