Closed jesusherranz closed 3 days ago
It appears to be working for me. Below is the result I get from fitting each of the models and using tidy()
to pull out the estimates. They differ because one has applied normalization to the data before the model and the other hasn't.
it just appears that for this data set, the model performs close to identical regardless of whether normalization was applied or not
finalize_workflow(wflow_1, select_best(tune_result_1, metric = "concordance_survival")) |>
fit(whas500) |>
tidy()
#> # A tibble: 13 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 age 0.681 0.00785
#> 2 gender -0.111 0.00785
#> 3 hr 0.254 0.00785
#> 4 sysbp 0 0.00785
#> 5 diasbp -0.216 0.00785
#> 6 bmi -0.233 0.00785
#> 7 cvd 0 0.00785
#> 8 afb 0 0.00785
#> 9 sho 0.226 0.00785
#> 10 chf 0.330 0.00785
#> 11 av3 0.0287 0.00785
#> 12 miord 0.0252 0.00785
#> 13 mitype -0.102 0.00785
finalize_workflow(wflow_2, select_best(tune_result_2, metric = "concordance_survival")) |>
fit(whas500) |>
tidy()
#> # A tibble: 13 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 age 0.0470 0.00785
#> 2 gender -0.226 0.00785
#> 3 hr 0.0108 0.00785
#> 4 sysbp 0 0.00785
#> 5 diasbp -0.0100 0.00785
#> 6 bmi -0.0430 0.00785
#> 7 cvd 0 0.00785
#> 8 afb 0 0.00785
#> 9 sho 1.10 0.00785
#> 10 chf 0.712 0.00785
#> 11 av3 0.195 0.00785
#> 12 miord 0.0531 0.00785
#> 13 mitype -0.222 0.00785
Created on 2024-10-01 with reprex v2.1.0
Hi Thank you very much for the quick response. Yes, it is true that it seems that at the end of the workflow and the model is adjusted with the optimal parameters, the models are different. However, I have re-executed my script with another well-known survival dataset, lung, and again I have the result returned by tune_grid() being the same whether normalized or not. Could it be that fit() is taking normalization into account and tune_grid() is not? Thank you very much
library(tidymodels)
library(censored)
library(smoothHR)
lung <- lung %>% drop_na() %>%
mutate(surv_var = Surv(time, status), .keep = "unused")
## resampling
set.seed(253)
cv_split <- vfold_cv(lung, v = 10 )
## elastic net
enet_spec <-
proportional_hazards(penalty = tune(), mixture = tune()) %>%
set_engine("glmnet") %>%
set_mode("censored regression")
enet_spec
## grid
enet_grid <- grid_regular(penalty(), mixture(),
levels = list(penalty = 20, mixture = 6) )
## 1.- glmnet with Normalization
obj_rec_1 <-
recipe( surv_var ~ . , data=lung ) %>%
step_normalize(all_numeric_predictors())
obj_rec_1
wflow_1 <- workflow() %>%
add_model(enet_spec) %>%
add_recipe(obj_rec_1)
tune_result_1 <- wflow_1 %>%
tune_grid( resamples = cv_split,
grid = enet_grid,
metrics = metric_set(concordance_survival))
show_best(tune_result_1, metric = "concordance_survival")
## 2.- glmnet without Normalization
obj_rec_2 <-
recipe( surv_var ~ . , data=lung )
obj_rec_2
wflow_2 <- workflow() %>%
add_model(enet_spec) %>%
add_recipe(obj_rec_2)
tune_result_2 <- wflow_2 %>%
tune_grid( resamples = cv_split,
grid = enet_grid,
metrics = metric_set(concordance_survival))
show_best(tune_result_2, metric = "concordance_survival")
Hi @EmilHvitfeldt I don't know if you've finally seen something about this problem. I think it's a bug in the normalization in the recipe within the process of tuning parameters with glmnet
Could it be that tidymodels is always using the glmnet parameter standardize=TRUE, which is the default? If this were the case, there would be no difference in standardizing before the variables or not. It would also be necessary to see if this is a problem with censored data only, or also for classification and regression data.
@jesusherranz Thanks for peeling back some layers here, going from tuning to fitting a single model. Unless documented otherwise, parsnip and censored are using glmnet defaults, including standardize = TRUE
. If you want to change that, you can set glmnet arguments in set_engine()
like this:
library(tidymodels)
library(censored)
#> Loading required package: survival
lung <- na.omit(lung) %>%
mutate(surv = Surv(time, status), .keep = "unused")
spec <- proportional_hazards(penalty = 0.1) %>%
set_engine("glmnet")
spec_no_standardize <- proportional_hazards(penalty = 0.1) %>%
set_engine("glmnet", standardize = FALSE)
set.seed(1)
fit <- spec %>% fit(surv ~ ., data = lung)
set.seed(1)
fit_no_standardize <- spec_no_standardize %>% fit(surv ~ ., data = lung)
tidy(fit)
#> [...]
#> # A tibble: 8 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 inst 0 0.1
#> 2 age 0 0.1
#> 3 sex -0.192 0.1
#> 4 ph.ecog 0.235 0.1
#> 5 ph.karno 0 0.1
#> 6 pat.karno -0.00212 0.1
#> 7 meal.cal 0 0.1
#> 8 wt.loss 0 0.1
tidy(fit_no_standardize)
#> # A tibble: 8 × 3
#> term estimate penalty
#> <chr> <dbl> <dbl>
#> 1 inst -0.0133 0.1
#> 2 age 0.0109 0.1
#> 3 sex 0 0.1
#> 4 ph.ecog 0 0.1
#> 5 ph.karno -0.00154 0.1
#> 6 pat.karno -0.0185 0.1
#> 7 meal.cal 0.0000621 0.1
#> 8 wt.loss -0.00624 0.1
Created on 2024-10-23 with reprex v2.1.0
Hi, Hannah @hfrick Everything is clearer now Thank you very much
Hello I have doubts that the normalization of the predictors is working with glmnet in the censored package. I have prepared 2 identical examples with different recipes, tuning the 2 parameters of an elastic net model. The first one, with the normalized variables, and the second one, without normalizing. Both cases, I get the same results from the optimization of the parameters, which seems strange to me.