ns-rse closed this issue 1 month ago
I've been doing some more investigations and have found that there are potential issues with missing data for some of the variables that may be causing problems.

I've used the following script, based on logistic regression models that @mdp21oe had written.

NB - In order to be comparable with the `recipe`/`workflow` that we are investigating, I have changed the data being used to `train` instead of the full raw (but tidy) dataset `df`. The latter (i.e. `df`) includes 214 individuals who are missing `final_pathology` and cannot be included in the analysis anyway, since there is no assessment of their thyroid cancer status. Further, we want to work out why the various Tidymodels tuning doesn't work, and that uses the `train` subset, so we need to compare like for like.

Load libraries and data, remove those with missing `final_pathology` and split into `train` and `test`.
library(dplyr)
library(Hmisc)
library(knitr)
library(readr)
library(rmarkdown)
## Libraries for Tidymodelling
library(dials)
library(kernlab)
library(tidymodels)
library(tidyverse)
library(vip)
## Load and retain only those with final_pathology
df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
dplyr::select(
age_at_scan,
gender,
ethnicity,
incidental_nodule,
palpable_nodule,
rapid_enlargement,
compressive_symptoms,
hypertension,
vocal_cord_paresis,
graves_disease,
hashimotos_thyroiditis,
family_history_thyroid_cancer,
exposure_radiation,
albumin,
tsh_value,
lymphocytes,
monocyte,
bta_u_classification,
solitary_nodule,
size_nodule_mm,
cervical_lymphadenopathy,
thy_classification,
final_pathology) |>
dplyr::filter(!is.na(final_pathology))
## Split data into train/test
split <- rsample::initial_split(df_complete, prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
## Setup cross-validation with k-folds (n = 10)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
## Clinical variables
glm_clin <- glm(final_pathology ~ age_at_scan + gender + incidental_nodule + palpable_nodule +
rapid_enlargement + compressive_symptoms + hashimotos_thyroiditis + family_history_thyroid_cancer + exposure_radiation +
tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy,
data = train,
family = binomial(link = "logit"))
summary(glm_clin)
Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
palpable_nodule + rapid_enlargement + compressive_symptoms +
hashimotos_thyroiditis + family_history_thyroid_cancer +
exposure_radiation + tsh_value + size_nodule_mm + solitary_nodule +
cervical_lymphadenopathy, family = binomial(link = "logit"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.004e+00 7.626e-01 -2.627 0.008608 **
age_at_scan -2.977e-02 1.171e-02 -2.542 0.011026 *
genderM 7.614e-01 4.115e-01 1.850 0.064303 .
incidental_noduleYes -1.308e-01 4.969e-01 -0.263 0.792384
palpable_noduleYes -1.011e-01 5.328e-01 -0.190 0.849506
rapid_enlargementYes 5.313e-03 9.793e-01 0.005 0.995671
compressive_symptomsYes 1.170e-01 5.556e-01 0.211 0.833266
hashimotos_thyroiditisYes -1.459e+01 2.400e+03 -0.006 0.995149
family_history_thyroid_cancerYes 1.701e+01 2.400e+03 0.007 0.994343
exposure_radiationYes -1.489e+01 1.671e+03 -0.009 0.992889
tsh_value 1.336e-01 7.379e-02 1.811 0.070200 .
size_nodule_mm 4.478e-02 1.174e-02 3.816 0.000135 ***
solitary_noduleYes 6.990e-01 3.594e-01 1.945 0.051776 .
cervical_lymphadenopathyYes 3.017e+00 9.198e-01 3.280 0.001038 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 286.13 on 325 degrees of freedom
Residual deviance: 223.54 on 312 degrees of freedom
(536 observations deleted due to missingness)
AIC: 251.54
Number of Fisher Scoring iterations: 15
glm_clin
Call: glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
palpable_nodule + rapid_enlargement + compressive_symptoms +
hashimotos_thyroiditis + family_history_thyroid_cancer +
exposure_radiation + tsh_value + size_nodule_mm + solitary_nodule +
cervical_lymphadenopathy + incidental_nodule + tsh_value +
size_nodule_mm + solitary_nodule + bta_u_classification,
family = binomial(link = "logit"), data = train)
Coefficients:
(Intercept) age_at_scan genderM
-2.98963 -0.02312 0.61605
incidental_noduleYes palpable_noduleYes rapid_enlargementYes
-0.43710 -0.44385 -0.18599
compressive_symptomsYes hashimotos_thyroiditisYes family_history_thyroid_cancerYes
0.22305 -16.28615 16.10881
exposure_radiationYes tsh_value size_nodule_mm
-17.31802 0.15792 0.03851
solitary_noduleYes cervical_lymphadenopathyYes bta_u_classificationU3
0.77457 2.03330 1.94868
bta_u_classificationU4 bta_u_classificationU5
3.12813 17.74532
Degrees of Freedom: 312 Total (i.e. Null); 296 Residual
(549 observations deleted due to missingness) <<<<<<<<< IMPORTANT!!!
Null Deviance: 275
Residual Deviance: 180.3 AIC: 214.3
549 observations dropped due to missing data.
glm_biochem <- glm(final_pathology ~ age_at_scan + gender + tsh_value + albumin + lymphocytes + monocyte,
data = train,
family = binomial(link = "logit")
)
summary(glm_biochem)
Call:
glm(formula = final_pathology ~ age_at_scan + gender + tsh_value +
albumin + lymphocytes + monocyte, family = binomial(link = "logit"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.73160 2.67009 -0.649 0.516652
age_at_scan -0.03118 0.01122 -2.780 0.005441 **
genderM 1.34973 0.39931 3.380 0.000725 ***
tsh_value 0.11219 0.07189 1.561 0.118623
albumin 0.04716 0.05323 0.886 0.375708
lymphocytes 0.27508 0.26067 1.055 0.291292
monocyte -3.82506 1.27730 -2.995 0.002748 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 264.17 on 347 degrees of freedom
Residual deviance: 227.00 on 341 degrees of freedom
(514 observations deleted due to missingness)
AIC: 241
Number of Fisher Scoring iterations: 6
glm_biochem
Call: glm(formula = final_pathology ~ age_at_scan + gender + tsh_value +
albumin + lymphocytes + monocyte, family = binomial(link = "logit"),
data = train)
Coefficients:
(Intercept) age_at_scan genderM tsh_value albumin lymphocytes monocyte
-1.73160 -0.03118 1.34973 0.11219 0.04716 0.27508 -3.82506
Degrees of Freedom: 347 Total (i.e. Null); 341 Residual
(514 observations deleted due to missingness) <<<<<<<<< IMPORTANT!!!
Null Deviance: 264.2
Residual Deviance: 227 AIC: 241
514 observations deleted due to missing data.
glm_ultrasound <- glm(
final_pathology ~ age_at_scan + gender + incidental_nodule + tsh_value + size_nodule_mm +
solitary_nodule + cervical_lymphadenopathy + bta_u_classification, # + thy_classification,
data = train,
family = binomial(link = "logit")
)
summary(glm_ultrasound)
Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy +
bta_u_classification, family = binomial(link = "logit"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.92562 0.67664 -4.324 1.53e-05 ***
age_at_scan -0.02624 0.01140 -2.302 0.02135 *
genderM 0.62732 0.40513 1.548 0.12152
incidental_noduleYes -0.35053 0.38496 -0.911 0.36252
tsh_value 0.14586 0.07193 2.028 0.04257 *
size_nodule_mm 0.03141 0.01068 2.940 0.00328 **
solitary_noduleYes 0.32866 0.36267 0.906 0.36482
cervical_lymphadenopathyYes 0.64249 0.88127 0.729 0.46597
bta_u_classificationU3 1.98761 0.38883 5.112 3.19e-07 ***
bta_u_classificationU4 3.15909 0.69606 4.539 5.67e-06 ***
bta_u_classificationU5 18.13639 1164.29414 0.016 0.98757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 329.11 on 404 degrees of freedom
Residual deviance: 225.02 on 394 degrees of freedom
(457 observations deleted due to missingness)
AIC: 247.02
Number of Fisher Scoring iterations: 15
glm_ultrasound
Call: glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy +
bta_u_classification, family = binomial(link = "logit"),
data = train)
Coefficients:
(Intercept) age_at_scan genderM incidental_noduleYes
-2.92562 -0.02624 0.62732 -0.35053
tsh_value size_nodule_mm solitary_noduleYes cervical_lymphadenopathyYes
0.14586 0.03141 0.32866 0.64249
bta_u_classificationU3 bta_u_classificationU4 bta_u_classificationU5
1.98761 3.15909 18.13639
Degrees of Freedom: 404 Total (i.e. Null); 394 Residual
(457 observations deleted due to missingness) <<<<<<<<< IMPORTANT!!!
Null Deviance: 329.1
Residual Deviance: 225 AIC: 247
457 observations dropped due to missing data.
In the Tidymodels work we are attempting to use a saturated model with all variables because, as can be seen from the above, the coefficients vary wildly depending on the other covariates. If we look at the values for `age_at_scan` and `gender`, which are included in all of the models, they bounce around all over the place. What we want is a fully adjusted estimate for each covariate, taking into account all the other variables.
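As a generic toy illustration of this point (nothing to do with our variables, just simulated data): a predictor that merely proxies a confounder looks strongly associated with the outcome until the confounder is adjusted for, which is why we want the fully adjusted estimates.

```r
set.seed(1)
n <- 5000
z <- rnorm(n)                 # confounder that actually drives the outcome
x <- z + rnorm(n)             # x is associated with y only through z
y <- rbinom(n, 1, plogis(-1 + 1.5 * z))

coef(glm(y ~ x, family = binomial))["x"]      # clearly non-zero: x proxies z
coef(glm(y ~ x + z, family = binomial))["x"]  # shrinks towards zero once z is in the model
```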
glm_all <- glm(
final_pathology ~ age_at_scan + gender + incidental_nodule + palpable_nodule +
rapid_enlargement + compressive_symptoms + hashimotos_thyroiditis + family_history_thyroid_cancer + exposure_radiation +
tsh_value +
size_nodule_mm +
solitary_nodule +
cervical_lymphadenopathy +
albumin +
lymphocytes +
monocyte +
incidental_nodule +
tsh_value +
size_nodule_mm +
solitary_nodule +
bta_u_classification,
data = train,
family = binomial(link = "logit")
)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
Oh no, our model didn't fit and we've got a weird error cropping up here!
Looking at the pattern of missing data (see the graph and tables above in the current rendering of `index.qmd`) we see the variables with the most missing data are...

| Variable | N | % |
|---|---|---|
| `thy_classification` | 915 | 79.6 |
| `albumin` | 515 | 44.8 |
| `tsh_value` | 413 | 35.9 |
| `monocyte` | 363 | 31.6 |
| `lymphocytes` | 359 | 31.2 |
| `size_nodule_mm` | 319 | 27.7 |
| `family_history_thyroid_cancer` | 281 | 24.4 |
We have already excluded `thy_classification` in the above, but there are some others with very high levels of missing data, namely the biochemical markers obtained from blood assays: `albumin`, `lymphocytes`, `monocyte` and `tsh_value`.

If individuals are excluded based on these, and the remaining subset is invariant for one of the other binary variables, e.g. `exposure_radiation` where only 6 individuals are `Yes` and the rest are `No`, then the `glm()` model will complain with the above message that contrasts can be applied only to factors with 2 or more levels, because a predictor has to vary for it to be usable in the model. (If everyone who remains after dropping those with missing values is `exposure_radiation = No` then we will see this sort of error, although it's worth noting I've not investigated whether this is the problem variable.)
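As a sanity check on this explanation, here is a minimal, self-contained sketch on toy data (not our dataset; all names are made up) showing how row-wise deletion can leave a factor with a single observed level and trigger the same error. `glm()` drops unused factor levels when it builds the model frame, so the declared `Yes` level does not save us:

```r
set.seed(42)
## Toy data: the only two "Yes" rows for `exposure` are also the rows
## missing `marker`, mimicking the pattern hypothesised above.
toy <- data.frame(
  outcome  = factor(sample(c("Benign", "Cancer"), 20, replace = TRUE)),
  exposure = factor(c(rep("No", 18), "Yes", "Yes"), levels = c("No", "Yes")),
  marker   = c(rnorm(18), NA, NA)
)

## glm() deletes the rows with missing `marker` and drops unused levels,
## leaving `exposure` with a single observed level, so the fit fails with
## "contrasts can be applied only to factors with 2 or more levels"
try(glm(outcome ~ exposure + marker, data = toy, family = binomial(link = "logit")))
```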
But the biochemical markers are also problematic because they have so much missing data. If we exclude these biochemical markers, and `exposure_radiation` too for good measure since there are so few people with `Yes`, we can fit a saturated model.
## Saturated model
glm_all <- glm(
final_pathology ~ age_at_scan +
gender +
incidental_nodule +
palpable_nodule +
rapid_enlargement +
compressive_symptoms +
hashimotos_thyroiditis +
family_history_thyroid_cancer +
## exposure_radiation + <<< We can easily exclude variables by laying out code in
## tsh_value + <<< this manner and adding comments to start of lines ;-)
size_nodule_mm +
solitary_nodule +
cervical_lymphadenopathy +
## albumin +
## lymphocytes +
## monocyte +
incidental_nodule +
size_nodule_mm +
solitary_nodule +
bta_u_classification,
data = train,
family = binomial(link = "logit")
)
summary(glm_all)
Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
palpable_nodule + rapid_enlargement + compressive_symptoms +
hashimotos_thyroiditis + family_history_thyroid_cancer +
size_nodule_mm + solitary_nodule + cervical_lymphadenopathy +
incidental_nodule + size_nodule_mm + solitary_nodule + bta_u_classification,
family = binomial(link = "logit"), data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.75668 0.75789 -3.637 0.000276 ***
age_at_scan -0.02455 0.01143 -2.147 0.031782 *
genderM 0.49636 0.40278 1.232 0.217822
incidental_noduleYes -0.61578 0.49288 -1.249 0.211540
palpable_noduleYes -0.18385 0.51993 -0.354 0.723637
rapid_enlargementYes -0.27294 1.08007 -0.253 0.800497
compressive_symptomsYes -0.12914 0.57947 -0.223 0.823640
hashimotos_thyroiditisYes -15.60853 2399.54500 -0.007 0.994810
family_history_thyroid_cancerYes 0.82812 2.04767 0.404 0.685904
size_nodule_mm 0.03477 0.01109 3.136 0.001711 **
solitary_noduleYes 0.67604 0.36639 1.845 0.065018 .
cervical_lymphadenopathyYes 1.32420 0.98795 1.340 0.180132
bta_u_classificationU3 2.16334 0.39294 5.506 3.68e-08 ***
bta_u_classificationU4 3.41627 0.75777 4.508 6.53e-06 ***
bta_u_classificationU5 18.92801 954.07210 0.020 0.984172
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 338.70 on 437 degrees of freedom
Residual deviance: 219.56 on 423 degrees of freedom
(424 observations deleted due to missingness)
AIC: 249.56
Number of Fisher Scoring iterations: 15
glm_all
Call: glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule +
palpable_nodule + rapid_enlargement + compressive_symptoms +
hashimotos_thyroiditis + family_history_thyroid_cancer +
size_nodule_mm + solitary_nodule + cervical_lymphadenopathy +
incidental_nodule + size_nodule_mm + solitary_nodule + bta_u_classification,
family = binomial(link = "logit"), data = train)
Coefficients:
(Intercept) age_at_scan genderM
-2.75668 -0.02455 0.49636
incidental_noduleYes palpable_noduleYes rapid_enlargementYes
-0.61578 -0.18385 -0.27294
compressive_symptomsYes hashimotos_thyroiditisYes family_history_thyroid_cancerYes
-0.12914 -15.60853 0.82812
size_nodule_mm solitary_noduleYes cervical_lymphadenopathyYes
0.03477 0.67604 1.32420
bta_u_classificationU3 bta_u_classificationU4 bta_u_classificationU5
2.16334 3.41627 18.92801
Degrees of Freedom: 437 Total (i.e. Null); 423 Residual
(424 observations deleted due to missingness)
Null Deviance: 338.7
Residual Deviance: 219.6 AIC: 249.6
424 observations are still dropped due to missing data though!!!
This hasn't resolved all of the problems but is something to bear in mind.
I've asked a question on the Data Science Learning Community Slack (specifically in the #help-r-model channel; the first link in this paragraph will take you there, although you will need to sign up to the Slack workspace first, @mdp21oe).

Below is a copy of the question, with perhaps clearer workings, that I posted to the Data Science Learning Community Slack (you need to sign up to the Slack community to view it, hence posting here for reference).
Hi,
I'm trying to wrangle a strange problem I've encountered trying to use the Tidymodels framework. If anyone has any thoughts or ideas as to what might be going on I would be very grateful.
The data pertain to thyroid cancer (`final_pathology`) and there are a bunch of covariates and large amounts of missing data, but I can demonstrate the problem with just two covariates: `incidental_nodule` and `palpable_nodule`.

I can fit models for each of these covariates independently, but not together.

Import the data, remove those with missing `final_pathology` and split into `train`/`test`.
df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
dplyr::select(
incidental_nodule,
palpable_nodule,
final_pathology) |>
dplyr::filter(!is.na(final_pathology))
## Split data into train/test
split <- rsample::initial_split(df_complete, prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
First fit using `glm()`...
glm_incidental <- glm(
final_pathology ~ incidental_nodule,
data = train,
family = binomial(link = "logit")
)
summary(glm_incidental)
glm_incidental
Call:
glm(formula = final_pathology ~ incidental_nodule, family = binomial(link = "logit"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7800 0.1432 -12.431 < 2e-16 ***
incidental_noduleYes -1.1287 0.2538 -4.446 8.73e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 536.64 on 858 degrees of freedom
Residual deviance: 514.94 on 857 degrees of freedom
(3 observations deleted due to missingness)
AIC: 518.94
Number of Fisher Scoring iterations: 5
Call: glm(formula = final_pathology ~ incidental_nodule, family = binomial(link = "logit"),
data = train)
Coefficients:
(Intercept) incidental_noduleYes
-1.780 -1.129
Degrees of Freedom: 858 Total (i.e. Null); 857 Residual
(3 observations deleted due to missingness)
Null Deviance: 536.6
Residual Deviance: 514.9 AIC: 518.9
Now fit with Tidymodels...
recipe_incidental <- recipes::recipe(final_pathology ~ incidental_nodule, data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_incidental <- workflows::workflow() |>
workflows::add_recipe(recipe_incidental)
## Model
logistic_model <- logistic_reg() |>
set_engine("glm")
log_incidental <- workflow_incidental |>
add_model(logistic_model)
## Fit
log_incidental_fit <- fit(log_incidental, data = train)
log_incidental_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_filter_missing()
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) incidental_nodule_Yes
-1.780 -1.129
Degrees of Freedom: 858 Total (i.e. Null); 857 Residual
(3 observations deleted due to missingness)
Null Deviance: 536.6
Residual Deviance: 514.9 AIC: 518.9
Great, we can fit a GLM with both `glm()` and a Tidymodels recipe and workflow for `incidental_nodule`, and we get the same fitted coefficients.

Now for `palpable_nodule`, again starting with `glm()`...
glm_palpable <- glm(
final_pathology ~ palpable_nodule,
data = train,
family = binomial(link = "logit")
)
summary(glm_palpable)
glm_palpable
Call:
glm(formula = final_pathology ~ palpable_nodule, family = binomial(link = "logit"),
data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.0975 0.2231 -13.887 < 2e-16 ***
palpable_noduleYes 1.5238 0.2663 5.721 1.06e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 514.94 on 817 degrees of freedom
Residual deviance: 477.51 on 816 degrees of freedom
(44 observations deleted due to missingness)
AIC: 481.51
Number of Fisher Scoring iterations: 5
Call: glm(formula = final_pathology ~ palpable_nodule, family = binomial(link = "logit"),
data = train)
Coefficients:
(Intercept) palpable_noduleYes
-3.098 1.524
Degrees of Freedom: 817 Total (i.e. Null); 816 Residual
(44 observations deleted due to missingness)
Null Deviance: 514.9
Residual Deviance: 477.5 AIC: 481.5
recipe_palpable <- recipes::recipe(final_pathology ~ palpable_nodule, data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_palpable <- workflows::workflow() |>
workflows::add_recipe(recipe_palpable)
## Model
logistic_model <- logistic_reg() |>
set_engine("glm")
log_palpable <- workflow_palpable |>
add_model(logistic_model)
## Fit
log_palpable_fit <- fit(log_palpable, data = train)
log_palpable_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_filter_missing()
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) palpable_nodule_Yes
-3.098 1.524
Degrees of Freedom: 817 Total (i.e. Null); 816 Residual
(44 observations deleted due to missingness)
Null Deviance: 514.9
Residual Deviance: 477.5 AIC: 481.5
Again we have the same fitted model, which is good and what I would hope for.

Now let's try both `incidental_nodule` and `palpable_nodule` in the model, starting with a basic call to `glm()`.
glm_incidental_palpable <- glm(
final_pathology ~ incidental_nodule + palpable_nodule,
data = train,
family = binomial(link = "logit")
)
summary(glm_incidental_palpable)
glm_incidental_palpable
Call:
glm(formula = final_pathology ~ incidental_nodule + palpable_nodule,
family = binomial(link = "logit"), data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7353 0.3646 -7.502 6.29e-14 ***
incidental_noduleYes -0.4473 0.3708 -1.207 0.22762
palpable_noduleYes 1.2076 0.3692 3.271 0.00107 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 514.94 on 817 degrees of freedom
Residual deviance: 476.03 on 815 degrees of freedom
(44 observations deleted due to missingness)
AIC: 482.03
Number of Fisher Scoring iterations: 5
Call: glm(formula = final_pathology ~ incidental_nodule + palpable_nodule,
family = binomial(link = "logit"), data = train)
Coefficients:
(Intercept) incidental_noduleYes palpable_noduleYes
-2.7353 -0.4473 1.2076
Degrees of Freedom: 817 Total (i.e. Null); 815 Residual
(44 observations deleted due to missingness)
Null Deviance: 514.9
Residual Deviance: 476 AIC: 482
Great, we can fit a model using two covariates; let's try the Tidymodels approach...
recipe_incidental_palpable <- recipes::recipe(final_pathology ~ incidental_nodule + palpable_nodule, data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_incidental_palpable <- workflows::workflow() |>
workflows::add_recipe(recipe_incidental_palpable)
## Model
logistic_model <- logistic_reg() |>
set_engine("glm")
log_incidental_palpable <- workflow_incidental_palpable |>
add_model(logistic_model)
## Fit
log_incidental_palpable_fit <- fit(log_incidental_palpable, data = train)
log_incidental_palpable_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_filter_missing()
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept)
-2.266
Degrees of Freedom: 861 Total (i.e. Null); 861 Residual
Null Deviance: 537.2
Residual Deviance: 537.2 AIC: 539.2
Oh no!!! Where have our estimates for the covariates gone? We could fit a model using `glm()` directly, but not using the Tidymodels `recipe()` + `workflow()` + `model()` + `fit()` approach.

I'm baffled as to why this happens. If I substitute in one of the other covariates (`age_at_scan`) it is retained, but `incidental_nodule` or `palpable_nodule` (or both) are dropped.
glm_age_palpable <- glm(
final_pathology ~ age_at_scan + palpable_nodule,
data = train,
family = binomial(link = "logit")
)
summary(glm_age_palpable)
glm_age_palpable
Call: glm(formula = final_pathology ~ age_at_scan + palpable_nodule,
family = binomial(link = "logit"), data = train)
Coefficients:
(Intercept) age_at_scan palpable_noduleYes
-1.75935 -0.02615 1.49668
Degrees of Freedom: 817 Total (i.e. Null); 815 Residual
(44 observations deleted due to missingness)
Null Deviance: 514.9
Residual Deviance: 464.9 AIC: 470.9
We get coefficients for both `age_at_scan` and `palpable_nodule`, but with Tidymodels the latter goes missing.
recipe_age_palpable <- recipes::recipe(final_pathology ~ age_at_scan + palpable_nodule, data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_age_palpable <- workflows::workflow() |>
workflows::add_recipe(recipe_age_palpable)
## Model
logistic_model <- logistic_reg() |>
set_engine("glm")
log_age_palpable <- workflow_age_palpable |>
add_model(logistic_model)
## Fit
log_age_palpable_fit <- fit(log_age_palpable, data = train)
log_age_palpable_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_filter_missing()
• step_normalize()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) age_at_scan
-2.3489 -0.4525
Degrees of Freedom: 861 Total (i.e. Null); 860 Residual
Null Deviance: 537.2
Residual Deviance: 522.4 AIC: 526.4
I can see the degrees of freedom differ, presumably because `palpable_nodule` is being dropped.
I've been working on imputing missing values and have therefore decided to test whether we see the same strange behaviour on a dataset without any missing data (i.e. after imputation).

The following code imputes a single dataset using the Predictive Mean Matching method implemented in `mice`. It then attempts to fit a LASSO model using the Tidymodels approach.
library(dplyr)
library(Hmisc)
library(knitr)
library(readr)
library(rmarkdown)
## Libraries for Tidymodelling
library(dials)
library(kernlab)
library(mice)
library(stringr)
library(tidymodels)
library(tidyverse)
library(vip)
## Load and retain only those with final_pathology
df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
dplyr::select(
age_at_scan,
gender,
ethnicity,
incidental_nodule,
palpable_nodule,
rapid_enlargement,
compressive_symptoms,
hypertension,
vocal_cord_paresis,
graves_disease,
hashimotos_thyroiditis,
family_history_thyroid_cancer,
exposure_radiation,
albumin,
tsh_value,
lymphocytes,
monocyte,
bta_u_classification,
solitary_nodule,
size_nodule_mm,
cervical_lymphadenopathy,
thy_classification,
final_pathology) |>
dplyr::filter(!is.na(final_pathology))
##########################################
## 2024-07-18 - Testing with an imputed data set
##
## Impute data
imputations = 1
cores = imputations
mice_pmm_single <- df_complete |>
dplyr::select(-final_pathology) |> ## We deliberately exclude the outcome of interest
mice::futuremice(m = imputations, method = "pmm", n.core = cores) |>
mice::complete(action = "long", include = FALSE)
## Bind the final_pathology back on, removing the .id column and renaming columns
mice_pmm_single <- cbind(mice_pmm_single, df_complete$final_pathology) |>
dplyr::select(-.id)
colnames(mice_pmm_single) <- stringr::str_replace(colnames(mice_pmm_single), "df_complete\\$", "")
## Split a single imputed data set into train/test
split <- mice_pmm_single |>
dplyr::filter(.imp == 1) |>
dplyr::select(-.imp) |>
rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
## Setup cross-validation with k-folds (n = 10)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
## Setup a recipe and workflow
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
## @ns-rse 2024-06-14 :
## This step can be used to filter observations with missing data, see the manual pages for more details
## https://recipes.tidymodels.org/reference/step_filter_missing.html
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
## @ns-rse 2024-06-14 :
## We first normalise the data _before_ we generate dummies otherwise the dummies, which are numerical, get normalised
## too
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
workflows::add_recipe(thyroid_recipe)
## Try tuning a LASSO model
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
parsnip::set_engine("glmnet")
## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
resamples = cv_folds,
grid = dials::grid_regular(penalty(), levels = 50)
)
## K-fold best fit for LASSO
lasso_kfold_roc_auc <- lasso_grid |>
tune::select_best(metric = "roc_auc")
## Fit the final LASSO model
final_lasso_kfold <- tune::finalize_workflow(
workflows::add_model(thyroid_workflow, tune_spec_lasso),
lasso_kfold_roc_auc)
final_lasso_kfold |>
fit(train) |>
hardhat::extract_fit_parsnip() |>
vip::vi(lambda = lasso_kfold_roc_auc$penalty) |>
dplyr::mutate(
Importance = abs(Importance),
Variable = fct_reorder(Variable, Importance)
) |>
ggplot(mapping = aes(x = Importance, y = Variable, fill = Sign)) +
geom_col() +
dark_theme_minimal()
This "works", in that the "importance" plot generated at the end includes all of the covariates and not just two...

Thus there is a problem somewhere in the `thyroid_recipe` that we set up with the code below when it is applied to the real-life dataset.
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
## @ns-rse 2024-06-14 :
## This step can be used to filter observations with missing data, see the manual pages for more details
## https://recipes.tidymodels.org/reference/step_filter_missing.html
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
## @ns-rse 2024-06-14 :
## We first normalise the data _before_ we generate dummies otherwise the dummies, which are numerical, get normalised
## too
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
There are three components to the recipe here...

1. `recipes::step_filter_missing(recipes::all_predictors(), threshold = 0)`
2. `recipes::step_normalize(recipes::all_numeric_predictors())`
3. `recipes::step_dummy(recipes::all_nominal_predictors())`

Let's consider them in turn.
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0)

This first step is saying...

> Filter missing for all predictors, and remove any variable that has any missing data at all (the `threshold = 0` bit).

This is likely the problematic step, since it will remove any predictor with any missing data whatsoever.
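A tiny self-contained sketch of this behaviour on toy data (made-up column names, not our dataset): one fully observed predictor survives, while a predictor with even a single `NA` is removed entirely when `threshold = 0`.

```r
library(recipes)

set.seed(123)
toy <- data.frame(
  outcome    = factor(rep(c("Benign", "Cancer"), 10)),
  x_complete = rnorm(20),
  x_one_na   = c(NA, rnorm(19))  # a single missing value (5%)
)

baked <- recipes::recipe(outcome ~ ., data = toy) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::prep() |>
  recipes::bake(new_data = NULL)

## x_one_na is gone: with threshold = 0 one missing value is enough
## to drop the whole column, only x_complete and outcome remain
names(baked)
```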
The following demonstrates that it is definitely the `recipes::step_filter_missing()` step that is causing the problem. The steps take the `recipes::recipe()` and then manually `recipes::prep()` (https://recipes.tidymodels.org/reference/prep.html) it before `recipes::bake()` (as noted in the documentation there, the recommended approach is to use `workflows::workflow()`, as we have done).
`str()` shows the structure of an R object, and this demonstrates that with `threshold = 0` any column with missing data is removed, leaving only `age_at_scan` and `gender_M`.
Almost all variables are removed...
split <- df_complete|>
rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
## Setup a recipe and workflow with all steps
thyroid_recipe_all <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep()
all <- thyroid_recipe_all |> bake(new_data = NULL)
str(all)
tibble [862 × 3] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
..- attr(*, "label")= chr "Age"
$ final_pathology: Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
$ gender_M : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
threshold = 0.5
With a higher threshold for removing variables (a variable must have >= 50% missing to be dropped), more variables are retained.
thyroid_recipe_high_missing_threshold <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.5) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep()
high_missing_threshold <- thyroid_recipe_high_missing_threshold |> bake(new_data = NULL)
str(high_missing_threshold)
tibble [862 × 39] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
..- attr(*, "label")= chr "Age"
$ albumin : 'labelled' num [1:862] 0.598 0.334 NA 1.126 -0.986 ...
..- attr(*, "label")= chr "Albumin"
$ tsh_value : 'labelled' num [1:862] -0.737 NA -0.969 0.535 -0.969 ...
..- attr(*, "label")= chr "TSH value"
$ lymphocytes : 'labelled' num [1:862] NA 1.421 0.518 -0.453 0.368 ...
..- attr(*, "label")= chr "Lymphocytes"
$ monocyte : 'labelled' num [1:862] NA 1.4018 -0.0807 -0.0313 0.0181 ...
..- attr(*, "label")= chr "Monocytes"
$ size_nodule_mm : 'labelled' num [1:862] NA 1.164 0.362 2.274 -0.316 ...
..- attr(*, "label")= chr "Nodule size (mm)"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
$ gender_M : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ palpable_nodule_Yes : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
$ hypertension_Yes : num [1:862] 0 1 0 0 0 0 0 NA 0 0 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 0 0 0 1 NA 0 0 ...
$ exposure_radiation_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ bta_u_classification_U2 : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
$ bta_u_classification_U3 : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
$ cervical_lymphadenopathy_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
Removing recipes::step_filter_missing() completely
If we remove the missing-data filtering step completely, all variables are retained.
thyroid_recipe_skip_filter_missing <- recipes::recipe(final_pathology ~ ., data = train) |>
## recipes::step_filter_missing(recipes::all_predictors(), threshold = 0, skip = TRUE) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep()
skip_filter_missing <- thyroid_recipe_skip_filter_missing |> bake(new_data = NULL)
str(skip_filter_missing)
tibble [862 × 46] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
..- attr(*, "label")= chr "Age"
$ albumin : 'labelled' num [1:862] 0.598 0.334 NA 1.126 -0.986 ...
..- attr(*, "label")= chr "Albumin"
$ tsh_value : 'labelled' num [1:862] -0.737 NA -0.969 0.535 -0.969 ...
..- attr(*, "label")= chr "TSH value"
$ lymphocytes : 'labelled' num [1:862] NA 1.421 0.518 -0.453 0.368 ...
..- attr(*, "label")= chr "Lymphocytes"
$ monocyte : 'labelled' num [1:862] NA 1.4018 -0.0807 -0.0313 0.0181 ...
..- attr(*, "label")= chr "Monocytes"
$ size_nodule_mm : 'labelled' num [1:862] NA 1.164 0.362 2.274 -0.316 ...
..- attr(*, "label")= chr "Nodule size (mm)"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
$ gender_M : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ palpable_nodule_Yes : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
$ hypertension_Yes : num [1:862] 0 1 0 0 0 0 0 NA 0 0 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 0 0 0 1 NA 0 0 ...
$ exposure_radiation_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ bta_u_classification_U2 : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
$ bta_u_classification_U3 : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
$ cervical_lymphadenopathy_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy1c : num [1:862] NA 0 NA NA 0 0 NA NA NA 1 ...
$ thy_classification_Thy2 : num [1:862] NA 1 NA NA 0 0 NA NA NA 0 ...
$ thy_classification_Thy2c : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
$ thy_classification_Thy3a : num [1:862] NA 0 NA NA 1 0 NA NA NA 0 ...
$ thy_classification_Thy3f : num [1:862] NA 0 NA NA 0 1 NA NA NA 0 ...
$ thy_classification_Thy4 : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
$ thy_classification_Thy5 : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
threshold = 0.1
If we exclude variables with > 10% of observations missing, we retain the following variables.
thyroid_recipe_investigate_missing_threshold <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.1) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep()
investigate_missing_threshold <- thyroid_recipe_investigate_missing_threshold |> bake(new_data = NULL)
str(investigate_missing_threshold)
tibble [862 × 31] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
..- attr(*, "label")= chr "Age"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
$ gender_M : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
$ palpable_nodule_Yes : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
$ bta_u_classification_U2 : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
$ bta_u_classification_U3 : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
$ cervical_lymphadenopathy_Yes: num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
We can then try fitting a model using this recipe with a higher threshold.
thyroid_recipe_investigate_missing_threshold <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.1) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
workflows::add_recipe(thyroid_recipe_investigate_missing_threshold)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
parsnip::set_engine("glmnet")
## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
resamples = cv_folds,
grid = dials::grid_regular(penalty(), levels = 50)
)
## K-fold best fit for LASSO
lasso_kfold_roc_auc <- lasso_grid |>
tune::select_best(metric = "roc_auc")
→ A | warning: ! There are new levels in `ethnicity`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ B | error: x has missing values; consider using makeX() to impute them
→ C | warning: ! There are new levels in `ethnicity`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `exposure_radiation`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ D | warning: ! There are new levels in `ethnicity`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hypertension`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `exposure_radiation`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ E | warning: ! There are new levels in `ethnicity`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hypertension`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
There were issues with some computations A: x51 B: x100 C: x20 D: x19 E: x10
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Not great, `! All models failed` :cry:
But we have some useful information on things we can try: `recipes::step_unknown()` could be useful here, and we could perhaps be smarter about which predictor variables we include in the model by leveraging `recipes::step_corr()` to remove correlated variables, perhaps preferentially retaining those with the most complete data to minimise the impact of missing data.
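One possible shape for such a recipe is sketched below. Note this is untested against this dataset, and the use of `recipes::step_impute_median()` on the numeric predictors is my assumption, added because glmnet itself cannot handle missing values (hence the `makeX()` error above).

```r
## Sketch of a recipe that handles missing data explicitly (assumptions:
## median imputation for numeric NAs, an "unknown" level for factor NAs)
thyroid_recipe_handle_na <- recipes::recipe(final_pathology ~ ., data = train) |>
  ## Impute numeric NAs so glmnet no longer errors with "x has missing values"
  recipes::step_impute_median(recipes::all_numeric_predictors()) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  ## Give factor NAs an explicit level _before_ creating dummies
  recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
  recipes::step_dummy(recipes::all_nominal_predictors()) |>
  ## Remove highly correlated predictors
  recipes::step_corr(recipes::all_predictors(), threshold = 0.9)
```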
recipes::step_corr()
Let's start by removing `recipes::step_filter_missing()` and look at what the normalisation and dummy steps do...
recipes::step_normalize()
and recipes::step_dummy()
split <- df_complete|>
rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
## Without removing anything, what do we have?
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep() |>
recipes::bake(new_data = NULL)
str(thyroid_recipe)
tibble [862 × 46] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
..- attr(*, "label")= chr "Age"
$ albumin : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
..- attr(*, "label")= chr "Albumin"
$ tsh_value : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
..- attr(*, "label")= chr "TSH value"
$ lymphocytes : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
..- attr(*, "label")= chr "Lymphocytes"
$ monocyte : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
..- attr(*, "label")= chr "Monocytes"
$ size_nodule_mm : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
..- attr(*, "label")= chr "Nodule size (mm)"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
$ gender_M : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
$ palpable_nodule_Yes : num [1:862] 1 NA 1 0 0 0 1 0 1 0 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
$ hypertension_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 NA 0 NA 0 0 0 0 ...
$ exposure_radiation_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U2 : num [1:862] 1 1 1 1 1 1 1 1 1 0 ...
$ bta_u_classification_U3 : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
$ cervical_lymphadenopathy_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy1c : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy2 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy2c : num [1:862] NA NA 1 NA NA NA NA NA NA NA ...
$ thy_classification_Thy3a : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy3f : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy4 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy5 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
recipes::step_corr(threshold = 0.9)
Now add in `recipes::step_corr(threshold = 0.9)`.
thyroid_recipe_step_corr <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::step_corr(all_predictors(), threshold = 0.9) |>
recipes::prep() |>
recipes::bake(new_data = NULL)
str(thyroid_recipe_step_corr)
tibble [862 × 45] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
..- attr(*, "label")= chr "Age"
$ albumin : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
..- attr(*, "label")= chr "Albumin"
$ tsh_value : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
..- attr(*, "label")= chr "TSH value"
$ lymphocytes : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
..- attr(*, "label")= chr "Lymphocytes"
$ monocyte : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
..- attr(*, "label")= chr "Monocytes"
$ size_nodule_mm : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
..- attr(*, "label")= chr "Nodule size (mm)"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
$ gender_M : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
$ palpable_nodule_Yes : num [1:862] 1 NA 1 0 0 0 1 0 1 0 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
$ hypertension_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 NA 0 NA 0 0 0 0 ...
$ exposure_radiation_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U3 : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
$ cervical_lymphadenopathy_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy1c : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy2 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy2c : num [1:862] NA NA 1 NA NA NA NA NA NA NA ...
$ thy_classification_Thy3a : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy3f : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy4 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
$ thy_classification_Thy5 : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
What is different? We can use the `waldo` package's `compare()` function to tell us quickly whether these are the same.
install.packages("waldo")
waldo::compare(thyroid_recipe, thyroid_recipe_step_corr)
`old` is length 46
`new` is length 45
names(old) | names(new)
[31] "hashimotos_thyroiditis_Yes" | "hashimotos_thyroiditis_Yes" [31]
[32] "family_history_thyroid_cancer_Yes" | "family_history_thyroid_cancer_Yes" [32]
[33] "exposure_radiation_Yes" | "exposure_radiation_Yes" [33]
[34] "bta_u_classification_U2" -
[35] "bta_u_classification_U3" | "bta_u_classification_U3" [34]
[36] "bta_u_classification_U4" | "bta_u_classification_U4" [35]
[37] "bta_u_classification_U5" | "bta_u_classification_U5" [36]
`old$bta_u_classification_U2` is a double vector (1, 1, 1, 1, 1, ...)
`new$bta_u_classification_U2` is absent
...which tells us that one level of `bta_u_classification` (the `U2` level) has been dropped by `recipes::step_corr()`.
Ok, not much benefit to doing this, but it is considerably quicker than comparing by eye and trying to work out what is different.
recipes::step_unknown()
Let's now investigate the effect of using `recipes::step_unknown()` on our recipe. This step needs to be run before `recipes::step_dummy()` and only works on nominal/binary/factor variables. We again use `waldo::compare()` to see how the resulting data set differs from the `thyroid_recipe` produced in the step above, which normalises and creates dummy variables.
thyroid_recipe_step_unknown <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
recipes::step_dummy(recipes::all_nominal_predictors()) |>
recipes::prep() |>
recipes::bake(new_data = NULL)
str(thyroid_recipe_step_unknown)
tibble [862 × 62] (S3: tbl_df/tbl/data.frame)
$ age_at_scan : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
..- attr(*, "label")= chr "Age"
$ albumin : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
..- attr(*, "label")= chr "Albumin"
$ tsh_value : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
..- attr(*, "label")= chr "TSH value"
$ lymphocytes : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
..- attr(*, "label")= chr "Lymphocytes"
$ monocyte : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
..- attr(*, "label")= chr "Monocytes"
$ size_nodule_mm : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
..- attr(*, "label")= chr "Nodule size (mm)"
$ final_pathology : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
$ gender_M : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
$ gender_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_B : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_C : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
$ ethnicity_D : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_F : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_G : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_H : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_J : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_K : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_L : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_M : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_N : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_P : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_R : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_S : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ ethnicity_Z : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
$ ethnicity_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ incidental_nodule_Yes : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
$ incidental_nodule_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ palpable_nodule_Yes : num [1:862] 1 0 1 0 0 0 1 0 1 0 ...
$ palpable_nodule_unknown : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
$ rapid_enlargement_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ rapid_enlargement_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ compressive_symptoms_Yes : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
$ compressive_symptoms_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hypertension_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hypertension_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ vocal_cord_paresis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ vocal_cord_paresis_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ graves_disease_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ graves_disease_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hashimotos_thyroiditis_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ hashimotos_thyroiditis_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ family_history_thyroid_cancer_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ family_history_thyroid_cancer_unknown: num [1:862] 0 0 0 1 0 1 0 0 0 0 ...
$ exposure_radiation_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ exposure_radiation_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U2 : num [1:862] 1 1 1 1 1 1 1 1 1 0 ...
$ bta_u_classification_U3 : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
$ bta_u_classification_U4 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_U5 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ bta_u_classification_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ solitary_nodule_Yes : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
$ solitary_nodule_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ cervical_lymphadenopathy_Yes : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ cervical_lymphadenopathy_unknown : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy1c : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy2 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy2c : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
$ thy_classification_Thy3a : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy3f : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy4 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_Thy5 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
$ thy_classification_unknown : num [1:862] 1 1 0 1 1 1 1 1 1 1 ...
And comparing them we see...
## Compare the packages
waldo::compare(thyroid_recipe, thyroid_recipe_step_unknown)
`old` is length 46
`new` is length 62
`names(old)[6:11]`: "size_nodule_mm" "final_pathology" "gender_M" "ethnicity_B" "ethnicity_C" "ethnicity_D"
`names(new)[6:12]`: "size_nodule_mm" "final_pathology" "gender_M" "gender_unknown" "ethnicity_B" "ethnicity_C" "ethnicity_D"
names(old) | names(new)
[21] "ethnicity_R" | "ethnicity_R" [22]
[22] "ethnicity_S" | "ethnicity_S" [23]
[23] "ethnicity_Z" | "ethnicity_Z" [24]
- "ethnicity_unknown" [25]
[24] "incidental_nodule_Yes" | "incidental_nodule_Yes" [26]
- "incidental_nodule_unknown" [27]
[25] "palpable_nodule_Yes" | "palpable_nodule_Yes" [28]
- "palpable_nodule_unknown" [29]
[26] "rapid_enlargement_Yes" | "rapid_enlargement_Yes" [30]
- "rapid_enlargement_unknown" [31]
... ... ... and 26 more ...
`names(old)[44:46]`: "thy_classification_Thy3f" "thy_classification_Thy4" "thy_classification_Thy5"
`names(new)[59:62]`: "thy_classification_Thy3f" "thy_classification_Thy4" "thy_classification_Thy5" "thy_classification_unknown"
`old$ethnicity_B[38:44]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[38:44]`: 0 0 0 0 0 0 0
`old$ethnicity_B[48:54]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[48:54]`: 0 0 0 0 0 0 0
`old$ethnicity_B[58:64]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[58:64]`: 0 0 0 0 0 0 0
`old$ethnicity_B[110:116]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[110:116]`: 0 0 0 0 0 0 0
`old$ethnicity_B[133:139]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[133:139]`: 0 0 0 0 0 0 0
`old$ethnicity_B[171:183]`: 0 0 0 NA 0 0 0 0 0 NA and 3 more...
`new$ethnicity_B[171:183]`: 0 0 0 0 0 0 0 0 0 0 ...
And 969 more differences ...
There are a bunch of new dummy variables: there are now 62 variables in `thyroid_recipe_step_unknown` compared to only 46 in `thyroid_recipe`, and the new ones are all columns of dummy variables with `_unknown` appended.
It's not particularly useful to have an additional `unknown` category in any of our variables, because even if such a category ends up being predictive of outcome it isn't very informative. In clinic, wouldn't you just do the test or assessment to increase your predictive accuracy? Of course you would!
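As a quick sanity check, the 16 extra columns (62 minus 46) can be confirmed by counting names ending in `_unknown` in the baked data. A minimal sketch (the name `baked_unknown` is a stand-in for the prepped-and-baked output of `thyroid_recipe_step_unknown`):

```r
## Count the extra dummy columns created by step_unknown() + step_dummy()
## (baked_unknown is assumed to be the baked output of thyroid_recipe_step_unknown)
unknown_cols <- grep("_unknown$", names(baked_unknown), value = TRUE)
length(unknown_cols)
```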
But let's have a go at fitting a model anyway...
thyroid_recipe_step_unknown <- recipes::recipe(final_pathology ~ ., data = train) |>
recipes::step_normalize(recipes::all_numeric_predictors()) |>
recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
workflows::add_recipe(thyroid_recipe_step_unknown)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
parsnip::set_engine("glmnet")
## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
resamples = cv_folds,
grid = dials::grid_regular(penalty(), levels = 50)
)
→ A | error: x has missing values; consider using makeX() to impute them
There were issues with some computations   A: x100
Oh dear, there were errors with all of the fitted models because we still have missing data in the continuous variables.
These are parameters such as `albumin`
and the other blood assays. As this data set has been compiled retrospectively it will be impossible to go back and obtain assays from those who did not have blood samples taken.
There are a couple of options available at this point...
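One option that stays within the existing recipe is to impute the numeric predictors before the other steps. A minimal sketch using the `recipes` imputation steps (median imputation is chosen purely for illustration; `step_impute_knn()` or `step_impute_bag()` are alternatives):

```r
library(recipes)

## Impute missing numeric predictors first, so glmnet no longer sees NAs
## in albumin, tsh_value etc., then proceed as before
thyroid_recipe_impute <- recipes::recipe(final_pathology ~ ., data = train) |>
  recipes::step_impute_median(recipes::all_numeric_predictors()) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
  recipes::step_dummy(recipes::all_nominal_predictors())
```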
I have started work on imputing missing variables and, as noted above, this allows us to fit models. Personally I'm not a great fan of imputation: it's just filling in the gaps based on correlation, and it assumes the data are missing at random (MAR), which is very hard to assess and likely untrue (e.g. blood tests may be more likely for young people than old, or there may be a bias towards blood tests when imaging shows the thyroid nodule to be larger).
As it looks like we will be using imputed datasets it would be worthwhile working through the MICE vignettes; the first six or so articles cover the important topics.
For each method of imputation used it will be important to pool the inferences across the imputed datasets to get an average; the different imputation methods can then be compared with each other.
Sensitivity analysis will be important too.
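The impute-fit-pool cycle described above looks roughly like this with `mice` (a sketch, assuming the `df_complete` data frame from earlier; `m = 5` and the simple two-predictor model are illustrative choices only):

```r
library(mice)

## Impute m complete datasets, fit the model on each,
## then pool the inferences with Rubin's rules
imp <- mice::mice(df_complete, m = 5, method = "pmm", seed = 42)
fits <- with(imp, glm(final_pathology ~ age_at_scan + gender, family = binomial))
pooled <- mice::pool(fits)
summary(pooled)
```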
Closing, as it's clear this issue is down to missing data being removed by the `recipes::step_*()`
stages prior to running workflows.
The strategy is to use imputed datasets, but this introduces additional work to average models (see #32).
In the current workflow, where we select all variables to be included, we find that on fitting models (`glm()`/`glmnet()`/etc.) only `age_at_scan` and `gender` are retained as variables. Why is this?
Simple model
Start by taking a simplified model with fewer variables and test it with the Tidymodels approach to running logistic regression.
We see that 4 observations are dropped due to missingness. Let's add a few more variables to our predictors.
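For reference, the simplified fit can be expressed with `parsnip` as follows (a sketch; the two predictors here are just an example of a reduced set):

```r
library(parsnip)

## Simplified logistic regression via parsnip, mirroring a plain glm()
simple_fit <- parsnip::logistic_reg() |>
  parsnip::set_engine("glm") |>
  parsnip::fit(final_pathology ~ age_at_scan + gender, data = train)
broom::tidy(simple_fit)
```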
Weird, neither `incidental_nodule` nor `palpable_nodule` have coefficients returned; let's check `palpable_nodule` on its own.
There is an effect there for `palpable_nodule`, although 43 observations are dropped (NB no assessment of significance is made at this point; confidence intervals may be very wide around that point estimate).
Missing data patterns
Let's check the patterns of missing data using `visdat`.
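A sketch of the call (assuming the `df_complete` subset from earlier; `sort_miss` orders the columns by their proportion of missingness):

```r
library(visdat)

## Visualise the pattern of missingness across the selected variables
visdat::vis_miss(df_complete, sort_miss = TRUE)
```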
This is a useful plot: it shows the pattern of missing data across the selected variables. Of particular note is that
`thy_classification`
is missing for almost everyone, so we will have to do one of two things: either exclude this variable or derive a category for those with it missing. I'm not sure, but I suspect it is only available for those who have had biopsies and/or surgery.