ns-rse / sheffield-thyroid

http://blog.nshephard.dev/sheffield-thyroid/
GNU General Public License v3.0
0 stars 0 forks source link

Investigate why recipes/workflows only retain age_at_scan and gender #25

Closed ns-rse closed 1 month ago

ns-rse commented 1 month ago

In the current workflow where we select all variables to be included we find that on fitting models (glm() / glmnet() / etc.) only age_at_scan and gender are retained as variables.

Why is this?

Simple model

Start by taking a simplified model with less variables and test it with the Tidymodels approach to running logistic regression

## Recipe
thyroid_recipe <- recipes::recipe(final_pathology ~ incidental_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_thyroid_workflow <- thyroid_workflow |>
  add_model(logistic_model)
## Fit
log_thyroid_fit <- fit(log_thyroid_workflow, data = train)
log_thyroid_fit
 ══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
          (Intercept)  incidental_nodule_Yes  
                -1.77                  -1.19  

Degrees of Freedom: 857 Total (i.e. Null);  856 Residual
  (4 observations deleted due to missingness)
Null Deviance:      532 
Residual Deviance: 508  AIC: 512

We see that 4 observations are dropped due to "missingness". Lets add a few more variables in to our predictors.

## Recipe
thyroid_recipe <- recipes::recipe(final_pathology ~ incidental_nodule + palpable_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_thyroid_workflow <- thyroid_workflow |>
  add_model(logistic_model)
## Fit
log_thyroid_fit <- fit(log_thyroid_workflow, data = train)
log_thyroid_fit
══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)  
      -2.28  

Degrees of Freedom: 861 Total (i.e. Null);  861 Residual
Null Deviance:      533 
Residual Deviance: 533  AIC: 535

Weird, neither incidental_nodule nor palpable_nodule have coefficients returned, lets check palpable_nodule on its own.

## Recipe
thyroid_recipe <- recipes::recipe(final_pathology ~ palpable_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_thyroid_workflow <- thyroid_workflow |>
  add_model(logistic_model)
## Fit
log_thyroid_fit <- fit(log_thyroid_workflow, data = train)
log_thyroid_fit
══ Workflow [trained] ═════════════════════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
        (Intercept)  palpable_nodule_Yes  
              -3.26                 1.74  

Degrees of Freedom: 818 Total (i.e. Null);  817 Residual
  (43 observations deleted due to missingness)
Null Deviance:      511 
Residual Deviance: 464  AIC: 468

There is an effect there for palpable_nodule although 43 observations are dropped (NB no assessment of significance is made at this point, confidence intervals may be very wide around that point estimate).

Missing data patterns

Lets check the patterns of missing data using visdat

visdat::vis_miss(train) 

image

This is a useful plot, it shows the pattern of missing data across the selected variables and of particular note is that thyroid_classification is missing for almost everyone, we will have to do one of two things, either exclude this variable or derive a category for those with missing. I'm not sure but suspect this is only available for those who have had biopsies and/or surgery.

ns-rse commented 1 month ago

I've been doing some more investigations and have found that there are potential issues with missing data for some of the variables that may be causing problems.

I've used the following script based on logistic regression models that @mdp21oe had written

NB - In order to be comparable with the recipe/workflow that we are investigating I have changed the data being used to train instead of the full raw (but tidy) dataset df. This later (i.e. df) includes 214 individuals who are missing final_pathology and can not be included in the analysis anyway since there is no assessment of their thyroid cancer status. Further we want to work out why the various Tidymodelling tuning doesn't work and that uses the train subset so we need to compare like for like.

Setup

Load libraries and data, remove those with missing final_pathology and split into train and test.

library(dplyr)
library(Hmisc)
library(knitr)
library(readr)
library(rmarkdown)

## Libraries for Tidymodelling
library(dials)
library(kernlab)
library(knitr)
library(tidymodels)
library(tidyverse)
library(vip)

## Load and retain only those with final_pathology
df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
  dplyr::select(
    age_at_scan,
    gender,
    ethnicity,
    incidental_nodule,
    palpable_nodule,
    rapid_enlargement,
    compressive_symptoms,
    hypertension,
    vocal_cord_paresis,
    graves_disease,
    hashimotos_thyroiditis,
    family_history_thyroid_cancer,
    exposure_radiation,
    albumin,
    tsh_value,
    lymphocytes,
    monocyte,
    bta_u_classification,
    solitary_nodule,
    size_nodule_mm,
    cervical_lymphadenopathy,
    thy_classification,
    final_pathology) |>
  dplyr::filter(!is.na(final_pathology))

## Split data into train/test
split <- rsample::initial_split(df_complete, prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)

## Setup cross-validation with k-folds (n = 10)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)

Run glm() on subsets of variables

Clinical variables

## Clinical variables
glm_clin <- glm(final_pathology ~ age_at_scan + gender + incidental_nodule + palpable_nodule +
    rapid_enlargement + compressive_symptoms + hashimotos_thyroiditis + family_history_thyroid_cancer + exposure_radiation +
    tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy,
    data = train,
    family = binomial(link = "logit"))
summary(glm_clin)

Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    palpable_nodule + rapid_enlargement + compressive_symptoms + 
    hashimotos_thyroiditis + family_history_thyroid_cancer + 
    exposure_radiation + tsh_value + size_nodule_mm + solitary_nodule + 
    cervical_lymphadenopathy, family = binomial(link = "logit"), 
    data = train)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -2.004e+00  7.626e-01  -2.627 0.008608 ** 
age_at_scan                      -2.977e-02  1.171e-02  -2.542 0.011026 *  
genderM                           7.614e-01  4.115e-01   1.850 0.064303 .  
incidental_noduleYes             -1.308e-01  4.969e-01  -0.263 0.792384    
palpable_noduleYes               -1.011e-01  5.328e-01  -0.190 0.849506    
rapid_enlargementYes              5.313e-03  9.793e-01   0.005 0.995671    
compressive_symptomsYes           1.170e-01  5.556e-01   0.211 0.833266    
hashimotos_thyroiditisYes        -1.459e+01  2.400e+03  -0.006 0.995149    
family_history_thyroid_cancerYes  1.701e+01  2.400e+03   0.007 0.994343    
exposure_radiationYes            -1.489e+01  1.671e+03  -0.009 0.992889    
tsh_value                         1.336e-01  7.379e-02   1.811 0.070200 .  
size_nodule_mm                    4.478e-02  1.174e-02   3.816 0.000135 ***
solitary_noduleYes                6.990e-01  3.594e-01   1.945 0.051776 .  
cervical_lymphadenopathyYes       3.017e+00  9.198e-01   3.280 0.001038 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 286.13  on 325  degrees of freedom
Residual deviance: 223.54  on 312  degrees of freedom
  (536 observations deleted due to missingness)
AIC: 251.54

Number of Fisher Scoring iterations: 15

glm_clin

Call:  glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    palpable_nodule + rapid_enlargement + compressive_symptoms + 
    hashimotos_thyroiditis + family_history_thyroid_cancer + 
    exposure_radiation + tsh_value + size_nodule_mm + solitary_nodule + 
    cervical_lymphadenopathy + incidental_nodule + tsh_value + 
    size_nodule_mm + solitary_nodule + bta_u_classification, 
    family = binomial(link = "logit"), data = train)

Coefficients:
                     (Intercept)                       age_at_scan                           genderM  
                        -2.98963                          -0.02312                           0.61605  
            incidental_noduleYes                palpable_noduleYes              rapid_enlargementYes  
                        -0.43710                          -0.44385                          -0.18599  
         compressive_symptomsYes         hashimotos_thyroiditisYes  family_history_thyroid_cancerYes  
                         0.22305                         -16.28615                          16.10881  
           exposure_radiationYes                         tsh_value                    size_nodule_mm  
                       -17.31802                           0.15792                           0.03851  
              solitary_noduleYes       cervical_lymphadenopathyYes            bta_u_classificationU3  
                         0.77457                           2.03330                           1.94868  
          bta_u_classificationU4            bta_u_classificationU5  
                         3.12813                          17.74532  

Degrees of Freedom: 312 Total (i.e. Null);  296 Residual
  (549 observations deleted due to missingness)  <<<<<<<<< IMPORTANT!!!
Null Deviance:      275 
Residual Deviance: 180.3    AIC: 214.3

549 observations dropped due to missing data.

Biochemical markers

glm_biochem <- glm(final_pathology ~ age_at_scan + gender + tsh_value + albumin + lymphocytes + monocyte,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_biochem)

Call:
glm(formula = final_pathology ~ age_at_scan + gender + tsh_value + 
    albumin + lymphocytes + monocyte, family = binomial(link = "logit"), 
    data = train)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.73160    2.67009  -0.649 0.516652    
age_at_scan -0.03118    0.01122  -2.780 0.005441 ** 
genderM      1.34973    0.39931   3.380 0.000725 ***
tsh_value    0.11219    0.07189   1.561 0.118623    
albumin      0.04716    0.05323   0.886 0.375708    
lymphocytes  0.27508    0.26067   1.055 0.291292    
monocyte    -3.82506    1.27730  -2.995 0.002748 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 264.17  on 347  degrees of freedom
Residual deviance: 227.00  on 341  degrees of freedom
  (514 observations deleted due to missingness)
AIC: 241

Number of Fisher Scoring iterations: 6

glm_biochem

Call:  glm(formula = final_pathology ~ age_at_scan + gender + tsh_value + 
    albumin + lymphocytes + monocyte, family = binomial(link = "logit"), 
    data = train)

Coefficients:
(Intercept)  age_at_scan      genderM    tsh_value      albumin  lymphocytes     monocyte  
   -1.73160     -0.03118      1.34973      0.11219      0.04716      0.27508     -3.82506  

Degrees of Freedom: 347 Total (i.e. Null);  341 Residual
  (514 observations deleted due to missingness)  <<<<<<<<< IMPORTANT!!!
Null Deviance:      264.2 
Residual Deviance: 227  AIC: 241

514 observations deleted due to missing data.

Ultrasound features

glm_ultrasound <- glm(
    final_pathology ~ age_at_scan + gender + incidental_nodule + tsh_value + size_nodule_mm +
        solitary_nodule + cervical_lymphadenopathy + bta_u_classification, # + thy_classification,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_ultrasound)
Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy + 
    bta_u_classification, family = binomial(link = "logit"), 
    data = train)

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -2.92562    0.67664  -4.324 1.53e-05 ***
age_at_scan                   -0.02624    0.01140  -2.302  0.02135 *  
genderM                        0.62732    0.40513   1.548  0.12152    
incidental_noduleYes          -0.35053    0.38496  -0.911  0.36252    
tsh_value                      0.14586    0.07193   2.028  0.04257 *  
size_nodule_mm                 0.03141    0.01068   2.940  0.00328 ** 
solitary_noduleYes             0.32866    0.36267   0.906  0.36482    
cervical_lymphadenopathyYes    0.64249    0.88127   0.729  0.46597    
bta_u_classificationU3         1.98761    0.38883   5.112 3.19e-07 ***
bta_u_classificationU4         3.15909    0.69606   4.539 5.67e-06 ***
bta_u_classificationU5        18.13639 1164.29414   0.016  0.98757    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 329.11  on 404  degrees of freedom
Residual deviance: 225.02  on 394  degrees of freedom
  (457 observations deleted due to missingness)
AIC: 247.02

Number of Fisher Scoring iterations: 15

glm_ultrasound

Call:  glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    tsh_value + size_nodule_mm + solitary_nodule + cervical_lymphadenopathy + 
    bta_u_classification, family = binomial(link = "logit"), 
    data = train)

Coefficients:
                (Intercept)                  age_at_scan                      genderM         incidental_noduleYes  
                   -2.92562                     -0.02624                      0.62732                     -0.35053  
                  tsh_value               size_nodule_mm           solitary_noduleYes  cervical_lymphadenopathyYes  
                    0.14586                      0.03141                      0.32866                      0.64249  
     bta_u_classificationU3       bta_u_classificationU4       bta_u_classificationU5  
                    1.98761                      3.15909                     18.13639  

Degrees of Freedom: 404 Total (i.e. Null);  394 Residual
  (457 observations deleted due to missingness)  <<<<<<<<< IMPORTANT!!!
Null Deviance:      329.1 
Residual Deviance: 225  AIC: 247

457 observations dropped due to missing data.

Saturated model

In the Tidymodeling we are attempting to use a saturated model with all variables because as can be seen from above the coefficients vary wildly depending on the other co-variates. If we look at the values for age_at_scan and gender which are included in all and they are bouncing around all over the place. What we want is a fully adjusted estimate for each co-variate taking into account all the other variables

glm_all <- glm(
    final_pathology ~ age_at_scan + gender + incidental_nodule + palpable_nodule +
        rapid_enlargement + compressive_symptoms + hashimotos_thyroiditis + family_history_thyroid_cancer + exposure_radiation +
        tsh_value +
        size_nodule_mm +
        solitary_nodule +
        cervical_lymphadenopathy +
        albumin +
        lymphocytes +
        monocyte +
        incidental_nodule +
        tsh_value +
        size_nodule_mm +
        solitary_nodule +
        bta_u_classification,
    data = train,
    family = binomial(link = "logit")
)
glm_all <- glm(
+     final_pathology ~ age_at_scan +
+         gender +
+         incidental_nodule +
+         palpable_nodule +
+         rapid_enlargement +
+         compressive_symptoms +
+         hashimotos_thyroiditis +
+         family_history_thyroid_cancer +
+         exposure_radiation +
+         tsh_value +
+         size_nodule_mm +
+         solitary_nodule +
+         cervical_lymphadenopathy +
+         albumin +
+         lymphocytes +
+         monocyte +
+         incidental_nodule +
+         tsh_value +
+         size_nodule_mm +
+         solitary_nodule +
+         bta_u_classification,
+     data = train,
+     family = binomial(link = "logit")
+ )
+ + + + + + + + + + + + + + + + + + + + + + + + Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Oh no our model didn't fit we've got a weird error cropping up here?!?!?

Looking at the pattern of missing data (see above graph and tables in current rendering of index.qmd) we see the variables with the most missing data are...

Variable N %
thy_classification 915 79.6
albumin 515 44.8
tsh_value 413 35.9
monocyte 363 31.6
lymphocytes 359 31.2
size_nodule_mm 319 27.7
family_history_thyroid_cancer 281 24.4

We have already excluded thy_classification in the above but there are some others with very high levels of missing data, namely the biochemical markers obtained from blood assays albumin, lymphocytes, monocyte and tsh_value.

If individuals are excluded based on these and it leaves a individuals who for one of the other binary variables are invariant, e.g. exposure_radiation where only 6 individuals are Yes and the rest are No then the glm() model will complain with the above message that contrasts can be applied only to factors with 2 or more levels because we have to have variation in our predictor for it to be used in the model (if the subset that remains after dropping out those with missing across all factors are all exposure_radiation = No then we will see this sort of error, although it's worth noting I've not investigated whether this is the problem variable).

But the biochemical markers are also problematic because they have so much missing data. If we exclude these biochemical markers and the exposure_radiation too for good measure as there are so few people with Yes we can fit a saturated model.

## Saturated model
glm_all <- glm(
    final_pathology ~ age_at_scan +
        gender +
        incidental_nodule +
        palpable_nodule +
        rapid_enlargement +
        compressive_symptoms +
        hashimotos_thyroiditis +
        family_history_thyroid_cancer +
        ## exposure_radiation +     <<< We can easily exclude variables by laying out code in
        ## tsh_value +              <<< this manner and adding comments to start of lines ;-)
        size_nodule_mm +
        solitary_nodule +
        cervical_lymphadenopathy +
        ## albumin +
        ## lymphocytes +
        ## monocyte +
        incidental_nodule +
        size_nodule_mm +
        solitary_nodule +
        bta_u_classification,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_all)

Call:
glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    palpable_nodule + rapid_enlargement + compressive_symptoms + 
    hashimotos_thyroiditis + family_history_thyroid_cancer + 
    size_nodule_mm + solitary_nodule + cervical_lymphadenopathy + 
    incidental_nodule + size_nodule_mm + solitary_nodule + bta_u_classification, 
    family = binomial(link = "logit"), data = train)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -2.75668    0.75789  -3.637 0.000276 ***
age_at_scan                        -0.02455    0.01143  -2.147 0.031782 *  
genderM                             0.49636    0.40278   1.232 0.217822    
incidental_noduleYes               -0.61578    0.49288  -1.249 0.211540    
palpable_noduleYes                 -0.18385    0.51993  -0.354 0.723637    
rapid_enlargementYes               -0.27294    1.08007  -0.253 0.800497    
compressive_symptomsYes            -0.12914    0.57947  -0.223 0.823640    
hashimotos_thyroiditisYes         -15.60853 2399.54500  -0.007 0.994810    
family_history_thyroid_cancerYes    0.82812    2.04767   0.404 0.685904    
size_nodule_mm                      0.03477    0.01109   3.136 0.001711 ** 
solitary_noduleYes                  0.67604    0.36639   1.845 0.065018 .  
cervical_lymphadenopathyYes         1.32420    0.98795   1.340 0.180132    
bta_u_classificationU3              2.16334    0.39294   5.506 3.68e-08 ***
bta_u_classificationU4              3.41627    0.75777   4.508 6.53e-06 ***
bta_u_classificationU5             18.92801  954.07210   0.020 0.984172    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 338.70  on 437  degrees of freedom
Residual deviance: 219.56  on 423  degrees of freedom
  (424 observations deleted due to missingness)
AIC: 249.56

Number of Fisher Scoring iterations: 15

glm_all

> 
Call:  glm(formula = final_pathology ~ age_at_scan + gender + incidental_nodule + 
    palpable_nodule + rapid_enlargement + compressive_symptoms + 
    hashimotos_thyroiditis + family_history_thyroid_cancer + 
    size_nodule_mm + solitary_nodule + cervical_lymphadenopathy + 
    incidental_nodule + size_nodule_mm + solitary_nodule + bta_u_classification, 
    family = binomial(link = "logit"), data = train)

Coefficients:
                     (Intercept)                       age_at_scan                           genderM  
                        -2.75668                          -0.02455                           0.49636  
            incidental_noduleYes                palpable_noduleYes              rapid_enlargementYes  
                        -0.61578                          -0.18385                          -0.27294  
         compressive_symptomsYes         hashimotos_thyroiditisYes  family_history_thyroid_cancerYes  
                        -0.12914                         -15.60853                           0.82812  
                  size_nodule_mm                solitary_noduleYes       cervical_lymphadenopathyYes  
                         0.03477                           0.67604                           1.32420  
          bta_u_classificationU3            bta_u_classificationU4            bta_u_classificationU5  
                         2.16334                           3.41627                          18.92801  

Degrees of Freedom: 437 Total (i.e. Null);  423 Residual
  (424 observations deleted due to missingness)
Null Deviance:      338.7 
Residual Deviance: 219.6    AIC: 249.6

424 observations are still dropped due to missing data though!!!

This hasn't resolved all of the problems but is something to bear in mind.

I've asked a question on the Data Science Learning Slack channel (specifically in the #help-r-model channel, but the first link in this paragraph will take you there, although you will need to sign up to the Slack channel first @mdp21oe).

ns-rse commented 1 month ago

Copy of question and perhaps clearer workings of a question I posted to the Data Science Learning Community Slack channel (need to sign up to the Slack community to view hence posting here for reference).

Hi,

I'm trying to wrangle a strange problem I've encountered trying to use the Tidymodels framework. If anyone has any thoughts or ideas as to what might be going on I would be very grateful.

Data

Data pertains to Thyroid cancer (final_pathology) and there are a bunch of covariates and large amounts of missing data but I can demonstrate the problem with just two co-variates incidental_nodule and palpable_nodule.

I can fit models for each of these covariates independently but not together.

Recipe and Workflow

Import the data, remove those with missing final_pathology and split into train/test

df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
  dplyr::select(
    incidental_nodule,
    palpable_nodule,
    final_pathology) |>
  dplyr::filter(!is.na(final_pathology))

## Split data into train/test
split <- rsample::initial_split(df_complete, prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)

Incidental Nodule

glm()

First fit using glm()...

glm_incidental <- glm(
    final_pathology ~ incidental_nodule,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_incidental)
glm_incidental
Call:
glm(formula = final_pathology ~ incidental_nodule, family = binomial(link = "logit"),
    data = train)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)           -1.7800     0.1432 -12.431  < 2e-16 ***
incidental_noduleYes  -1.1287     0.2538  -4.446 8.73e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 536.64  on 858  degrees of freedom
Residual deviance: 514.94  on 857  degrees of freedom
  (3 observations deleted due to missingness)
AIC: 518.94

Number of Fisher Scoring iterations: 5

>
Call:  glm(formula = final_pathology ~ incidental_nodule, family = binomial(link = "logit"),
    data = train)

Coefficients:
         (Intercept)  incidental_noduleYes
              -1.780                -1.129

Degrees of Freedom: 858 Total (i.e. Null);  857 Residual
  (3 observations deleted due to missingness)
Null Deviance:      536.6
Residual Deviance: 514.9    AIC: 518.9

Tidymodels

Now fit with Tidymodels...

recipe_incidental <- recipes::recipe(final_pathology ~ incidental_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_incidental <- workflows::workflow() |>
  workflows::add_recipe(recipe_incidental)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_incidental <- workflow_incidental |>
  add_model(logistic_model)
## Fit
log_incidental_fit <- fit(log_incidental, data = train)
log_incidental_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
          (Intercept)  incidental_nodule_Yes
               -1.780                 -1.129

Degrees of Freedom: 858 Total (i.e. Null);  857 Residual
  (3 observations deleted due to missingness)
Null Deviance:      536.6
Residual Deviance: 514.9    AIC: 518.9

Great we can fit a glm with both glm() and a Tidymodels recipe and workflow for `incidental_nodule** and we get the same coefficients fitted.

Palpable Nodule

glm()

glm_palpable <- glm(
    final_pathology ~ palpable_nodule,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_palpable)
glm_palpable

Call:
glm(formula = final_pathology ~ palpable_nodule, family = binomial(link = "logit"),
    data = train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         -3.0975     0.2231 -13.887  < 2e-16 ***
palpable_noduleYes   1.5238     0.2663   5.721 1.06e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 514.94  on 817  degrees of freedom
Residual deviance: 477.51  on 816  degrees of freedom
  (44 observations deleted due to missingness)
AIC: 481.51

Number of Fisher Scoring iterations: 5

>
Call:  glm(formula = final_pathology ~ palpable_nodule, family = binomial(link = "logit"),
    data = train)

Coefficients:
       (Intercept)  palpable_noduleYes
            -3.098               1.524

Degrees of Freedom: 817 Total (i.e. Null);  816 Residual
  (44 observations deleted due to missingness)
Null Deviance:      514.9
Residual Deviance: 477.5    AIC: 481.5

Tidymodels

recipe_palpable <- recipes::recipe(final_pathology ~ palpable_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_palpable <- workflows::workflow() |>
  workflows::add_recipe(recipe_palpable)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_palpable <- workflow_palpable |>
  add_model(logistic_model)
## Fit
log_palpable_fit <- fit(log_palpable, data = train)
log_palpable_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
        (Intercept)  palpable_nodule_Yes
             -3.098                1.524

Degrees of Freedom: 817 Total (i.e. Null);  816 Residual
  (44 observations deleted due to missingness)
Null Deviance:      514.9
Residual Deviance: 477.5    AIC: 481.5

Again we have the same model fitted which is good and as I would hope for.

Incidental + Palpable

Now lets try both incidental_nodule and palpable_nodule in the model. Start with a basic call to glm() first.

glm()

glm_incidental_palpable <- glm(
    final_pathology ~ incidental + palpable_nodule,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_incidental_palpable)
glm_incidental_palpable

Call:
glm(formula = final_pathology ~ incidental_nodule + palpable_nodule,
    family = binomial(link = "logit"), data = train)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)           -2.7353     0.3646  -7.502 6.29e-14 ***
incidental_noduleYes  -0.4473     0.3708  -1.207  0.22762
palpable_noduleYes     1.2076     0.3692   3.271  0.00107 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 514.94  on 817  degrees of freedom
Residual deviance: 476.03  on 815  degrees of freedom
  (44 observations deleted due to missingness)
AIC: 482.03

Number of Fisher Scoring iterations: 5

>
Call:  glm(formula = final_pathology ~ incidental_nodule + palpable_nodule,
    family = binomial(link = "logit"), data = train)

Coefficients:
         (Intercept)  incidental_noduleYes    palpable_noduleYes
             -2.7353               -0.4473                1.2076

Degrees of Freedom: 817 Total (i.e. Null);  815 Residual
  (44 observations deleted due to missingness)
Null Deviance:      514.9
Residual Deviance: 476  AIC: 482

Great we can fit a model using two co-variates lets try the Tidymodels approach...

Tidymodels

recipe_incidental_palpable <- recipes::recipe(final_pathology ~ incidental_nodule + palpable_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_incidental_palpable <- workflows::workflow() |>
  workflows::add_recipe(recipe_incidental_palpable)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_incidental_palpable <- workflow_incidental_palpable |>
  add_model(logistic_model)
## Fit
log_incidental_palpable_fit <- fit(log_incidental_palpable, data = train)
log_incidental_palpable_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)
     -2.266

Degrees of Freedom: 861 Total (i.e. Null);  861 Residual
Null Deviance:      537.2
Residual Deviance: 537.2    AIC: 539.2

Oh no!!! Where have our estimate for the co-variates gone? We could fit a model using glm() directly but not using the Tidymodels recipe() + workflow() + model() + fit().

I'm baffled as to why this happens. If I substitute one of the other covariates (age_at_scan) it is retained but the incidental_nodule or palpable_nodule (or both) are dropped.

glm_age_palpable <- glm(
    final_pathology ~ age_at_scan+ palpable_nodule,
    data = train,
    family = binomial(link = "logit")
)
summary(glm_age_palpable)
glm_age_palpable

Call:  glm(formula = final_pathology ~ age_at_scan + palpable_nodule,
    family = binomial(link = "logit"), data = train)

Coefficients:
       (Intercept)         age_at_scan  palpable_noduleYes
          -1.75935            -0.02615             1.49668

Degrees of Freedom: 817 Total (i.e. Null);  815 Residual
  (44 observations deleted due to missingness)
Null Deviance:      514.9
Residual Deviance: 464.9    AIC: 470.9

Get coefficients for both age_at_scan and palpable_nodule but with Tidymodels the later goes missing.

recipe_age_palpable <- recipes::recipe(final_pathology ~ age_at_scan + palpable_nodule, data = train) |>
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
## Workflow
workflow_age_palpable <- workflows::workflow() |>
  workflows::add_recipe(recipe_age_palpable)
## Model
logistic_model <- logistic_reg() |>
  set_engine("glm")
log_age_palpable <- workflow_age_palpable |>
  add_model(logistic_model)
## Fit
log_age_palpable_fit <- fit(log_age_palpable, data = train)
log_age_palpable_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_filter_missing()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept)  age_at_scan
    -2.3489      -0.4525

Degrees of Freedom: 861 Total (i.e. Null);  860 Residual
Null Deviance:      537.2
Residual Deviance: 522.4    AIC: 526.4
>

I can see the degrees of freedom differ, presumably because palpable_nodule is being dropped.

ns-rse commented 1 month ago

I've been working on imputing missing values and have therefore decided to test whether we see the same strange behaviour on a dataset without any missing data (i.e. after imputation).

The following code imputes a single data set using the Predictive Mean Matching method implemented in mice.

It then attempts to fit a LASSO model using the TidyModels approach.

library(dplyr)
library(Hmisc)
library(knitr)
library(readr)
library(rmarkdown)

## Libraries for Tidymodelling
library(dials)
library(kernlab)
library(knitr)
library(mice)
library(stringr)
library(tidymodels)
library(tidyverse)
library(vip)

## Load and retain only those with final_pathology
df <- readRDS(paste(r_dir, "clean.rds", sep = "/"))
df_complete <- df |>
  dplyr::select(
    age_at_scan,
    gender,
    ethnicity,
    incidental_nodule,
    palpable_nodule,
    rapid_enlargement,
    compressive_symptoms,
    hypertension,
    vocal_cord_paresis,
    graves_disease,
    hashimotos_thyroiditis,
    family_history_thyroid_cancer,
    exposure_radiation,
    albumin,
    tsh_value,
    lymphocytes,
    monocyte,
    bta_u_classification,
    solitary_nodule,
    size_nodule_mm,
    cervical_lymphadenopathy,
    thy_classification,
    final_pathology) |>
  dplyr::filter(!is.na(final_pathology))

##########################################

## 2024-07-18 - Testing with an imputed data set
##
## Impute data
imputations = 1
cores = imputations
mice_pmm_single <- df_complete |>
    dplyr::select(-final_pathology) |> ## We deliberately exclude the outcome of interest
    mice::futuremice(m = imputations, method = "pmm", n.core = cores) |>
    mice::complete(action = "long", include = FALSE)
## Bind the final_pathology, removing the .id and .imp columns too and renaming columns
mice_pmm_single <- cbind(mice_pmm_single, df_complete$final_pathology) |>
    dplyr::select(-.id)
colnames(mice_pmm_single) <- stringr::str_replace(colnames(mice_pmm_single), "df_complete\\$", "")

## Split a single imputed data set into train/test
split <- mice_pmm_single |>
    dplyr::filter(.imp == 1) |>
    dplyr::select(-.imp) |>
    rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)

## Setup cross-validation with k-folds (n = 10)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)

## Setup a recipe and workflow
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
  ## @ns-rse 2024-06-14 :
  ## This step can be used to filter observations with missing data, see the manual pages for more details
  ## https://recipes.tidymodels.org/reference/step_filter_missing.html
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  ## @ns-rse 2024-06-14 :
  ## We first normalise the data _before_ we generate dummies otherwise the dummies, which are numerical, get normalised
  ## too
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe)

## Try tuning a LASSO model
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
  parsnip::set_engine("glmnet")

## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)

## K-fold best fit for LASSO
lasso_kfold_roc_auc <- lasso_grid |>
  tune::select_best(metric = "roc_auc")

## Fit the final LASSO model
final_lasso_kfold <- tune::finalize_workflow(
  workflows::add_model(thyroid_workflow, tune_spec_lasso),
  lasso_kfold_roc_auc)

final_lasso_kfold |>
  fit(train) |>
  hardhat::extract_fit_parsnip() |>
  vip::vi(lambda = lasso_kfold_roc_auc$penalty) |>
  dplyr::mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) |>
  ggplot(mapping = aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  dark_theme_minimal()

This "works" as the "importance" plot generated at the end includes all of the co-variates and not just two....

lasso_from_imputed

Thus there is a problem somewhere in the thyroid_recipe that we setup with the code below when applied to the real-life dataset.

thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
  ## @ns-rse 2024-06-14 :
  ## This step can be used to filter observations with missing data, see the manual pages for more details
  ## https://recipes.tidymodels.org/reference/step_filter_missing.html
  recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
  ## @ns-rse 2024-06-14 :
  ## We first normalise the data _before_ we generate dummies otherwise the dummies, which are numerical, get normalised
  ## too
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())

There are three components to the recipe here...

Lets consider them in turn.

recipes::step_filter_missing(recipes::all_predictors(), threshold = 0)

This first step is saying...

Filter missing for all predictors, and remove variables that have any missing data (the threshold = 0 bit)

This may be the problematic step since it will remove any predictor where there is any missing data.

ns-rse commented 1 month ago

The following demonstrates that it is definitely the recipes::step_filter_missing() step that is causing problem.

The steps take the recipes::recipe() and then manually recipes::prep(https://recipes.tidymodels.org/reference/prep.html) it before recipes::bake() (as noted in the documentation here the recommended approach is to use workflows::workflow() as we have done).

str() shows the structure of an R object and this demonstrates that with a threshold = 0 or threshold = 0.5 for removing a column with missing data both remove all but age and gender_M.

All steps

Almost all variables are removed...

split <- df_complete|>
    rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)
+ > > > ## Setup a recipe and workflow with all steps
thyroid_recipe_all<- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_filter_missing(recipes::all_predictors(), threshold = 0) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep()
all <- thyroid_recipe_all |> bake(new_data = NULL)
str(all)

> + + + + > > tibble [862 × 3] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan    : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
  ..- attr(*, "label")= chr "Age"
 $ final_pathology: Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
 $ gender_M       : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
> 

threshold = 0.5

With a higher threshold for removing variables (has to have >= 50% missing) more variables are retained.

> thyroid_recipe_high_missing_threshold<- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.5) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep()
high_missing_threshold <- thyroid_recipe_high_missing_threshold |> bake(new_data = NULL)
str(high_missing_threshold)

+ + + + > > tibble [862 × 39] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                      : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
  ..- attr(*, "label")= chr "Age"
 $ albumin                          : 'labelled' num [1:862] 0.598 0.334 NA 1.126 -0.986 ...
  ..- attr(*, "label")= chr "Albumin"
 $ tsh_value                        : 'labelled' num [1:862] -0.737 NA -0.969 0.535 -0.969 ...
  ..- attr(*, "label")= chr "TSH value"
 $ lymphocytes                      : 'labelled' num [1:862] NA 1.421 0.518 -0.453 0.368 ...
  ..- attr(*, "label")= chr "Lymphocytes"
 $ monocyte                         : 'labelled' num [1:862] NA 1.4018 -0.0807 -0.0313 0.0181 ...
  ..- attr(*, "label")= chr "Monocytes"
 $ size_nodule_mm                   : 'labelled' num [1:862] NA 1.164 0.362 2.274 -0.316 ...
  ..- attr(*, "label")= chr "Nodule size (mm)"
 $ final_pathology                  : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
 $ gender_M                         : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_B                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                      : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_D                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                      : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                      : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                      : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
 $ incidental_nodule_Yes            : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ palpable_nodule_Yes              : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
 $ rapid_enlargement_Yes            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes         : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
 $ hypertension_Yes                 : num [1:862] 0 1 0 0 0 0 0 NA 0 0 ...
 $ vocal_cord_paresis_Yes           : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ graves_disease_Yes               : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ hashimotos_thyroiditis_Yes       : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 0 0 0 1 NA 0 0 ...
 $ exposure_radiation_Yes           : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ bta_u_classification_U2          : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
 $ bta_u_classification_U3          : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
 $ bta_u_classification_U4          : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5          : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes              : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
 $ cervical_lymphadenopathy_Yes     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...

Remove recipes::step_filter_missing() completely

If we remove the filtering of missing step completely all variables are retained.

> thyroid_recipe_skip_filter_missing <- recipes::recipe(final_pathology ~ ., data = train) |>
    ## recipes::step_filter_missing(recipes::all_predictors(), threshold = 0, skip = TRUE) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep()
skip_filter_missing<- thyroid_recipe_skip_filter_missing |> bake(new_data = NULL)
str(skip_filter_missing)

> > + + + + > > tibble [862 × 46] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                      : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
  ..- attr(*, "label")= chr "Age"
 $ albumin                          : 'labelled' num [1:862] 0.598 0.334 NA 1.126 -0.986 ...
  ..- attr(*, "label")= chr "Albumin"
 $ tsh_value                        : 'labelled' num [1:862] -0.737 NA -0.969 0.535 -0.969 ...
  ..- attr(*, "label")= chr "TSH value"
 $ lymphocytes                      : 'labelled' num [1:862] NA 1.421 0.518 -0.453 0.368 ...
  ..- attr(*, "label")= chr "Lymphocytes"
 $ monocyte                         : 'labelled' num [1:862] NA 1.4018 -0.0807 -0.0313 0.0181 ...
  ..- attr(*, "label")= chr "Monocytes"
 $ size_nodule_mm                   : 'labelled' num [1:862] NA 1.164 0.362 2.274 -0.316 ...
  ..- attr(*, "label")= chr "Nodule size (mm)"
 $ final_pathology                  : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
 $ gender_M                         : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_B                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                      : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_D                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                      : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                      : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                      : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
 $ incidental_nodule_Yes            : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ palpable_nodule_Yes              : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
 $ rapid_enlargement_Yes            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes         : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
 $ hypertension_Yes                 : num [1:862] 0 1 0 0 0 0 0 NA 0 0 ...
 $ vocal_cord_paresis_Yes           : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ graves_disease_Yes               : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ hashimotos_thyroiditis_Yes       : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 0 0 0 1 NA 0 0 ...
 $ exposure_radiation_Yes           : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ bta_u_classification_U2          : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
 $ bta_u_classification_U3          : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
 $ bta_u_classification_U4          : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5          : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes              : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
 $ cervical_lymphadenopathy_Yes     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy1c         : num [1:862] NA 0 NA NA 0 0 NA NA NA 1 ...
 $ thy_classification_Thy2          : num [1:862] NA 1 NA NA 0 0 NA NA NA 0 ...
 $ thy_classification_Thy2c         : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
 $ thy_classification_Thy3a         : num [1:862] NA 0 NA NA 1 0 NA NA NA 0 ...
 $ thy_classification_Thy3f         : num [1:862] NA 0 NA NA 0 1 NA NA NA 0 ...
 $ thy_classification_Thy4          : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
 $ thy_classification_Thy5          : num [1:862] NA 0 NA NA 0 0 NA NA NA 0 ...
> 

threshold = 0.1

If we exclude variables with > 10% of observations missing we retain the following variables.


> thyroid_recipe_investigate_missing_threshold<- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.1) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep()
investigate_missing_threshold <- thyroid_recipe_investigate_missing_threshold |> bake(new_data = NULL)
str(investigate_missing_threshold)

+ + + + > > tibble [862 × 31] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                 : 'labelled' num [1:862] -0.361 0.978 -0.884 0.454 0.105 ...
  ..- attr(*, "label")= chr "Age"
 $ final_pathology             : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 2 1 1 1 1 ...
 $ gender_M                    : num [1:862] 1 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_B                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                 : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ ethnicity_D                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                 : num [1:862] 1 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                 : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                 : num [1:862] 0 0 0 1 0 0 0 0 0 0 ...
 $ incidental_nodule_Yes       : num [1:862] 0 0 0 0 0 0 0 0 1 0 ...
 $ palpable_nodule_Yes         : num [1:862] 0 1 0 1 0 1 0 0 0 1 ...
 $ rapid_enlargement_Yes       : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes    : num [1:862] 0 0 0 1 0 0 0 0 0 1 ...
 $ vocal_cord_paresis_Yes      : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ graves_disease_Yes          : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ hashimotos_thyroiditis_Yes  : num [1:862] 0 0 0 0 0 0 0 NA 0 0 ...
 $ bta_u_classification_U2     : num [1:862] 1 0 NA 1 0 0 1 1 0 0 ...
 $ bta_u_classification_U3     : num [1:862] 0 1 NA 0 1 1 0 0 1 1 ...
 $ bta_u_classification_U4     : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5     : num [1:862] 0 0 NA 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes         : num [1:862] 0 1 0 0 0 1 0 0 1 0 ...
 $ cervical_lymphadenopathy_Yes: num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
> 

We can then try fitting a model using this recipe with a higher threshold.

thyroid_recipe_investigate_missing_threshold<- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_filter_missing(recipes::all_predictors(), threshold = 0.1) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe_investigate_missing_threshold)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
  parsnip::set_engine("glmnet")
## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)
## K-fold best fit for LASSO
lasso_kfold_roc_auc <- lasso_grid |>
  tune::select_best(metric = "roc_auc")
+ + + > + > > + > > + + + + → A | warning: ! There are new levels in `ethnicity`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ B | error:   x has missing values; consider using makeX() to impute them
→ C | warning: ! There are new levels in `ethnicity`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `exposure_radiation`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ D | warning: ! There are new levels in `ethnicity`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hypertension`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `exposure_radiation`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
→ E | warning: ! There are new levels in `ethnicity`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `incidental_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `palpable_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `rapid_enlargement`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `compressive_symptoms`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hypertension`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `vocal_cord_paresis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `graves_disease`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `hashimotos_thyroiditis`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `bta_u_classification`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `solitary_nodule`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values., ! There are new levels in `cervical_lymphadenopathy`: NA.
               ℹ Consider using step_unknown() (`?recipes::step_unknown()`) before `step_dummy()` to handle missing values.
There were issues with some computations   A: x51   B: x100   C: x20   D: x19   E: x10
> > + Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.

Not great, !All models failed :cry:

But we have some useful information on things we can try recipes::step_unknown() could be useful here and we could perhaps be smarter about what predictor variables we are including in the model by leveraging recipes::step_corr() to remove correlated variables, perhaps preferentially retaining those with the most complete data to minimise the impact of missing data.

ns-rse commented 1 month ago

recipes::step_corr()

Lets start by removing recipes::step_filter_missing() and look at what normalisation and dummy steps do...

recipes::step_normalize() and recipes::step_dummy()

split <- df_complete|>
    rsample::initial_split(prop = 0.75)
train <- rsample::training(split)
test <- rsample::testing(split)

## Without removing anything what doe we have
thyroid_recipe <- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep() |>
    recipes::bake(new_data = NULL)
str(thyroid_recipe)

+ + + + > tibble [862 × 46] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                      : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
  ..- attr(*, "label")= chr "Age"
 $ albumin                          : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
  ..- attr(*, "label")= chr "Albumin"
 $ tsh_value                        : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
  ..- attr(*, "label")= chr "TSH value"
 $ lymphocytes                      : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
  ..- attr(*, "label")= chr "Lymphocytes"
 $ monocyte                         : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
  ..- attr(*, "label")= chr "Monocytes"
 $ size_nodule_mm                   : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
  ..- attr(*, "label")= chr "Nodule size (mm)"
 $ final_pathology                  : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender_M                         : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
 $ ethnicity_B                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                      : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
 $ ethnicity_D                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                      : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
 $ incidental_nodule_Yes            : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
 $ palpable_nodule_Yes              : num [1:862] 1 NA 1 0 0 0 1 0 1 0 ...
 $ rapid_enlargement_Yes            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes         : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
 $ hypertension_Yes                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ vocal_cord_paresis_Yes           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ graves_disease_Yes               : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hashimotos_thyroiditis_Yes       : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 NA 0 NA 0 0 0 0 ...
 $ exposure_radiation_Yes           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U2          : num [1:862] 1 1 1 1 1 1 1 1 1 0 ...
 $ bta_u_classification_U3          : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
 $ bta_u_classification_U4          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes              : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
 $ cervical_lymphadenopathy_Yes     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy1c         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy2          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy2c         : num [1:862] NA NA 1 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy3a         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy3f         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy4          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy5          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
> 

+ recipes::step_corr(threshold = 0.9)

Now add in the recipes_step_corr(threshold = 0.9)

thyroid_recipe_step_corr <- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::step_corr(all_predictors(), threshold = 0.9) |>
    recipes::prep() |>
    recipes::bake(new_data = NULL)
str(thyroid_recipe_step_corr)

> thyroid_recipe_step_corr <- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::step_corr(all_predictors(), threshold = 0.9) |>
    recipes::prep() |>
    recipes::bake(new_data = NULL)
str(thyroid_recipe_step_corr)

+ + + + + > tibble [862 × 45] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                      : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
  ..- attr(*, "label")= chr "Age"
 $ albumin                          : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
  ..- attr(*, "label")= chr "Albumin"
 $ tsh_value                        : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
  ..- attr(*, "label")= chr "TSH value"
 $ lymphocytes                      : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
  ..- attr(*, "label")= chr "Lymphocytes"
 $ monocyte                         : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
  ..- attr(*, "label")= chr "Monocytes"
 $ size_nodule_mm                   : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
  ..- attr(*, "label")= chr "Nodule size (mm)"
 $ final_pathology                  : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender_M                         : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
 $ ethnicity_B                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                      : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
 $ ethnicity_D                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                      : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                      : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
 $ incidental_nodule_Yes            : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
 $ palpable_nodule_Yes              : num [1:862] 1 NA 1 0 0 0 1 0 1 0 ...
 $ rapid_enlargement_Yes            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes         : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
 $ hypertension_Yes                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ vocal_cord_paresis_Yes           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ graves_disease_Yes               : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hashimotos_thyroiditis_Yes       : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ family_history_thyroid_cancer_Yes: num [1:862] 0 0 0 NA 0 NA 0 0 0 0 ...
 $ exposure_radiation_Yes           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U3          : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
 $ bta_u_classification_U4          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes              : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
 $ cervical_lymphadenopathy_Yes     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy1c         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy2          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy2c         : num [1:862] NA NA 1 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy3a         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy3f         : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy4          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
 $ thy_classification_Thy5          : num [1:862] NA NA 0 NA NA NA NA NA NA NA ...
> 

What is different? We can use the waldo::compare() package and function to tell us quickly if these are the same.

install.packages("waldo")

waldo::compare(thyroid_recipe, thyroid_recipe_step_corr)
`old` is length 46
`new` is length 45

     names(old)                          | names(new)                              
[31] "hashimotos_thyroiditis_Yes"        | "hashimotos_thyroiditis_Yes"        [31]
[32] "family_history_thyroid_cancer_Yes" | "family_history_thyroid_cancer_Yes" [32]
[33] "exposure_radiation_Yes"            | "exposure_radiation_Yes"            [33]
[34] "bta_u_classification_U2"           -                                         
[35] "bta_u_classification_U3"           | "bta_u_classification_U3"           [34]
[36] "bta_u_classification_U4"           | "bta_u_classification_U4"           [35]
[37] "bta_u_classification_U5"           | "bta_u_classification_U5"           [36]

`old$bta_u_classification_U2` is a double vector (1, 1, 1, 1, 1, ...)
`new$bta_u_classification_U2` is absent

...which tells us that one level of bta_u_classification (the U2 level) has been dropped by by recipes::step_corr().

Ok, not too much benefit to doing this but that is considerably quicker than comparing by eye and trying to work out what is different.

ns-rse commented 1 month ago

recipes::step_unknown()

Lets now investigate the effect of using recipes::step_unknown() on our recipe. This step needs to be run before recipes::step_dummy() and only works on nominal/binary/factor variables.

We again use waldo::compare() to see how the resulting data set differs from the thyroid_recipe from the above step which normalises and creates dummy variables.

thyroid_recipe_step_unknown <- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
    recipes::step_dummy(recipes::all_nominal_predictors()) |>
    recipes::prep() |>
    recipes::bake(new_data = NULL)
str(thyroid_recipe_step_unknown)

+ + + + + > tibble [862 × 62] (S3: tbl_df/tbl/data.frame)
 $ age_at_scan                          : 'labelled' num [1:862] -0.132 -0.832 -2.115 -1.124 -0.657 ...
  ..- attr(*, "label")= chr "Age"
 $ albumin                              : 'labelled' num [1:862] NA NA -0.759 NA -0.759 ...
  ..- attr(*, "label")= chr "Albumin"
 $ tsh_value                            : 'labelled' num [1:862] NA NA NA -0.25 -0.143 ...
  ..- attr(*, "label")= chr "TSH value"
 $ lymphocytes                          : 'labelled' num [1:862] NA NA -1.254 0.314 0.273 ...
  ..- attr(*, "label")= chr "Lymphocytes"
 $ monocyte                             : 'labelled' num [1:862] NA NA -0.279 -0.33 0.285 ...
  ..- attr(*, "label")= chr "Monocytes"
 $ size_nodule_mm                       : 'labelled' num [1:862] -0.8169 0.0526 3.7788 NA -1.1398 ...
  ..- attr(*, "label")= chr "Nodule size (mm)"
 $ final_pathology                      : Factor w/ 2 levels "Benign","Cancer": 1 1 1 1 1 1 1 1 1 1 ...
 $ gender_M                             : num [1:862] 0 0 0 0 1 0 0 0 0 1 ...
 $ gender_unknown                       : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_B                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_C                          : num [1:862] 0 0 0 0 0 1 0 0 0 0 ...
 $ ethnicity_D                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_F                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_G                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_H                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_J                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_K                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_L                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_M                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_N                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_P                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_R                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_S                          : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity_Z                          : num [1:862] 0 0 0 1 1 0 0 0 0 0 ...
 $ ethnicity_unknown                    : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ incidental_nodule_Yes                : num [1:862] 0 0 0 1 1 0 0 1 0 1 ...
 $ incidental_nodule_unknown            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ palpable_nodule_Yes                  : num [1:862] 1 0 1 0 0 0 1 0 1 0 ...
 $ palpable_nodule_unknown              : num [1:862] 0 1 0 0 0 0 0 0 0 0 ...
 $ rapid_enlargement_Yes                : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ rapid_enlargement_unknown            : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_Yes             : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
 $ compressive_symptoms_unknown         : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hypertension_Yes                     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hypertension_unknown                 : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ vocal_cord_paresis_Yes               : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ vocal_cord_paresis_unknown           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ graves_disease_Yes                   : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ graves_disease_unknown               : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hashimotos_thyroiditis_Yes           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ hashimotos_thyroiditis_unknown       : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ family_history_thyroid_cancer_Yes    : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ family_history_thyroid_cancer_unknown: num [1:862] 0 0 0 1 0 1 0 0 0 0 ...
 $ exposure_radiation_Yes               : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ exposure_radiation_unknown           : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U2              : num [1:862] 1 1 1 1 1 1 1 1 1 0 ...
 $ bta_u_classification_U3              : num [1:862] 0 0 0 0 0 0 0 0 0 1 ...
 $ bta_u_classification_U4              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_U5              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ bta_u_classification_unknown         : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ solitary_nodule_Yes                  : num [1:862] 0 0 1 0 1 0 0 0 1 0 ...
 $ solitary_nodule_unknown              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ cervical_lymphadenopathy_Yes         : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ cervical_lymphadenopathy_unknown     : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy1c             : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy2              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy2c             : num [1:862] 0 0 1 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy3a             : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy3f             : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy4              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_Thy5              : num [1:862] 0 0 0 0 0 0 0 0 0 0 ...
 $ thy_classification_unknown           : num [1:862] 1 1 0 1 1 1 1 1 1 1 ...
> > > 

And comparing them we see...

## Compare the packages
waldo::compare(thyroid_recipe, thyroid_recipe_step_unknown)

> waldo::compare(thyroid_recipe, thyroid_recipe_step_unknown)

`old` is length 46
`new` is length 62

`names(old)[6:11]`: "size_nodule_mm" "final_pathology" "gender_M"                  "ethnicity_B" "ethnicity_C" "ethnicity_D"
`names(new)[6:12]`: "size_nodule_mm" "final_pathology" "gender_M" "gender_unknown" "ethnicity_B" "ethnicity_C" "ethnicity_D"

     names(old)              | names(new)                                 
[21] "ethnicity_R"           | "ethnicity_R"               [22]           
[22] "ethnicity_S"           | "ethnicity_S"               [23]           
[23] "ethnicity_Z"           | "ethnicity_Z"               [24]           
                             - "ethnicity_unknown"         [25]           
[24] "incidental_nodule_Yes" | "incidental_nodule_Yes"     [26]           
                             - "incidental_nodule_unknown" [27]           
[25] "palpable_nodule_Yes"   | "palpable_nodule_Yes"       [28]           
                             - "palpable_nodule_unknown"   [29]           
[26] "rapid_enlargement_Yes" | "rapid_enlargement_Yes"     [30]           
                             - "rapid_enlargement_unknown" [31]           
 ... ...                       ...                         and 26 more ...

`names(old)[44:46]`: "thy_classification_Thy3f" "thy_classification_Thy4" "thy_classification_Thy5"                             
`names(new)[59:62]`: "thy_classification_Thy3f" "thy_classification_Thy4" "thy_classification_Thy5" "thy_classification_unknown"

`old$ethnicity_B[38:44]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[38:44]`: 0 0 0  0 0 0 0

`old$ethnicity_B[48:54]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[48:54]`: 0 0 0  0 0 0 0

`old$ethnicity_B[58:64]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[58:64]`: 0 0 0  0 0 0 0

`old$ethnicity_B[110:116]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[110:116]`: 0 0 0  0 0 0 0

`old$ethnicity_B[133:139]`: 0 0 0 NA 0 0 0
`new$ethnicity_B[133:139]`: 0 0 0  0 0 0 0

`old$ethnicity_B[171:183]`: 0 0 0 NA 0 0 0 0 0 NA and 3 more...
`new$ethnicity_B[171:183]`: 0 0 0  0 0 0 0 0 0  0           ...

And 969 more differences ...

There are a bunch of new dummy variables as there are now 62 variables in thyroid_recipe_step_unknown compared to only 46 in thyroid_recipe and these are all columns of dummy variables with _unknown appended.

Can we fit a model with this?

Its not particularly useful to have an additional unknown category in any of our variables because if such categories end up being predictive of outcome its not very informative. In clinic would you not do a test or assessment to increase your predictive accuracy? No you wouldn't!

But lets have a go at fitting a model anyway...

> thyroid_recipe_step_unknown <- recipes::recipe(final_pathology ~ ., data = train) |>
    recipes::step_normalize(recipes::all_numeric_predictors()) |>
    recipes::step_unknown(recipes::all_nominal_predictors(), new_level = "unknown") |>
    recipes::step_dummy(recipes::all_nominal_predictors())
thyroid_workflow <- workflows::workflow() |>
  workflows::add_recipe(thyroid_recipe_step_unknown)
cv_folds <- rsample::vfold_cv(train, v = 10, repeats = 10)
tune_spec_lasso <- parsnip::logistic_reg(penalty = hardhat::tune(), mixture = 1) |>
  parsnip::set_engine("glmnet")
## Tune the LASSO parameters via cross-validation
lasso_grid <- tune::tune_grid(
  object = workflows::add_model(thyroid_workflow, tune_spec_lasso),
  resamples = cv_folds,
  grid = dials::grid_regular(penalty(), levels = 50)
)

+ + + > + > > + > > + + + + → A | error:   x has missing values; consider using makeX() to impute them
There were issues with some computations   A: x100
> 

Oh dear, there were errors with all of the fitted models because we still have missing data in the continuous variables.

These are parameters such as albumin and the other blood assays. As this data set has been retrospectively completed it will be impossible to go back and get assays from those who did not have blood samples taken and assays performed.

What to do?

There are a couple of options available at this point...

  1. Massively reduce the predictor variables/covariates being tested.
  2. Impute missing values.

I have started work on imputing missing variables and as noted above this allows us to fit models. Personally I'm not a great fan of imputation its just filling in the gaps based on correlation and assumes that data is missing at random (MAR), which is very hard to assess and likely untrue (e.g. blood tests may be more likely on young people compared to old or there may be a bias for blood tests if imaging shows the thyroid nodule to be larger).

ns-rse commented 1 month ago

As it looks like we will be using imputed datasets it would be worthwhile working through the MICE Vignettes that cover important topics the first six or so articles cover.

It will be important for each method of imputation used to combine the inferences and get an average, these different imputation methods can then be compared to each other.

Sensitivity analysis will be important too.

ns-rse commented 1 month ago

Closing as its clear this issue is down to missing data being removed by the recipes::step_*() stages prior to running workflows.

Strategy is to use imputed datasets but this introduces additional work to average models (see #32)