nhs-bnssg-analytics / d_and_c

Scoping the possibility of predicting performance from demand and capacity metrics

Logistic regression understanding #24

Open sebsfox opened 5 months ago

sebsfox commented 5 months ago

glmnet

When moving the logistic regression code over to glmnet and then applying the tune() function to select the penalty and mixture hyperparameters, I started to get errors when fitting the model. This looks to be because the outcome variable is a two-column matrix, c(remainder, numerator). I have narrowed this down to an issue with glmnet rather than tune().

Testing outcome = c(remainder, numerator) compared with outcome = value

Using the same predictors, I have worked through fitting a model to predict both value and c(remainder, numerator) (both using tidymodels and raw glm). Fitting a model to predict value produces different coefficients to a model predicting c(remainder, numerator), though the different modelling approaches are consistent with each other (eg, tidymodels predicting value gives the same model as glm predicting value).

By using case weights, as shown in this file, I've managed to make the two approaches consistent. This has been implemented in the main modelling code in this commit: 861fbdeb645d68a95f12ab7d591225a6c10bf43c
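As a minimal illustration of that equivalence (made-up data and column names, not the project's real variables), a binomial glm with the two-column outcome and a binomial glm on the proportion with the denominator as case weights produce the same coefficients:

```r
# Illustrative counts and a single predictor
df <- data.frame(
  numerator = c(30, 45, 12, 60),
  remainder = c(70, 55, 88, 40),   # denominator minus numerator
  x1        = c(0.2, 1.4, -0.7, 0.9)
)
df$value <- df$numerator / (df$numerator + df$remainder)  # observed proportion

# Outcome as a two-column matrix of successes/failures
fit_matrix <- glm(cbind(numerator, remainder) ~ x1, family = binomial, data = df)

# Outcome as a proportion, with the denominator supplied as case weights
fit_weights <- glm(value ~ x1, family = binomial,
                   weights = numerator + remainder, data = df)

all.equal(coef(fit_matrix), coef(fit_weights))  # TRUE
```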

glmnet observations

After updating the modelling code to use glmnet along with hyperparameter tuning for the penalty and mixture variables, I can see that, when testing on 2019 data (ie, avoiding the pandemic period), the performance of the model improves enormously. For predicting 2022 data, the performance is best when only using 2 years of data (eg, 2020 and 2021), but then deteriorates if data prior to the pandemic is also used.

image

Removing pandemic years from training data

When the years 2020 and 2021 are removed from the training data to predict 2022 data, the performance of the model deteriorates:

image

Increasing training/validation ratio and removing zero variance variables

The ratio of training records to validation records has, up to now, depended on the number of training years. It has been calculated as follows:

$$\frac{n_{t_{x-1}}}{n_{t_1 \ldots t_{x-2}}}$$

where n = number of records, t = year, and x = the year that the models are predicting (ie, the test data). For example, when using 2020 and 2021 data to predict 2022 data, there are two years of data in the training set, and these have been split 50:50. When there are more years in the training set, the ratio of training to validation increases.

Here I have fixed the ratio of training to validation records at 75:25.

In addition, the predictor variables that have zero variance are now manually removed before the normalisation step occurs (both changes are sketched below).
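A sketch of both changes, assuming a tidymodels setup with placeholder object and column names (model_data and an outcome called value):

```r
library(rsample)
library(recipes)

# Fixed 75:25 split of the non-test data into training and validation sets
split    <- initial_split(model_data, prop = 0.75)
train_df <- training(split)
val_df   <- testing(split)

# Drop zero-variance predictors before centring and scaling
rec <- recipe(value ~ ., data = train_df) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
```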

Together, these steps significantly improve performance when predicting 2022 data:

image

Cross validation

Moving to cross validation for hyperparameter tuning should make the training process more robust, especially because I have a small training dataset and can stratify the folds by the outcome variable. It doesn't seem to improve much on the previous performance though:

image
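For reference, the stratified resampling can be set up along these lines (placeholder names); the folds are then passed to the tuning function in place of a single validation split:

```r
library(rsample)

# Five-fold cross validation, stratified by the outcome so each fold has a
# similar distribution of the target variable
folds <- vfold_cv(train_df, v = 5, strata = value)
```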

sebsfox commented 5 months ago

warnings

I receive two warnings when running the glmnet model:

glmnet.fit: algorithm did not converge

λ is the coefficient applied to the regularisation term in the glmnet objective. When tuning the mixture and penalty hyperparameters, glmnet fits models across a sequence of λ values between two limits, and model convergence is assessed along the way. The glmnet function has an argument called nlambda, which defaults to 100 and controls the maximum number of λ values on this path. If the number is too small, the steps between each λ will be bigger, which can make convergence harder to achieve. This commit (2219f933c65247fb5d1cc4fd56d449b00f492710) increases the value of nlambda, which results in more models converging and the warning appearing less often. I suppose the risk is that models could be deemed to converge earlier than they otherwise would. When changing the value of nlambda from 100 to 600, the runtime increased and more models converged, but the performance didn't improve, so there seems to be little benefit in doing this.
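For context, glmnet-specific arguments such as nlambda can be passed through set_engine() while penalty and mixture are left to be tuned. A sketch, assuming a parsnip specification of this general shape (the model function and value shown are illustrative):

```r
library(parsnip)
library(tune)  # for the tune() placeholders

# penalty and mixture are tuned; nlambda is handed straight to glmnet()
spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet", nlambda = 600)
```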

A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned.

This occurs when the model predicts the same value for every record. When incorporating lasso regression, this is likely to happen when all of the predictor variables have been removed by the penalisation process, leaving just the intercept. My question now is: why does this occur so often, and how do I know whether the final model is ok?

In order to answer this I am describing the current situation so I can clarify my thoughts:

I have worked through the detail of this and can see where the warning occurs. When the models are fitted along the path of λ values, once λ is suitably high the estimate (coefficient) of every predictor variable is 0 apart from the (Intercept) term. As a result, at these high λ values an R2 can't be calculated and the warning is shown.
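One way to see this directly (a sketch; glmnet_fit is a placeholder for the underlying engine object, eg from workflows::extract_fit_engine()):

```r
library(broom)

# One row per non-zero term per step along the lambda path
path <- tidy(glmnet_fit)

# At the largest lambda values only the intercept remains, so every
# prediction is identical and the correlation warning is triggered
subset(path, lambda == max(lambda))
```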

In this commit (fd615c8b8108a29ead6d34f3b306405e387ac48f) I have removed R2 as an evaluation metric option. This has been done because it doesn't give well-shaped predictions when compared with the observed data (see below). As a result, the warning has gone.

sebsfox commented 5 months ago

Not splitting data sequentially

Up until now, I have ordered the observations and separated out the final year of data for testing. In theory, this was done with practical application in mind: if we could build a model on the years prior to the year we want to predict, and the model performed well, it would give users confidence in the model and encourage them to use it. In practice, though, if I test on 2022 using a model built from 2020 and 2021 data, we couldn't implement that 2020/2021 model for 2023 because we'd be missing a whole year of information. Mixing all of the years instead uses all of the latest information to build the model, making it as relevant for application as possible. It hugely improves performance and shows that the more information you feed into the model, the better it performs:

image

sebsfox commented 5 months ago

Questioning the high performance of the model

The current performance of the models on the train, validation and test datasets is very good (the best performing models have an R2 of 0.95, which is a little unbelievable).

Ensure I understand the effect of ridge and lasso regression

This commit adds a script which demonstrates strong modelling performance, along with setting the mixture argument of the glmnet function to 0 (pure ridge regression) and 1 (pure lasso regression) to show that each performs as expected (ridge regression retains every variable, while lasso regression removes variables).
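A sketch of that kind of check (placeholder data and penalty value, shown with a linear specification for simplicity; the same mixture behaviour applies to the logistic case):

```r
library(parsnip)
library(broom)

ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) |> set_engine("glmnet")
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) |> set_engine("glmnet")

ridge_fit <- fit(ridge_spec, value ~ ., data = train_df)
lasso_fit <- fit(lasso_spec, value ~ ., data = train_df)

tidy(ridge_fit)  # every predictor retained, coefficients shrunk towards zero
tidy(lasso_fit)  # many coefficients driven to exactly zero
```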

Are there identical observations in train/val/test

If there are identical observations, it is easy for the model to perform well. A way to test this is to perform leave-p-out cross validation, where a whole group is left out in each fold of the cross-validation process. This commit implements this change, which has the effect of slightly lowering the performance of the model, but the performance is still very good:

image
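This maps onto leave-one-group-out resampling; a sketch, assuming a region column identifies the group:

```r
library(rsample)

# One fold per region: each fold holds out all observations from one region
folds <- group_vfold_cv(train_df, group = region)
```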

Variation in validation set

It is useful to understand the variation in performance across the different folds of the validation set. The chart below has a point plotted for each fold (eg, one fold is where one region is omitted from training and used for validation). The points displayed are for the hyperparameters optimised by the average over all of the folds:

image

Understanding the variation in the target variable

The target variable has a strange distribution with two humps. These are most likely COVID related, but they need further investigation to make sure the values are being calculated properly:

image
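A quick way to check whether the humps line up with the pandemic years (a sketch; the data frame and the value/year columns are placeholder names):

```r
library(ggplot2)

# Histogram of the outcome, split by year, to see which years drive each hump
ggplot(model_data, aes(x = value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ year)
```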

Error metric

I'm currently using R2 as the metric to tune the model on, but it gives an unusual shape when plotting observed versus predicted:

image

Changing the error metric to RMSE, we get a better overall picture of observed versus predicted:

image

Need to investigate other metrics.
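Switching the tuning metric is a matter of passing an explicit metric set to the tuning function; a sketch with placeholder objects (wf, folds):

```r
library(tune)
library(yardstick)

res <- tune_grid(
  wf,                           # workflow containing the recipe and glmnet spec
  resamples = folds,
  grid      = 20,
  metrics   = metric_set(rmse)  # tune on RMSE rather than R-squared
)

select_best(res, metric = "rmse")
```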

COVID variables

Is there something wrong with including the COVID variables? For many of the years their value is 0 for all ICSs (though this is then centred and scaled). Will this have a strange effect on the modelling? At the moment, they don't come out particularly high when looking at variable importance.

Dataset dimensions

Is there anything wrong with having a dataset that is wide relative to its length? At the moment, there are 378 observations and 112 variables in the full dataset (train + val + test). When the function performs the modelling, the training set will be a subset of those observations. If we include lagged years, the number of variables will (almost) repeat for each lagged year. Consider the curse of dimensionality.

When using an elastic net regularisation process with a generalised linear model, this doesn't seem like an issue, as the regularisation removes variables as part of the hyperparameter tuning process.

FTE variables

Model performance increased significantly when I split the FTE variables from Staff Group alone into Staff Group by setting (eg, acute, community trust and MH). Lots of the Community Trust and MH variables are 0 for all time periods for a large number of ICSs.

Before I remove the variables, this is what the performance chart looks like (for 2 lagged years only):

image

Here is the performance chart after removing all of the Community Trust FTEs:

image

It looks like these variables have no impact on the prediction. They are most likely removed from the modelling during the regularisation process.

Variable importance

I should look into the most important variables further. Perhaps there is something incorrect in their data that is making them very predictive.
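Because the predictors are normalised, one simple check is to rank the glmnet coefficients by absolute size (a sketch; final_wf_fit is a placeholder for the fitted workflow):

```r
library(workflows)
library(broom)
library(dplyr)

final_wf_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(term != "(Intercept)") |>
  arrange(desc(abs(estimate)))
```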

sebsfox commented 5 months ago

What variables are used in the final model?

This article is really useful for understanding how glmnet is applied through {parsnip}. Two values are tuned in the hyperparameter tuning process: penalty (aka lambda, the coefficient applied to the regularisation term) and mixture (the proportion of lasso versus ridge penalty).

The hyperparameter tuning process selects the two parameters above that produce the best performance on the validation dataset. However, despite penalty (lambda) having an optimal value, when fitting the best model to the final training dataset, glmnet() still fits along a "lambda path" of nlambda values whose range contains the optimal penalty. The variables used in the final fit are those whose coefficients remain non-zero at the point on the lambda path nearest the optimal penalty.

The following code helps understand the last_fit object more:
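As a hedged sketch of that kind of inspection (last_fit_res and best_penalty are assumed object names, not necessarily those used in the repo):

```r
library(tune)
library(workflows)

# Fitted workflow stored inside the last_fit() result
final_wf <- extract_workflow(last_fit_res)

# Underlying glmnet engine object
glmnet_obj <- extract_fit_engine(final_wf)

# Coefficients at the tuned penalty: terms shown as zero (.) have been
# removed by the lasso component of the elastic net
coef(glmnet_obj, s = best_penalty)
```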