nhs-bnssg-analytics / d_and_c

Scoping the possibility of predicting performance from demand and capacity metrics

Logistic regression understanding #24

Open sebsfox opened 5 months ago

sebsfox commented 5 months ago

glmnet

When moving the logistic regression code over to glmnet and then applying the tune() function to select the penalty and mixture hyperparameters, I started to get errors when fitting the model. This looks to be because the outcome variable is a two-column matrix, c(remainder, numerator). I have narrowed this down to an issue with glmnet rather than tune().

Testing outcome = c(remainder, numerator) compared with outcome = value

Using the same predictors, I have worked through fitting a model to predict both value and c(remainder, numerator) (both using tidymodels and raw glm). Fitting a model to predict value produces different coefficients to a model predicting c(remainder, numerator), though the different modelling approaches are consistent with each other (eg, tidymodels predicting value gives the same model as glm predicting value).

By using case weights, as shown in this file, I've managed to make the two approaches consistent. This has been implemented in the main modelling code in this commit: 861fbdeb645d68a95f12ab7d591225a6c10bf43c
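As a minimal illustration of that equivalence (made-up data and column names, not the project's real variables), a binomial glm with the two-column outcome and a binomial glm on the proportion with the denominator as case weights produce the same coefficients:

```r
# Illustrative counts and a single predictor
df <- data.frame(
  numerator = c(30, 45, 12, 60),
  remainder = c(70, 55, 88, 40),   # denominator minus numerator
  x1        = c(0.2, 1.4, -0.7, 0.9)
)
df$value <- df$numerator / (df$numerator + df$remainder)  # observed proportion

# Outcome as a two-column matrix of successes/failures
fit_matrix <- glm(cbind(numerator, remainder) ~ x1, family = binomial, data = df)

# Outcome as a proportion, with the denominator supplied as case weights
fit_weights <- glm(value ~ x1, family = binomial,
                   weights = numerator + remainder, data = df)

all.equal(coef(fit_matrix), coef(fit_weights))  # TRUE
```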

glmnet observations

After updating the modelling code to use glmnet along with hyperparameter tuning for the penalty and mixture variables, I can see that, when testing on 2019 data (ie, avoiding the pandemic period), the performance of the model improves enormously. For predicting 2022 data, the performance is best when only using 2 years of data (eg, 2020 and 2021), but then deteriorates if data prior to the pandemic is also used.

image

Removing pandemic years from training data

When the years 2020 and 2021 are removed from the training data to predict 2022 data, the performance of the model deteriorates:

image

Increasing training/validation ratio and removing zero variance variables

The ratio of training records to validation records has, up to now, depended on the number of training years. It has been calculated as follows:

$$\frac{n_{t_{x-1}}}{n_{t_1 \ldots t_{x-2}}}$$

where n = number of records, t = year, and x = the year that the models are predicting (ie, the test data). For example, when using 2020 and 2021 data to predict 2022 data, there are two years of data in the training set, and these have been split 50:50. When there are more years in the training set, the ratio of training to validation increases.

Here I have fixed the ratio of training to validation records at 75:25.

In addition, the predictor variables that have zero variance are now manually removed before the normalisation step occurs (both changes are sketched below).
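A sketch of both changes, assuming a tidymodels setup with placeholder object and column names (model_data and an outcome called value):

```r
library(rsample)
library(recipes)

# Fixed 75:25 split of the non-test data into training and validation sets
split    <- initial_split(model_data, prop = 0.75)
train_df <- training(split)
val_df   <- testing(split)

# Drop zero-variance predictors before centring and scaling
rec <- recipe(value ~ ., data = train_df) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())
```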

Together, these steps significantly improve performance when predicting 2022 data:

image

Cross validation

Moving to cross validation for hyperparameter tuning should make the training process more robust, especially because I have a small training dataset and can stratify the folds by the outcome variable. It doesn't seem to improve much on the previous performance though:

image
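For reference, the stratified resampling can be set up along these lines (placeholder names); the folds are then passed to the tuning function in place of a single validation split:

```r
library(rsample)

# Five-fold cross validation, stratified by the outcome so each fold has a
# similar distribution of the target variable
folds <- vfold_cv(train_df, v = 5, strata = value)
```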

sebsfox commented 5 months ago

warnings

I receive two warnings when running the glmnet model:

glmnet.fit: algorithm did not converge

λ is the coefficient applied to the regularisation term in the glmnet objective. When tuning the mixture and penalty hyperparameters, glmnet fits models across a sequence of λ values between two limits, and model convergence is assessed along the way. The glmnet function has an argument called nlambda, which defaults to 100 and controls the maximum number of λ values on this path. If the number is too small, the steps between each λ will be bigger, which can make convergence harder to achieve. This commit (2219f933c65247fb5d1cc4fd56d449b00f492710) increases the value of nlambda, which results in more models converging and the warning appearing less often. I suppose the risk is that models could be deemed to converge earlier than they otherwise would. When changing the value of nlambda from 100 to 600, the runtime increased and more models converged, but the performance didn't improve, so there seems to be little benefit in doing this.
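For context, glmnet-specific arguments such as nlambda can be passed through set_engine() while penalty and mixture are left to be tuned. A sketch, assuming a parsnip specification of this general shape (the model function and value shown are illustrative):

```r
library(parsnip)
library(tune)  # for the tune() placeholders

# penalty and mixture are tuned; nlambda is handed straight to glmnet()
spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet", nlambda = 600)
```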

A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 error. NA will be returned.

This occurs when the model predicts the same value for every record. When incorporating lasso regression, this is likely to happen when all of the predictor variables have been removed by the penalisation process, leaving just the intercept. My question now is: why does this occur so often, and how do I know whether the final model is ok?

In order to answer this I am describing the current situation so I can clarify my thoughts:

I have worked through the detail of this and can see where the warning occurs. When the models are fitted along the path of λ values, once λ is suitably high the estimate (coefficient) of every predictor variable is 0 apart from the (Intercept) term. As a result, at these high λ values an R2 can't be calculated and the warning is shown.
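One way to see this directly (a sketch; glmnet_fit is a placeholder for the underlying engine object, eg from workflows::extract_fit_engine()):

```r
library(broom)

# One row per non-zero term per step along the lambda path
path <- tidy(glmnet_fit)

# At the largest lambda values only the intercept remains, so every
# prediction is identical and the correlation warning is triggered
subset(path, lambda == max(lambda))
```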

In this commit (fd615c8b8108a29ead6d34f3b306405e387ac48f) I have removed R2 as an evaluation metric option. This has been done because it doesn't give well-shaped predictions when compared with the observed data (see below). As a result, the warning has gone.

sebsfox commented 5 months ago

Not splitting data sequentially

Up until now, I have ordered the observations and separated out the final year of data for testing. In theory, this was done with practical application in mind: if we could build a model on the years prior to the year we want to predict, and the model performed well, it would give users confidence in the model and encourage them to use it. In practice, though, if I test on 2022 using a model built from 2020 and 2021 data, we couldn't implement that 2020/2021 model for 2023 because we'd be missing a whole year of information. Mixing all of the years instead uses all of the latest information to build the model, making it as relevant for application as possible. It hugely improves performance and shows that the more information you feed into the model, the better it performs:

image

sebsfox commented 5 months ago

Questioning the high performance of the model

The current performance of the models on the train, validation and test datasets is very good (the best performing models have an R2 of 0.95, which is a little unbelievable).

Ensure I understand the effect of ridge and lasso regression

This commit adds a script which demonstrates strong modelling performance, along with setting the mixture argument of the glmnet function to 0 (pure ridge regression) and 1 (pure lasso regression) to show that each performs as expected (ridge regression retains every variable, while lasso regression removes variables).
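A sketch of that kind of check (placeholder data and penalty value, shown with a linear specification for simplicity; the same mixture behaviour applies to the logistic case):

```r
library(parsnip)
library(broom)

ridge_spec <- linear_reg(penalty = 0.1, mixture = 0) |> set_engine("glmnet")
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) |> set_engine("glmnet")

ridge_fit <- fit(ridge_spec, value ~ ., data = train_df)
lasso_fit <- fit(lasso_spec, value ~ ., data = train_df)

tidy(ridge_fit)  # every predictor retained, coefficients shrunk towards zero
tidy(lasso_fit)  # many coefficients driven to exactly zero
```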

Are there identical observations in train/val/test

If there are identical observations, it is easy for the model to perform well. A way to test this is to perform leave-p-out cross validation, where a whole group is left out in each fold of the cross-validation process. This commit implements this change, which has the effect of slightly lowering the performance of the model, but the performance is still very good:

image
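This maps onto leave-one-group-out resampling; a sketch, assuming a region column identifies the group:

```r
library(rsample)

# One fold per region: each fold holds out all observations from one region
folds <- group_vfold_cv(train_df, group = region)
```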

Variation in validation set

It is useful to understand the variation in performance across the different folds of the validation set. The chart below has a point plotted for each fold (eg, one fold is where one region is omitted from training and used for validation). The points displayed are for the hyperparameters optimised by the average over all of the folds:

image

Understanding the variation in the target variable

The target variable has a strange distribution with two humps. These are most likely COVID related, but they need further investigation to make sure the values are being calculated properly:

image
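A quick way to check whether the humps line up with the pandemic years (a sketch; the data frame and the value/year columns are placeholder names):

```r
library(ggplot2)

# Histogram of the outcome, split by year, to see which years drive each hump
ggplot(model_data, aes(x = value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ year)
```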

Error metric

I'm currently using R2 as the metric to tune the model on, but it gives an unusual shape when plotting observed versus predicted:

image

Changing the error metric to RMSE, we get a better overall picture of observed versus predicted:

image

Need to investigate other metrics.
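Switching the tuning metric is a matter of passing an explicit metric set to the tuning function; a sketch with placeholder objects (wf, folds):

```r
library(tune)
library(yardstick)

res <- tune_grid(
  wf,                           # workflow containing the recipe and glmnet spec
  resamples = folds,
  grid      = 20,
  metrics   = metric_set(rmse)  # tune on RMSE rather than R-squared
)

select_best(res, metric = "rmse")
```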

COVID variables

Is there something wrong with including the COVID variables? For many of the years their value is 0 for all ICSs (though this is then centred and scaled). Will this have a strange effect on the modelling? At the moment, they don't come out particularly high when looking at variable importance.

Dataset dimensions

Is there anything wrong with having a dataset that is wide relative to its length? At the moment, there are 378 observations and 112 variables in the full dataset (train + val + test). When the function performs the modelling, the training set will be a subset of those observations. If we include lagged years, the number of variables will (almost) repeat for each lagged year. Consider the curse of dimensionality.

When using an elastic net regularisation process with a generalised linear model, this doesn't seem like an issue, as the regularisation removes variables as part of the hyperparameter tuning process.

FTE variables

Model performance increased significantly when I split the FTE variables from Staff Group alone into Staff Group by setting (eg, acute, community trust and MH). Lots of the Community Trust and MH variables are 0 for all time periods for a large number of ICSs.

Before I remove the variables, this is what the performance chart looks like (for 2 lagged years only):

image

Here is the performance chart after removing all of the Community Trust FTEs:

image

It looks like these variables have no impact on the prediction. They are most likely removed from the modelling during the regularisation process.

Variable importance

I should look into the most important variables further. Perhaps there is something incorrect in their data that is making them very predictive.
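Because the predictors are normalised, one simple check is to rank the glmnet coefficients by absolute size (a sketch; final_wf_fit is a placeholder for the fitted workflow):

```r
library(workflows)
library(broom)
library(dplyr)

final_wf_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  filter(term != "(Intercept)") |>
  arrange(desc(abs(estimate)))
```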

sebsfox commented 5 months ago

What variables are used in the final model?

This article is really useful for understanding how glmnet is applied through {parsnip}. Two values are tuned in the hyperparameter tuning process: penalty (aka lambda, the coefficient applied to the regularisation term) and mixture (the proportion of lasso versus ridge penalty).

The hyperparameter tuning process selects the two parameters above that produce the best performance on the validation dataset. However, despite penalty (lambda) having an optimal value, when fitting the best model to the final training dataset, glmnet() still fits along a "lambda path" of nlambda values whose range contains the optimal penalty. The variables used in the final fit are those whose coefficients remain non-zero at the point on the lambda path nearest the optimal penalty.

The following code helps understand the last_fit object more:
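As a hedged sketch of that kind of inspection (last_fit_res and best_penalty are assumed object names, not necessarily those used in the repo):

```r
library(tune)
library(workflows)

# Fitted workflow stored inside the last_fit() result
final_wf <- extract_workflow(last_fit_res)

# Underlying glmnet engine object
glmnet_obj <- extract_fit_engine(final_wf)

# Coefficients at the tuned penalty: terms shown as zero (.) have been
# removed by the lasso component of the elastic net
coef(glmnet_obj, s = best_penalty)
```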