sebsfox opened 5 months ago
## glmnet

When moving the logistic regression code over to `glmnet` and then applying the `tune()` function to select the `penalty` and `mixture` hyperparameters, I start to get errors when fitting the model. This looks to be because the outcome variable is a two-column matrix, `c(remainder, numerator)`. I have narrowed this down to an issue with `glmnet` rather than `tune()`.
### `outcome = c(remainder, numerator)` compared with `outcome = value`
Using the same predictors, I have worked through fitting a model to predict both `val` and `c(remainder, numerator)` (both using tidymodels and raw `glm`). Fitting a model to predict `val` produces a model with different coefficients from a model predicting `c(remainder, numerator)`, though the different modelling approaches are consistent with each other (e.g., tidymodels to predict `val` gives the same model as the `glm` model to predict `val`).
By using case weights, as shown in this file, I've managed to make the two approaches consistent. This has been implemented in the main modelling code in this commit: 861fbdeb645d68a95f12ab7d591225a6c10bf43c
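The equivalence can be sketched with base R's `glm()`: a binomial model with a two-column count outcome gives the same coefficients as a model on the proportion with case weights equal to the total counts (toy data; treating `numerator` as the success count is an assumption):

```r
set.seed(1)

# Toy data: `numerator` successes out of 10 trials, `remainder` failures
df <- data.frame(x = rnorm(20), numerator = rbinom(20, 10, 0.4))
df$remainder <- 10 - df$numerator

# Two-column matrix outcome (the form that causes trouble via tune())
fit_matrix <- glm(cbind(numerator, remainder) ~ x, family = binomial, data = df)

# Equivalent proportion outcome with case weights (a form glmnet can accept)
df$value <- df$numerator / (df$numerator + df$remainder)
fit_weighted <- glm(value ~ x, family = binomial,
                    weights = df$numerator + df$remainder, data = df)

all.equal(coef(fit_matrix), coef(fit_weighted))  # TRUE
```

The two formulations maximise the same binomial likelihood, which is why the coefficients match exactly.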
## glmnet observations

After updating the modelling code to use `glmnet` along with hyperparameter tuning for the `penalty` and `mixture` variables, I can see when testing on 2019 data (i.e., avoiding the pandemic period) that the performance of the model improves enormously. For predicting 2022 data, the performance is best when using only two years of data (i.e., 2020 and 2021), but deteriorates if data prior to the pandemic is used.
When the years 2020 and 2021 are removed from the training data to predict 2022 data, the performance of the model deteriorates:
The number of training records compared with the validation records, up to now, has been dependent on the length of the training years. It has been calculated as follows:
$$\frac{n_{t_{x-1}}}{n_{t_1 \ldots t_{x-2}}}$$
where n = number of records, t = year, x = year that models are predicting (eg, test data). For example, when using 2020 and 2021 data to predict 2022 data, there are two years of data in the training set, and these have been split 50:50. When there are more years in the training set, the ratio of training to validation increases.
Here I have fixed the ratio of the count of training to validation records to 75:25.
In addition, the predictor variables that have zero variance are manually removed before the normalisation step occurs.
These steps significantly improve performance when predicting 2022 data:
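As a sketch, the zero-variance removal before normalisation can be expressed as a {recipes} step order (toy data; names hypothetical):

```r
library(recipes)

set.seed(1)
# Toy data standing in for the real dataset: `zv` has zero variance
model_df <- data.frame(value = rnorm(8), x1 = rnorm(8), zv = 1)

rec <- recipe(value ~ ., data = model_df) |>
  step_zv(all_predictors()) |>              # drop zero-variance predictors first...
  step_normalize(all_numeric_predictors())  # ...then centre and scale the rest

baked <- bake(prep(rec), new_data = NULL)
"zv" %in% names(baked)  # FALSE: the zero-variance column is gone
```

Ordering matters: normalising a zero-variance column would divide by a standard deviation of 0.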
Moving to cross-validation as a method of improving hyperparameter tuning should make the training process more robust, especially because I have a small training dataset and I can stratify the folds by the outcome variable. It doesn't seem to improve much on the previous performance, though:
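A minimal sketch of the stratified cross-validation setup with {rsample}, assuming a training frame `train_df` with outcome `value` (names hypothetical):

```r
library(rsample)

set.seed(1)
train_df <- data.frame(value = rnorm(40), x1 = rnorm(40))

# Stratify the folds by the (binned) outcome variable
folds <- vfold_cv(train_df, v = 5, strata = value)
nrow(folds)  # 5 resamples
```

For a numeric outcome, `vfold_cv()` bins `strata` into quantiles before stratifying.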
I receive two warnings when running the glmnet model:
The value λ (lambda) is the penalty strength in the glmnet equation. When tuning the `mixture` and `penalty` hyperparameters, this value is varied between two bounds and model convergence is assessed. The `glmnet` function has an argument called `nlambda`, which defaults to 100. This controls the maximum number of λ values over which to assess model convergence. If the number is too small, the steps between each λ will be bigger, which could make convergence difficult to achieve. This commit (2219f933c65247fb5d1cc4fd56d449b00f492710) increases the value of `nlambda`, which results in more models converging and the warning appearing less often. I suppose the risk of this could be that models are deemed to converge earlier than they otherwise would. When changing the value of `nlambda` from 100 to 600, the runtime increased and more models converged, but the performance didn't improve. Therefore there seems to be little benefit in doing this.
> `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.

This occurs when the model predicts the same value for every record. When incorporating lasso regression, this is likely to happen when all of the predictor variables have been removed by the penalisation process, leaving just the intercept. My question now is: why does this occur so often, and how do I know whether the final model is OK?
In order to answer this I am describing the current situation so I can clarify my thoughts:
The `glmnet` algorithm generates a "path" of λ values along which it fits models. The path can be auto-generated by the function itself, or it can be supplied. Additionally, the user can specify the number of λ values that comprise the path (`nlambda`, which defaults to 100). This means that, when `nlambda` is 100, there will be `100 * nrow(tuning_grid)` models fitted to each fold. I have worked through the detail of this and can see where the warning occurs. When the models are being fitted along the path of λ values, once λ is suitably high, the "estimate" (coefficient) of every predictor variable is 0 other than the "(Intercept)" term. As a result, at these high λ values, an R2 can't be calculated and the warning is shown.
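The behaviour at the top of the lambda path is easy to reproduce directly with `glmnet` (toy data):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- rnorm(100)

# Lasso with a short, auto-generated lambda path
fit <- glmnet(x, y, alpha = 1, nlambda = 20)

length(fit$lambda)              # lambdas actually fitted (can be < nlambda)
coef(fit, s = max(fit$lambda))  # at the largest lambda, only "(Intercept)" is non-zero
```

The largest lambda on the auto-generated path is, by construction, the smallest value at which every coefficient is shrunk to exactly zero, so an intercept-only model (constant predictions) is guaranteed there.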
In this commit (fd615c8b8108a29ead6d34f3b306405e387ac48f) I have removed the option of R2 as an evaluation metric. This has been done because it doesn't give well shaped predictions when compared with the observed data (see below). As a result, the warning has gone.
Up until now, I have ordered the observations and separated the final year of data for testing. In theory, this was done with practical application in mind. I thought that if we could build a model on the years prior to the year that we want to use the model for, and the model performed well, then it would give users confidence in the model and it would encourage them to use it more. In practice, say I am testing data on 2022 based on data from 2020 and 2021, then we couldn't implement that 2020/2021 model for 2023 because we'd be missing a whole year of information. By mixing all of the years, this uses all of the latest information to build a model to make it as relevant for application as possible. It hugely improves performance and shows that the more information you feed into the model the better it performs:
The current performance of the models on the train, validation and test dataset is very good (best performing models have an R2 of 0.95, which is a little unbelievable).
This commit adds a script which demonstrates strong modelling performance, along with tweaking the `mixture` argument for the glmnet function to 0 (pure ridge regression) and 1 (pure lasso regression) to show that each of them performs as expected (ridge regression retains every variable, while lasso regression removes variables).
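The ridge-versus-lasso check can be sketched directly with `glmnet`'s `alpha` argument (which {parsnip}'s `mixture` maps onto), using toy data where only one predictor carries signal:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- x[, 1] + rnorm(100)  # only the first predictor matters

ridge <- glmnet(x, y, alpha = 0)  # mixture = 0: pure ridge
lasso <- glmnet(x, y, alpha = 1)  # mixture = 1: pure lasso

# At a moderately strong penalty, ridge shrinks but keeps every coefficient,
# while lasso has set some coefficients exactly to zero
sum(as.matrix(coef(ridge, s = 0.5))[-1] != 0)  # all 10 retained
sum(as.matrix(coef(lasso, s = 0.5))[-1] != 0)  # fewer than 10 remain
```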
If there are identical observations, it is easy for the model to perform well. A way to test this is to perform "leave-p-out" cross-validation, where a whole group is left out in each fold of the cross-validation process. This commit implements this change, which slightly lowers the performance of the model, but the performance is still very good:
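rsample's `group_vfold_cv()` gives this leave-one-group-out resampling (toy data; the `region` grouping variable is an assumption):

```r
library(rsample)

set.seed(1)
# Toy frame: 5 regions, 8 observations each
df <- data.frame(region = rep(letters[1:5], each = 8),
                 value  = rnorm(40),
                 x1     = rnorm(40))

# One fold per region: that region is held out entirely for validation
folds <- group_vfold_cv(df, group = region)
nrow(folds)  # 5 folds
```

Because all rows from a region land on the same side of each split, near-duplicate observations within a region can no longer leak between training and validation.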
It is useful to understand the variation in performance across the different folds of the validation set. The chart below has a point plotted for each fold (e.g., one fold is where one region is omitted from training and used for validation). The points displayed are those where the hyperparameters are optimised by the average over all of the folds:
The target variable has a strange distribution. The distribution has two humps, which are most likely COVID related, but this needs further investigation to make sure the values are being calculated properly:
I'm currently using R2 as the metric to tune the model on, but it gives an unusual shape when plotting observed versus predicted:
Changing the error metric to RMSE, we get a better overall picture of observed versus predicted:
Need to investigate other metrics.
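One way to compare candidate metrics in a single tuning run is a {yardstick} metric set (a sketch; the particular metrics chosen here are an assumption):

```r
library(yardstick)

# Bundle several candidate metrics; all are computed on each resample
metrics <- metric_set(rmse, mae, huber_loss)

# Pass to tune::tune_grid(..., metrics = metrics), then pick a winner with
# tune::select_best(tuning_results, metric = "rmse")
```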
Is there something wrong with including the COVID variables? For many of the years their value is 0 for all ICSs (though this is centred and scaled). Will this have a strange effect on the modelling? At the moment, they don't come out particularly high when looking at variable importance.
Is there anything wrong with having a dataset that is wider than it is long? At the moment, there are 378 observations and 112 variables in the full dataset (train + val + test). When the function performs the modelling, the training set will be a subset of those observations. If we include lagged years, the number of variables will (almost) repeat for each year that is lagged. Consider the Curse of Dimensionality.
When using an elastic net regularisation process with a generalised linear model, this doesn't seem like an issue, as the process removes variables as part of the hyperparameter tuning process.
Model performance increased significantly when I split up the FTE variables from just Staff Group into Staff Group by setting (e.g., acute, community trust and MH). Lots of the Community Trust and MH variables are 0 for all time periods for a large number of ICSs.
Before I remove the variables, this is what the performance chart looks like (for 2 lagged years only):
Here is the performance chart after removing all of the Community Trust FTEs:
It looks like these variables have no impact on the prediction. They are most likely removed from the modelling during the regularisation process.
Should look into the most important variables further. Perhaps there is something incorrect in their data that is making them very predictive.
This article is really useful for understanding how `glmnet` is applied through {parsnip}. There are two values tuned in the hyperparameter tuning process: `penalty` (aka lambda, the coefficient applied to penalise the error terms) and `mixture` (the extent of either lasso or ridge regression).

The hyperparameter tuning process selects the two parameters above that produce the best performance on the validation dataset. However, despite `penalty` (lambda) having an optimal value, when fitting the best model to the final training dataset the `glmnet()` process still follows a "lambda path" of `nlambda` values, whose range will contain the optimal `penalty` (lambda). The parameters used in the final fit will be the ones that remain when the lambda path is nearest the optimal `penalty`.
The following code helps understand the `last_fit` object more:

- `ft$.workflow[[1]]$fit$fit$fit$lambda` (note, `ft$.workflow[[1]]$fit$fit$fit` shows the degrees of freedom at each lambda value, which is equivalent to the number of variables remaining in the model)
- `ft$.workflow[[1]]$fit$fit$spec$args$penalty` — the value used by `glmnet` to penalise variables that have a detrimental effect on the modelling
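A minimal end-to-end sketch showing where those slots come from (toy data; all object and column names are hypothetical):

```r
library(tidymodels)

set.seed(1)
df <- data.frame(value = rnorm(60), x1 = rnorm(60), x2 = rnorm(60))
split <- initial_split(df)

wf <- workflow() |>
  add_model(linear_reg(penalty = 0.1, mixture = 1) |> set_engine("glmnet")) |>
  add_formula(value ~ .)

# Fit on the training portion, evaluate on the held-out test portion
ft <- last_fit(wf, split)

ft$.workflow[[1]]$fit$fit$fit$lambda         # lambda path followed by the final fit
ft$.workflow[[1]]$fit$fit$spec$args$penalty  # the penalty carried in the model spec
```

The triple-nested `$fit$fit$fit` walks from the workflow, through the {parsnip} model fit, down to the raw `glmnet` object.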