Open paroussisc opened 6 years ago
The first set of analyses uses a simulated sample of 50 data points; we will look at the effects of changing the sample size later.
The Linear Model has the following coefficients:
```
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1   0.8930     0.7101   1.258    0.215
x2  -0.7311     0.7218  -1.013    0.316
x3   0.4187     0.7541   0.555    0.581
x4  -0.3732     0.4568  -0.817    0.418
```
None of these coefficients comes out as significant, due to the correlation between the predictors. The MSE of this model is 0.95.
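The simulation code isn't reproduced in this issue, so as a minimal sketch of a comparable setup: the correlation structure, true coefficients, and noise level below are all assumptions for illustration, not the actual values used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4

# Assumed design: four Gaussian predictors with pairwise correlation 0.8.
cov = 0.8 + 0.2 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Assumed true model: only x1 and x2 actually enter.
beta_true = np.array([1.0, -1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Ordinary least squares (no intercept, matching the table above).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((y - X @ beta_ols) ** 2)
```

With strongly correlated predictors the OLS standard errors are inflated, which is why none of the individual coefficients reaches significance even though x1 and x2 drive the response.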
Ridge regression for differing values of lambda gives the following plot:
x1 and x2 have the largest coefficients here, so the model is correctly picking up on the main predictors, but we could do with knocking out x3 and x4 completely. The value of lambda that gives the lowest cross-validation score on the training data is 0.52, and this gives an MSE of 1.04.
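The ridge fit and the cross-validation over lambda can be sketched as follows (a plain numpy version, assuming the closed-form ridge solution and simple k-fold CV; the original analysis may have used a package such as glmnet instead):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lam * I)^(-1) X'y.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    # Plain k-fold cross-validation MSE for a given lambda.
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errs = []
    for held in folds:
        train = np.setdiff1d(np.arange(n), held)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[held] - X[held] @ beta) ** 2))
    return np.mean(errs)
```

Selecting lambda is then just an argmin over a grid, e.g. `lambdas[np.argmin([cv_mse(X, y, l) for l in lambdas])]`.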
Using the Lasso gives the following plot for different values of lambda:
This correctly removes the x3 and x4 coefficients first, which is useful for inference on our data. The value of lambda that gives the lowest CV score is 0.08, and it gives an MSE of 1.04.
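The exact zeros come from the soft-thresholding step in the lasso solver. A minimal coordinate-descent sketch (not the actual implementation used here) shows where they arise:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution removed.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding: coefficients with |rho| <= lam are set
            # exactly to zero, which is why the lasso drops x3 and x4.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return beta
```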
Both shrinkage methods appear worse on the training data but should generalise better to unseen test data. We look at prediction next.
For prediction, the Ridge coefficients using the optimised lambda are:

```
x1  0.41235710
x2 -0.45437900
x3  0.15811501
x4 -0.05434451
```
And for the Lasso:

```
x1  0.6014632
x2 -0.5791605
x3  .
x4  .
```
The MSE on our test data for the three models is:

```
Linear 0.93
Ridge  0.89
Lasso  0.89
```

This shows that the two shrinkage models perform better on this out-of-sample data. Next we will look at the distribution of the coefficients and MSE over many independent samples of toy data. Following this, we will look at MSE as a function of sample size.
Distributional results for 100 samples of 50 values, using Ridge regression:
So x1 and x2 tend to shrink less than x3 and x4, as expected, and x3 shrinks less than x4.
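The repeated-sampling experiment for ridge can be sketched like this (again with an assumed data-generating process and an illustrative lambda of 0.5, not the CV-chosen value):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 4, 0.5
cov = 0.8 + 0.2 * np.eye(p)            # assumed correlation structure
beta_true = np.array([1.0, -1.0, 0.0, 0.0])

# Ridge coefficients over 100 independent samples of size 50.
coefs = []
for _ in range(100):
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta_true + rng.normal(size=n)
    coefs.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))
coefs = np.array(coefs)

# Mean absolute coefficient per predictor: x1/x2 should dominate x3/x4.
print(np.abs(coefs).mean(axis=0))
```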
Distributional results for 100 samples of 50 values, using Lasso regression:
Due to the L1 regularisation, the coefficients are often shrunk all the way to zero, most frequently for x3 and x4, so we can see that x1 and x2 are the more useful predictors. The lasso appears to have worked well in this example at identifying the true predictors.
For 1000 repeated training samples of size 50, it appears that the models have similar MSE on the repeated test samples of size 50:
This is just for one example of training set size - it is important to see how the MSE changes with the training set size in this example, so we can see what (if any) advantages Ridge and Lasso give.
Here we have the comparison of differing sample sizes - for each (training) sample size, we fit to 100 different train/test pairs, where the test sample is always of size 50. We then take an average of the MSE over the 100 train/test pairs to get the mean MSE for that sample size:
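The sample-size sweep can be sketched as follows; the training sizes, lambda, and data-generating process here are illustrative assumptions, and only the ridge model is shown for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
p, lam, n_test, n_rep = 4, 0.5, 50, 100
cov = 0.8 + 0.2 * np.eye(p)            # assumed correlation structure
beta_true = np.array([1.0, -1.0, 0.0, 0.0])

def draw(n):
    # Draw one simulated sample of size n.
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return X, X @ beta_true + rng.normal(size=n)

mean_mse = {}
for n_train in (10, 25, 50, 100):
    errs = []
    for _ in range(n_rep):
        Xtr, ytr = draw(n_train)
        Xte, yte = draw(n_test)          # test sample always size 50
        beta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)
        errs.append(np.mean((yte - Xte @ beta) ** 2))
    mean_mse[n_train] = np.mean(errs)    # average MSE over the 100 pairs
```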
In terms of MSE, the Lasso and Ridge models are more accurate when the training set has fewer than 25 observations. Ridge regression and the simple linear model perform similarly with larger training sets.
As expected, the shrinkage methods are useful for small training samples, which we have shown here. It is likely that we would see even greater benefits for shrinkage methods in high-dimensional feature spaces.
The above analysis uses the shrinkage value that minimises the MSE, but I was interested to investigate the effects of using the one-standard-error rule (mentioned in The Elements of Statistical Learning: "We often use the 'one-standard-error' rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony", and discussed here: https://stats.stackexchange.com/questions/80268/empirical-justification-for-the-one-standard-error-rule-when-using-cross-validat).
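Given per-lambda CV means and standard errors, the rule itself is a small selection step; a sketch:

```python
import numpy as np

def one_se_lambda(lambdas, cv_means, cv_ses):
    # One-standard-error rule: take the most heavily regularised (largest)
    # lambda whose CV error is within one standard error of the minimum.
    best = np.argmin(cv_means)
    threshold = cv_means[best] + cv_ses[best]
    eligible = cv_means <= threshold
    return lambdas[eligible].max()
```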
This leads to a higher MSE on the training set, but one would hope for better out-of-sample generalisation:
It appears not to perform better in the 50/50 train/test example, but it is worth looking at differences as the sample size changes:
The results above are not drastically different from using the shrinkage term that minimises the MSE. It could be argued that Ridge improves for small samples, but both Ridge and Lasso are worse with larger training sets.
This is an issue to keep track of how linear models, Ridge, and Lasso perform on the following toy problem: