paroussisc / stats

Repo to track the material I cover on Wednesdays
MIT License

Toy Problem - Linear/Ridge/Lasso #1

Open paroussisc opened 6 years ago

paroussisc commented 6 years ago

This is an issue to keep track of how Linear Models, Ridge, Lasso perform on the following toy problem:

N <- 50  # sample size (the first analysis below uses 50 points)
x <- data.frame(x1 = runif(n = N))
x$x2 <- runif(n = N)
x$x3 <- x$x1 + runif(n = N)
x$x4 <- x$x2 + x$x3 + runif(n = N)
epsilon <- rnorm(n = N)
y <- x$x1 - x$x2 + epsilon
paroussisc commented 6 years ago

The first set of analyses uses a simulated sample of 50 data points; we will look at the effects of changing the sample size later.

The Linear Model has the following coefficients:

Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x1   0.8930     0.7101   1.258    0.215
x2  -0.7311     0.7218  -1.013    0.316
x3   0.4187     0.7541   0.555    0.581
x4  -0.3732     0.4568  -0.817    0.418

None of these comes out as significant, due to the correlation between the predictors. The MSE of this model is 0.95.
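A minimal base-R sketch of how a fit like this could be produced, assuming the simulation above (the seed is my assumption; the `- 1` drops the intercept, since the coefficient table has no intercept row):

```r
set.seed(1)  # assumed seed, for reproducibility; not in the original
N <- 50
x <- data.frame(x1 = runif(N), x2 = runif(N))
x$x3 <- x$x1 + runif(N)
x$x4 <- x$x2 + x$x3 + runif(N)
y <- x$x1 - x$x2 + rnorm(N)

fit <- lm(y ~ . - 1, data = x)  # "- 1" fits through the origin, matching the table
summary(fit)                    # estimates, standard errors, t values, p-values
mean(resid(fit)^2)              # training MSE
```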

paroussisc commented 6 years ago

Ridge regression for differing values of lambda gives the following plot:

[ridge_plot]

x1 and x2 have the largest coefficients here, so the model is correctly picking up on the main predictors, but ideally it would knock out x3 and x4 completely. The value of lambda that gives the lowest cross-validation score on the training data is 0.52, and this gives an MSE of 1.04.
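The CV-selected ridge fit was presumably produced with a package such as glmnet; as a package-free illustration of what the penalty does, the closed-form ridge solution can be computed directly (note glmnet standardises the predictors and parameterises lambda differently, so these numbers will not match the plot exactly; the seed is assumed):

```r
# Closed-form ridge solution: beta = (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

set.seed(1); N <- 50                  # assumed seed; same simulation as above
x <- data.frame(x1 = runif(N), x2 = runif(N))
x$x3 <- x$x1 + runif(N)
x$x4 <- x$x2 + x$x3 + runif(N)
y <- x$x1 - x$x2 + rnorm(N)
X <- as.matrix(x)

ridge_coef(X, y, 0)    # lambda = 0 recovers the OLS coefficients
ridge_coef(X, y, 10)   # a larger penalty shrinks the whole coefficient vector
```

As lambda grows the coefficient vector shrinks towards zero but never becomes exactly zero, which is why Ridge cannot knock out x3 and x4 entirely.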

paroussisc commented 6 years ago

Using the Lasso gives the following plot for different values of lambda:

[lasso_plot]

The Lasso correctly removes the x3 and x4 coefficients first, which is useful for inference on our data. The value of lambda that gives the lowest CV score is 0.08, and it gives an MSE of 1.04.
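The lasso path was presumably computed with glmnet; to make the mechanism concrete, here is a minimal coordinate-descent lasso in base R (the function names and seed are my own; soft-thresholding is what drives coefficients exactly to zero, unlike the ridge penalty):

```r
# Soft-thresholding operator, the building block of lasso coordinate descent
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimise (1/2n)||y - X b||^2 + lambda * ||b||_1 by cyclic coordinate descent
lasso_cd <- function(X, y, lambda, iters = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    for (j in seq_along(beta)) {
      r <- y - X[, -j, drop = FALSE] %*% beta[-j]        # partial residual
      beta[j] <- soft(crossprod(X[, j], r) / n, lambda) /
                 (crossprod(X[, j]) / n)
    }
  }
  beta
}

set.seed(1); N <- 50                  # assumed seed; same simulation as above
x <- data.frame(x1 = runif(N), x2 = runif(N))
x$x3 <- x$x1 + runif(N)
x$x4 <- x$x2 + x$x3 + runif(N)
y <- x$x1 - x$x2 + rnorm(N)
X <- as.matrix(x)

lasso_cd(X, y, 0.08)   # estimate at the CV-selected lambda quoted above
```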

Both shrinkage methods perform worse on the training data but should generalise better to unseen test data. We look at prediction next.

paroussisc commented 6 years ago

For prediction, the Ridge coefficients using optimised lambda are:

x1  0.41235710
x2 -0.45437900
x3  0.15811501
x4 -0.05434451

And for Lasso:

x1  0.6014632
x2 -0.5791605
x3  .        
x4  .        

The MSE on our test data for the three models is:

Linear  0.93
Ridge   0.89
Lasso   0.89

This shows that the two shrinkage models perform better on this out-of-sample data. Next we will look at the distribution of the coefficients and MSE over many independent samples of toy data. Following this, we will look at MSE as a function of sample size.
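The comparison above can be reproduced schematically: fit on a training sample, then score on a fresh test sample drawn from the same process. A base-R sketch using the linear model and a closed-form ridge fit (the seed is assumed, and lambda = 0.52 is taken from the CV choice quoted above; the issue's exact numbers came from its own CV runs):

```r
set.seed(1)  # assumed seed
sim <- function(N) {                 # one draw from the toy process
  x <- data.frame(x1 = runif(N), x2 = runif(N))
  x$x3 <- x$x1 + runif(N)
  x$x4 <- x$x2 + x$x3 + runif(N)
  cbind(x, y = x$x1 - x$x2 + rnorm(N))
}
mse <- function(pred, truth) mean((pred - truth)^2)

train <- sim(50); test <- sim(50)
fit_lm <- lm(y ~ . - 1, data = train)

# closed-form ridge at the assumed lambda of 0.52
Xtr <- as.matrix(train[, 1:4]); Xte <- as.matrix(test[, 1:4])
b_ridge <- drop(solve(crossprod(Xtr) + 0.52 * diag(4), crossprod(Xtr, train$y)))

c(linear = mse(predict(fit_lm, newdata = test), test$y),
  ridge  = mse(Xte %*% b_ridge, test$y))
```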

paroussisc commented 6 years ago

Distributional results for 100 samples of 50 values, using Ridge regression:

[ridge_dist_coefs]

So x1 and x2 tend to shrink less than x3 and x4, as expected, and x3 shrinks less than x4.
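Results like these can be sketched by repeating the simulation and refitting; here with the closed-form ridge solution at a fixed, assumed penalty (glmnet's standardisation and per-sample CV would change the exact numbers):

```r
set.seed(1)                          # assumed seed
reps <- 100; N <- 50; lambda <- 0.5  # assumed fixed penalty, for illustration
coefs <- replicate(reps, {
  x <- data.frame(x1 = runif(N), x2 = runif(N))
  x$x3 <- x$x1 + runif(N)
  x$x4 <- x$x2 + x$x3 + runif(N)
  y <- x$x1 - x$x2 + rnorm(N)
  X <- as.matrix(x)
  drop(solve(crossprod(X) + lambda * diag(4), crossprod(X, y)))
})
rownames(coefs) <- c("x1", "x2", "x3", "x4")
boxplot(t(coefs), main = "Ridge coefficients over 100 samples")
```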

paroussisc commented 6 years ago

Distributional results for 100 samples of 50 values, using Lasso regression:

[lasso_dist_coefs]

Due to the L1 regularization, the coefficients are often shrunk exactly to zero, most frequently for x3 and x4, so we can see that x1 and x2 are the more useful predictors. The Lasso appears to have worked well in this example at identifying the true predictors.

paroussisc commented 6 years ago

For 1000 repeated training samples of size 50, it appears that the models have similar MSE on the repeated test samples of size 50:

[distr]

This is just for one example of training set size - it is important to see how the MSE changes with the training set size in this example, so we can see what (if any) advantages Ridge and Lasso give.

paroussisc commented 6 years ago

Here we have the comparison of differing sample sizes - for each (training) sample size, we fit to 100 different train/test pairs, where the test sample is always of size 50. We then take an average of the MSE over the 100 train/test pairs to get the mean MSE for that sample size:

[sample_size]

In terms of MSE, the Lasso and Ridge models are more accurate when the training set has fewer than 25 points. Ridge regression and the simple linear model perform similarly with larger training sets.

As expected, the shrinkage methods are useful for small training samples, which we have shown here. It is likely that we would see even greater benefits for shrinkage methods in high-dimensional feature spaces.
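The protocol above can be sketched in base R; for brevity this varies the training size for the plain linear model only (the issue also refits Ridge and Lasso with CV at each size), and the grid of sizes and the seed are assumptions:

```r
set.seed(1)                              # assumed seed
sim <- function(N) {                     # one draw from the toy process
  x <- data.frame(x1 = runif(N), x2 = runif(N))
  x$x3 <- x$x1 + runif(N)
  x$x4 <- x$x2 + x$x3 + runif(N)
  cbind(x, y = x$x1 - x$x2 + rnorm(N))
}

sizes <- c(10, 15, 25, 50, 100)          # assumed grid of training sizes
mean_mse <- sapply(sizes, function(n) {
  mean(replicate(100, {                  # 100 train/test pairs per size
    train <- sim(n); test <- sim(50)     # test sample always of size 50
    fit <- lm(y ~ . - 1, data = train)
    mean((predict(fit, newdata = test) - test$y)^2)
  }))
})
names(mean_mse) <- sizes
mean_mse                                 # mean test MSE for each training size
```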

paroussisc commented 6 years ago

The above analysis uses the shrinkage value that minimises the MSE, but I was interested to investigate the effects of using the one-standard-error rule (mentioned in The Elements of Statistical Learning: "We often use the “one-standard-error” rule when selecting the best model; this acknowledges the fact that the risk curves are estimated with error, so errs on the side of parsimony.", and discussed here: https://stats.stackexchange.com/questions/80268/empirical-justification-for-the-one-standard-error-rule-when-using-cross-validat).

This leads to a higher MSE on the training set, but one would hope for better out-of-sample generalisation:

[dist_1se]
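glmnet exposes this choice as cv.glmnet()$lambda.1se: the largest lambda whose CV error is within one standard error of the minimum. The rule itself is easy to apply by hand; a sketch on a made-up CV curve (the curve and SE values are invented purely to illustrate):

```r
# Invented CV curve over a lambda grid, purely for illustration
lambda <- 10^seq(-2, 1, length.out = 20)
cvm  <- (log10(lambda) + 0.5)^2 + 1       # pretend CV error curve
cvsd <- rep(0.1, 20)                      # pretend standard errors

i_min      <- which.min(cvm)              # index of lambda.min
threshold  <- cvm[i_min] + cvsd[i_min]    # one SE above the minimum
lambda_1se <- max(lambda[cvm <= threshold])  # largest lambda within one SE

c(lambda.min = lambda[i_min], lambda.1se = lambda_1se)
```

Because the 1se choice always applies at least as much shrinkage as the minimiser, it errs on the side of parsimony, which is the behaviour investigated below.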

paroussisc commented 6 years ago

It appears not to perform better in the 50/50 train/test example, but it is worth looking at differences as the sample size changes:

[sample_1se]

The results above are not drastically different from those using the shrinkage term that minimises the MSE. It could be argued that Ridge is improved for small training samples, but both Ridge and Lasso are worse with larger training sets.