scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

LassoCV Does Not Find Best Lambda #1671

Closed burakbayramli closed 11 years ago

burakbayramli commented 11 years ago

I am on scikit-learn version 0.13-git. Here is the problem: a lambda value of 0.3 entered by hand for Lasso performs much better than the 0.013 found by LassoCV, which uses cross-validation. I used the standard diabetes data.
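The gist itself is only linked, not inlined in the thread. A minimal sketch of the comparison being reported, assuming the last-20-samples hold-out split from the scipy-lectures page and using the current scikit-learn API (the original was written against 0.13), might look like:

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

X, y = datasets.load_diabetes(return_X_y=True)

# Hold out the last 20 samples by hand, as in the scipy-lectures example.
X_train, y_train = X[:-20], y[:-20]
X_test, y_test = X[-20:], y[-20:]

# Hand-picked alpha from the report.
lasso_manual = linear_model.Lasso(alpha=0.3).fit(X_train, y_train)

# Alpha chosen by cross-validation on the training set.
lasso_cv = linear_model.LassoCV(cv=3).fit(X_train, y_train)

print("manual alpha=0.3, test R^2:",
      r2_score(y_test, lasso_manual.predict(X_test)))
print("LassoCV alpha=%.4f, test R^2:" % lasso_cv.alpha_,
      r2_score(y_test, lasso_cv.predict(X_test)))
```

The exact scores depend on the split, which is precisely the point the maintainers make below.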

https://gist.github.com/burakbayramli/4750196

I based the code on this page

http://scipy-lectures.github.com/advanced/scikit-learn/index.html#sparse-models

Thanks,

amueller commented 11 years ago

Thanks for reporting. Does any of the linear model folks have time to look into this?

GaelVaroquaux commented 11 years ago

I'll try (not now, I just landed back in Paris and will be happy to get home after travelling), but I believe that it's not a bug: it's the normal behavior of cross-validation. In addition, nothing guarantees in real settings that the better performance of 0.3 does not simply reflect overfitting.


amueller commented 11 years ago

Could you please use train_test_split from sklearn.cross_validation to create the splits or manually shuffle the data? I am not sure that splitting without shuffling is a good idea.
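A shuffled split along the lines amueller suggests (train_test_split shuffles by default) could be sketched as follows; note that in 0.13 this function lived in sklearn.cross_validation, while in current releases it is in sklearn.model_selection:

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# Shuffled 20-sample hold-out, instead of slicing off the last 20 rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, random_state=0)

lasso_cv = linear_model.LassoCV(cv=3).fit(X_train, y_train)
print("alpha selected on a shuffled split:", lasso_cv.alpha_)
```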

ogrisel commented 11 years ago

> Could you please use train_test_split from sklearn.cross_validation to create the splits or manually shuffle the data? I am not sure that splitting without shuffling is a good idea.

Indeed. Also, with the default 3-fold CV each LassoCV fold is one third of the data, which makes the training / test sizes 295 / 147.

Your manual split uses 422 / 20, a very different training size, hence the optimal regularization is not the same. Try increasing the value of cv, for instance to 10, to check whether you converge to a similar solution.
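Checking ogrisel's point directly: with cv=10 each training fold uses about 90% of the data, much closer to the manual 422/20 split, so the selected alpha should move toward the hand-tuned value. A quick comparison (modern API, not the 0.13 one used in the thread):

```python
from sklearn import datasets, linear_model

X, y = datasets.load_diabetes(return_X_y=True)

# Larger cv means larger training folds, which changes the optimal alpha.
for cv in (3, 10):
    lasso = linear_model.LassoCV(cv=cv).fit(X, y)
    print("cv=%d -> alpha=%.4f" % (cv, lasso.alpha_))
```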

burakbayramli commented 11 years ago

I see! When I tried

```python
from sklearn import cross_validation, datasets, linear_model

diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
k_fold = cross_validation.KFold(n=400, k=10, indices=True)
lasso = linear_model.LassoCV(cv=k_fold)
print lasso.fit(X_diabetes, y_diabetes)
print lasso.alpha_
```

I see alpha reported as 0.3, which is the optimal value.