
Why are the estimated Lasso coefficients of almost all variables equal to zero? #8497

Closed: shahlaebrahimi closed this issue 7 years ago

shahlaebrahimi commented 7 years ago

I would greatly appreciate it if you could guide me. I used Bayesian optimization to tune the hyper-parameters of Lasso, but the estimated coefficients of almost all variables are equal to zero.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error
from bayes_opt import BayesianOptimization

X = np.array(pd.read_csv('my_X_table1-1c.csv', header=None).values)
y = np.array(pd.read_csv('my_y_table1-1c.csv', header=None).values.ravel())

# one name per feature column: x1 ... x60
names = ["x%s" % i for i in range(1, X.shape[1] + 1)]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

def Lassocv(alpha):
    # objective for the optimizer: mean 5-fold CV negative MAE (higher is better)
    return cross_val_score(Lasso(alpha=float(alpha), random_state=42),
                           X_train, y_train, scoring='neg_mean_absolute_error', cv=5).mean()

# search alpha in (0, 8); BayesianOptimization maximizes the objective
LassoBO = BayesianOptimization(Lassocv, {'alpha': (0, 8)})
LassoBO.maximize(init_points=2, n_iter=30)
print('Final Results')
print('Lasso: %f' % LassoBO.res['max']['max_val'])
alpha = LassoBO.res['max']['max_params']['alpha']

# refit on the training set with the best alpha found
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
print('train Score:', lasso.score(X_train, y_train))
print('test Score:', lasso.score(X_test, y_test))
print('train MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('test MSE:', mean_squared_error(y_test, lasso.predict(X_test)))
print('train MAE:', mean_absolute_error(y_train, lasso.predict(X_train)))
print('test MAE:', mean_absolute_error(y_test, lasso.predict(X_test)))

print("Features sorted by their score:")

# A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

print("Lasso model:", pretty_print_linear(lasso.coef_,names=names,sort=True))

output:

Final Results
Lasso: -0.786422
train Score: 0.476819501615
test Score: 0.459836314561
train MSE: 1.10511096023
test MSE: 1.02388727356
train MAE: 0.681164633717
test MAE: 0.663119930613
Features sorted by their score:
Lasso model: 0.001 * x25 + 0.0 * x54 + 0.0 * x48 + -0.0 * x17 + 0.0 * x31 + 0.0 * x12 + -0.0 * x15 + -0.0 * x34 + 0.0 * x1 + 0.0 * x2 + -0.0 * x3 + -0.0 * x4 + 0.0 * x5 + 0.0 * x6 + 0.0 * x7 + 0.0 * x8 + 0.0 * x9 + 0.0 * x10 + 0.0 * x11 + 0.0 * x13 + -0.0 * x14 + 0.0 * x16 + 0.0 * x18 + -0.0 * x19 + 0.0 * x20 + -0.0 * x21 + 0.0 * x22 + 0.0 * x23 + -0.0 * x24 + -0.0 * x26 + 0.0 * x27 + -0.0 * x28 + -0.0 * x29 + 0.0 * x30 + 0.0 * x32 + 0.0 * x33 + 0.0 * x35 + 0.0 * x36 + -0.0 * x37 + 0.0 * x38 + 0.0 * x39 + -0.0 * x40 + -0.0 * x41 + -0.0 * x42 + 0.0 * x43 + 0.0 * x44 + 0.0 * x45 + 0.0 * x46 + -0.0 * x47 + -0.0 * x49 + -0.0 * x50 + -0.0 * x51 + 0.0 * x52 + 0.0 * x53 + 0.0 * x55 + 0.0 * x56 + 0.0 * x57 + 0.0 * x58 + 0.0 * x59 + 0.0 * x60

Best regards,

nelson-liu commented 7 years ago

Hi, usage questions like this aren't really appropriate for the issue tracker --- try Stack Overflow or Cross Validated in the future. However, this behaviour is expected. The Lasso uses L1 regularization, which produces sparse solutions (some of the coefficients are driven exactly to zero). It's commonly used for feature selection, since the features corresponding to the non-zero coefficients are thereby "selected".

See the user guide's section on L1-based feature selection for more details.
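For illustration, here is a minimal sketch of that behaviour on synthetic data (the dataset, alpha values, and threshold below are made up for the example, not taken from this issue): as alpha grows, the L1 penalty drives more coefficients exactly to zero, and SelectFromModel turns the surviving coefficients into a feature selector.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# synthetic regression problem: 60 features, only 5 carry signal
X, y = make_regression(n_samples=200, n_features=60, n_informative=5,
                       noise=10.0, random_state=42)

# a stronger L1 penalty drives more coefficients exactly to zero
for alpha in [0.01, 0.1, 1.0, 8.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print("alpha=%-4g -> %d non-zero coefficients" % (alpha, np.count_nonzero(coef)))

# the non-zero coefficients define the "selected" features
selector = SelectFromModel(Lasso(alpha=1.0, max_iter=10000), threshold=1e-10).fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))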

jmschrei commented 7 years ago

LASSO uses the L1 penalty to force as many coefficients to 0 as possible, as a form of feature selection. It seems that the optimization is choosing a high alpha, forcing the coefficients to be very small. As a sanity check, you might try calculating the Pearson correlation between each of your variables and the output to see if x25 is one of the higher ones (see the sketch after this comment).

[EDIT: @nelson-liu and I posted at the same time, but his suggestion of looking at the user guide is a good one]
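A rough sketch of that sanity check, assuming X, y, and names are defined as in the script above (the top-10 cutoff is arbitrary, purely for illustration):

import numpy as np

# Pearson correlation of each feature column with the target
corrs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

# rank features by absolute correlation; if the Lasso's pick is sensible,
# x25 should appear near the top of this list
for name, r in sorted(zip(names, corrs), key=lambda t: -abs(t[1]))[:10]:
    print("%s: r = %+.3f" % (name, r))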