scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

Alternatives to step-wise regularization #159

Open Fish-Soup opened 7 years ago

Fish-Soup commented 7 years ago

After our discussion in #157, I decided to read up more on the advantages and disadvantages of other regularization techniques. I found:

CMARS 2012

CMARS 2008

, where the authors use Ridge (or Tikhonov) regularization within MARS after the initial forward pass to avoid overfitting. This seems like an interesting approach and, as far as I know, would allow warm starts if the initial forward pass were skipped (in reference to #157). I'm not sure how significant a piece of work this would be, but it certainly seems like an interesting option. I would hope it could be extended to a Lasso + Ridge hybrid, like Elastic Net, which takes the advantages of both methods.

Thanks again for all the good work

jcrudy commented 7 years ago

You can actually do this right now using a combination of py-earth and scikit-learn. Just pass enable_pruning=False to the Earth constructor, then put the Earth model in a Pipeline with an ElasticNet from scikit-learn. If you want to refit just the ElasticNet part without refitting the Earth part, though, then you'll have to do that manually. At least, I think so. You might look in the docs or write to the scikit-learn mailing list to make sure.
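Something like this (a minimal, untested sketch; it assumes you already have X_train, y_train and X_test, and the alpha/l1_ratio values are just placeholders):

from pyearth import Earth
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet

# Skip py-earth's backward pruning pass and let the ElasticNet do the regularization.
model = Pipeline([('earth', Earth(enable_pruning=False)),
                  ('enet', ElasticNet(alpha=1.0, l1_ratio=0.5))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)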

Fish-Soup commented 7 years ago

Ahh, I don't think I quite understood how flexible pipelines were. I'll definitely take a look into this. Apart from the ability to warm start (after the MARS bit), can you see any advantages or disadvantages of either method? I assume using elastic net should be faster (even without a warm start)?

Back to #157: do you have a feel for how much extra data I would have to add to change the output of the forward pass? I assume that if I refitted with 1% more data it would have little effect? I suppose I'm asking whether you know how it scales in general, or whether this is set by the problem at hand.

Many thanks

jcrudy commented 7 years ago

I don't know the advantages or disadvantages of using elasticnet vs the regular pruning method. You'll have to experiment. I would be curious to know the results.

Regarding stability of the forward pass, you can generally expect only slightly different predictions for small differences in data. However, the actual model could be substantially different. For example, if you have two highly correlated predictors, a small change in training input could affect which variable gets selected.
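If you want to probe this yourself, something like the following (a quick, untested sketch with made-up data; the selection may or may not actually flip for a given seed) will show what the forward pass picks:

import numpy as np
import pandas as pd
from pyearth import Earth

rng = np.random.RandomState(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly collinear with x1
X = pd.DataFrame({'x1': x1, 'x2': x2})
y = np.abs(x1) + 0.1 * rng.normal(size=n)

# Fit on the full data and on a 99% subset, then compare which of the two
# correlated variables shows up in the selected basis functions.
for m in (n, int(0.99 * n)):
    model = Earth(max_degree=1).fit(X.iloc[:m], y[:m])
    print(model.summary())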

Fish-Soup commented 7 years ago

Hi, so I've tried to put this into practice and it's been a partial success. I have tried Earth => Ridge, Earth => Lasso and Earth => Elastic Net pipelines. In the example case I made, Ridge performs well, having similar accuracy to Earth but taking half as long. Lasso performs less well, which makes sense as it will be removing some of the terms Earth created.

What I don't understand is that Elastic Net performs worse than both. There's a parameter in ElasticNet called l1_ratio which is supposed to define how Lasso-like or Ridge-like it is. Setting l1_ratio to 1 does make it act like Lasso regression, but setting it to 0 does not make it act like Ridge, which is odd.

I suppose something in Elastic Net is reacting poorly to the output MARS is sending it. Do you have any idea why this is? Are the hinge functions maybe exposed to Lasso and Elastic Net? I know you are supposed to normalize the data, but in my example all my x data is normalized to start with.

Many thanks

My code is below

from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from pyearth import Earth
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from datetime import datetime
import math

def func(X):
    return 10. * np.exp(.8 * np.abs(X['x0'])) + 25. * np.sin(X['x0'] * math.pi)

# Generate a data set
np.random.seed(1)
n = 100000
X = pd.DataFrame(np.random.normal(0., 1., size=n), columns=['x0'])
X['x1'] = np.random.uniform(0., 1.0, size=n)
X['x2'] = np.random.normal(0., 1.0, size=n)
X['x3'] = np.random.normal(0., 1.0, size=n)
X['x4'] = np.random.normal(0., 1.0, size=n)

y = 10. * np.abs(X['x0'] - 10.) - \
    20. * X['x1'] * (np.abs(X['x2']) + X['x3'] * np.abs(X['x4'])) + \
    5. * np.random.normal(size=n)  # overwritten below

y = func(X) + 50. * np.random.normal(0., 1.0, size=n)

plt.plot(X['x0'], y, "bo")
plt.show()

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=1)

# model = Pipeline([('earth', Earth()), ('log', SGDClassifier())])  # leftover, unused

dt = datetime.now()
earth_net_0 = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('enet', ElasticNet(l1_ratio=0.0, alpha=1.0))])
earth_net_0.fit(X_train, y_train)
print "net_0= ", datetime.now() - dt

dt = datetime.now()
earth_net_1 = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('enet', ElasticNet(l1_ratio=1.0, alpha=1.0))])
earth_net_1.fit(X_train, y_train)
print "net_1= ", datetime.now() - dt

dt = datetime.now()
earth_net = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('enet', ElasticNet())])
earth_net.fit(X_train, y_train)
print "net= ", datetime.now() - dt

dt = datetime.now()
earth_lasso = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('lasso', Lasso())])
earth_lasso.fit(X_train, y_train)
print "lasso= ", datetime.now() - dt

dt = datetime.now()
earth_ridge = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('ridge', Ridge())])
earth_ridge.fit(X_train, y_train)
print "ridge= ", datetime.now() - dt

dt = datetime.now()
earth_model = Earth(max_degree=4)
earth_model.fit(X_train, y_train)
print "earth= ", datetime.now() - dt

dt = datetime.now()
earth_overfit = Earth(max_degree=4, enable_pruning=False)
earth_overfit.fit(X_train, y_train)
print "overfit earth= ", datetime.now() - dt

Y_test_predict_earth_net = earth_net.predict(X_test)
Y_test_predict_earth_net_0 = earth_net_0.predict(X_test)
Y_test_predict_earth_net_1 = earth_net_1.predict(X_test)
Y_test_predict_earth_lasso = earth_lasso.predict(X_test)
Y_test_predict_earth_ridge = earth_ridge.predict(X_test)
Y_test_predict_earth = earth_model.predict(X_test)
Y_test_predict_earth_overfit = earth_overfit.predict(X_test)

print "~~~"
print 'Earth Overfit Score: %f' % r2_score(y_test, Y_test_predict_earth_overfit)
print 'Earth Score: %f' % r2_score(y_test, Y_test_predict_earth)
print 'Earth Ridge Score: %f' % r2_score(y_test, Y_test_predict_earth_ridge)
print 'Earth Lasso Score: %f' % r2_score(y_test, Y_test_predict_earth_lasso)
print 'Earth Net Score: %f' % r2_score(y_test, Y_test_predict_earth_net)

plt.plot(X_test['x0'], func(X_test), "k.", label="true data")
plt.plot(X_test['x0'], Y_test_predict_earth_overfit, "ro", label="earth overfit")
plt.plot(X_test['x0'], Y_test_predict_earth, "go", label="earth")
plt.plot(X_test['x0'], Y_test_predict_earth_lasso, "cx", label="earth lasso")
plt.plot(X_test['x0'], Y_test_predict_earth_ridge, "mx", label="earth ridge")
plt.plot(X_test['x0'], Y_test_predict_earth_net, "b.", label="earth net")
plt.plot(X_test['x0'], Y_test_predict_earth_net_0, "g.", label="earth net_0")
plt.plot(X_test['x0'], Y_test_predict_earth_net_1, "r.", label="earth net_1")
plt.legend()
plt.show()

jcrudy commented 7 years ago

@Fish-Soup I'm not super familiar with scikit-learn's elastic net, but if it's expecting normalized data then you might try putting a StandardScaler in your pipeline between the Earth and ElasticNet steps. Normalized inputs to a MARS model will not generally result in normalized basis functions.

Fish-Soup commented 7 years ago

So I've just tried inserting a StandardScaler into the pipelines, but it doesn't seem to have helped. I think this is the right track though:

earth_net = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)), ('std_scal', StandardScaler()), ('enet', ElasticNet())])

So as I understand it, Lasso adds a regularization term that is the sum of the absolute values of the fitted coefficients, e.g. |p1| + |p2| if we had a two-parameter fit. Ridge uses the sum of squares, p1^2 + p2^2. Lasso has no cross components and is good at choosing between two different variables, e.g. in the above case it would pick p1 or p2 but not both. Ridge lacks the capacity to ever remove terms entirely (though it can take them close to 0). It's better when you have correlated variables, as it doesn't just pick one or the other but can keep a bit of both.

Elastic Net is a combination of Ridge and Lasso, where the user can choose how much of one or the other to apply. I was planning on using grid search to tune this. My confusion is that I can tune this parameter so that it should basically make Elastic Net equal to Ridge, but it doesn't... although I can make it match Lasso.
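Writing the mixing out as a little helper (this is my reading of the convention scikit-learn uses, with a 1/2 on the squared term, so treat the exact scaling as something to verify):

import numpy as np

def enet_penalty(w, alpha, l1_ratio):
    # l1_ratio=1 gives a pure Lasso-style |w| penalty, l1_ratio=0 a pure Ridge-style w^2 penalty
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)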

A consideration for all the above is that your x data should be normalized, so that the mean is 0 and the standard deviation is constant across features, keeping the scale of p1 and p2 consistent (otherwise certain terms will be chosen over others just because their data has a larger range). Although all my x data is normalized, I am not sure whether whatever Earth exposes via the pipeline to Ridge, Elastic Net or Lasso is. Is it just the input data, or does it include the hinge functions or something like that?

I find it a bit hard to understand what is actually going on inside pipelines... many thanks.

jcrudy commented 7 years ago

@Fish-Soup I'm not sure why the Elastic Net isn't behaving as expected. Perhaps the scikit-learn people can give you advice on that. In general, you can perform the steps of the Pipeline manually as follows (code untested):

# Step 1: fit the Earth model and get its basis matrix.
model1 = Earth()
model1.fit(X, y)
# Step 2: scale the basis matrix.
model2 = StandardScaler()
model2.fit(model1.transform(X), y)
# Step 3: fit the ElasticNet on the scaled basis matrix.
model3 = ElasticNet()
model3.fit(model2.transform(model1.transform(X)), y)
# Prediction chains the same transforms.
y_pred = model3.predict(model2.transform(model1.transform(X)))

That's really all it does. You can use this breakdown to debug your pipeline more easily.

Fish-Soup commented 6 years ago

Cool, sorry to take so long to reply. It appears that one of the columns generated by model1.transform(X) is a column of ones, which obviously can't be normalised. This doesn't seem to make much sense to me now that I'm putting it down in writing. I'll check again in the morning.

jcrudy commented 6 years ago

@Fish-Soup One of the output columns from an Earth model's transform method is a constant, representing the intercept term. If that's what's causing problems, you could add a stage to remove it, as it's unnecessary as an input to the ElasticNet.
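For example (an untested sketch, my suggestion rather than anything built into py-earth, reusing the imports from the earlier code): scikit-learn's VarianceThreshold with its default threshold drops zero-variance columns, which should catch the constant intercept column:

from sklearn.feature_selection import VarianceThreshold

earth_net = Pipeline([('earth', Earth(max_degree=4, enable_pruning=False)),
                      ('drop_const', VarianceThreshold()),  # removes the all-ones intercept column
                      ('std_scal', StandardScaler()),
                      ('enet', ElasticNet())])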

Fish-Soup commented 6 years ago

So it turns out my mistake was assuming that the alpha parameter in ElasticNet means the same thing as the alpha in Ridge regression. It doesn't, because of the way the objective is structured.
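For reference, this is how I read the scikit-learn docs (worth double-checking against your version):

# Ridge minimizes:       ||y - Xw||^2 + alpha * ||w||^2
# ElasticNet minimizes:  ||y - Xw||^2 / (2 * n_samples)
#                          + alpha * l1_ratio * ||w||_1
#                          + 0.5 * alpha * (1 - l1_ratio) * ||w||^2
# So with l1_ratio=0, ElasticNet(alpha=a) corresponds roughly to Ridge(alpha=n_samples * a),
# not to Ridge(alpha=a).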

Fitting MARS+net is twice as fast as MARS alone in the case where we already know the hyperparameters. Doing a grid search over a 100-by-10 grid for the two hyperparameters (roughly as sketched below) results in consistent fits between the two approaches.
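The grid search looks something like this (a sketch; the grid values are placeholders rather than the exact ones I used, and it reuses the earth_net pipeline and imports from earlier):

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

param_grid = {'enet__alpha': np.logspace(-4, 2, 100),
              'enet__l1_ratio': np.linspace(0.0, 1.0, 10)}
search = GridSearchCV(earth_net, param_grid, scoring='r2', cv=3)
search.fit(X_train, y_train)
print(search.best_params_)

Note that this refits the Earth step for every grid point; to tune just the ElasticNet part you would transform the data with a fitted Earth once and grid search over the transformed basis instead.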

In the case where not much regularization is required, plain MARS is faster; in the case where a lot of regularization is needed, MARS+net is faster.

As I am updating my models daily, I plan to store my hyperparameters and periodically re-test them, so MARS+net will be faster for me. I'm also tempted not to refit the MARS component of MARS+net every time, in which case it will be factors of 100 faster.