scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License

Fixed knot points (in one-dimensional case) #84

Closed 1pakch closed 8 years ago

1pakch commented 8 years ago

Thanks for the excellent work. I am using the package to estimate intraday volatility profiles (the model is "seasonality coefficient ~ time of the day"), and sometimes the model does not pick up obvious change points such as the closing/reopening times of the market.

Judging by out-of-sample predictions it is clear that having these breakpoints in a model a priori would improve out-of-sample performance. I am wondering if imposing knot points a priori might be a desired feature for the package? If yes, I would be willing to work on it.

jcrudy commented 8 years ago

@aickley, thanks for the feedback. If I understand correctly, I think what you want can be accomplished by pre-processing your data before feeding it into py-earth. For example, if you have some volatility values in one column and a market open/closed indicator in another column, you could do MARS/regression against a third column that you calculate from those two. Does that work for your use case, or is there something I'm missing? Perhaps there are benefits to having the Earth object keep track of this information?

In general, I like the idea of having more control over what kinds of relationships py-earth will use. I haven't thought a lot about exactly what that looks like, however. I would appreciate ideas.

1pakch commented 8 years ago

@jcrudy, thank you for the answer. I am not sure I understand your suggestion. Right now I am regressing a volatility feature on a time-of-day feature, which yields a seasonality profile of the volatility. What I would like to do is ensure that a new segment starts in the period when the market is closed:

[attached image: download]

I need to read up on the method to better understand how having fixed points could fit in there.

jcrudy commented 8 years ago

@aickley I was suggesting something like the following. Say your data set is in a pandas dataframe df with your volatility feature in the v column, and you want to make sure that a list of knots (called knots below) is considered. You could do something like this (untested, hastily written):

import numpy

# For each fixed knot k, add a pair of hinge features built from column 'v'.
for i, k in enumerate(knots):
    df['v_%d_+' % i] = numpy.maximum(0, df['v'] - k)
    df['v_%d_-' % i] = numpy.maximum(0, k - df['v'])

and then fit your model on the resulting df. There are other ways of doing it, too. You're not restricted to using MARS-style hinge functions.
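For anyone who wants to try the idea end to end, here is a minimal self-contained sketch using only numpy. It rests on several assumptions that are not from the thread: the hinges are built on the time-of-day predictor (called t here, since the knots in this use case are times of day), the data are synthetic and noise-free, the helper name hinge_features is made up for illustration, and a plain least-squares fit stands in for whatever model you actually use.

```python
import numpy as np

def hinge_features(x, knots):
    """Return hinge pairs max(0, x - k) and max(0, k - x) for each knot k.

    Hypothetical helper for illustration; not part of py-earth.
    """
    x = np.asarray(x, dtype=float)
    cols = []
    for k in knots:
        cols.append(np.maximum(0.0, x - k))
        cols.append(np.maximum(0.0, k - x))
    return np.column_stack(cols)

# Synthetic, made-up data: time of day in hours, with slope changes
# at 12:00 and 13:00 (standing in for a market close and reopen).
t = np.linspace(9.0, 17.0, 81)
v = 1.0 + 0.5 * np.maximum(0.0, 12.0 - t) + 0.8 * np.maximum(0.0, t - 13.0)

# Fixed knots imposed a priori.
knots = [12.0, 13.0]
H = hinge_features(t, knots)

# Intercept plus hinge features, fitted by ordinary least squares.
# With several knots the hinge pairs and the intercept are linearly
# dependent, so lstsq returns a minimum-norm solution; compare
# predictions rather than individual coefficients.
X = np.column_stack([np.ones_like(t), H])
coef, *_ = np.linalg.lstsq(X, v, rcond=None)
v_hat = X @ coef

print("max abs error:", np.max(np.abs(v_hat - v)))
```

The same feature matrix could be handed to another regressor instead of the least-squares step; the point is only that the breakpoints are present in the inputs before fitting.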

Does that example satisfy your use case? I could think of some reasons why it might not. I'd like to understand better so I can make sure what you're working on can be addressed. I suspect you aren't the only one who wants to do what you want to do.

jcrudy commented 8 years ago

One feature idea might be to pass Earth a list of candidate knot locations for each input variable. Or even a weighted list.
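To make the candidate-list idea concrete, here is a rough sketch of what restricting the knot search to user-supplied locations could look like for a single input variable. Everything in it is an assumption for illustration: select_knots and rss_with_knots are hypothetical helpers, not an existing Earth parameter or py-earth's internal algorithm, and the greedy search is a simplified stand-in for a real forward pass. The weighted variant is not covered.

```python
import numpy as np

def rss_with_knots(x, y, knots):
    """Residual sum of squares of a least-squares fit on an intercept
    plus one hinge pair per knot (hypothetical helper)."""
    cols = [np.ones_like(x)]
    for k in knots:
        cols.append(np.maximum(0.0, x - k))
        cols.append(np.maximum(0.0, k - x))
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def select_knots(x, y, candidates, max_knots):
    """Greedily pick up to max_knots knots, considering only the
    user-supplied candidate locations (hypothetical helper)."""
    chosen = []
    remaining = list(candidates)
    for _ in range(max_knots):
        if not remaining:
            break
        # Pick the candidate that lowers the residual sum of squares most.
        best = min(remaining, key=lambda k: rss_with_knots(x, y, chosen + [k]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic, made-up data with a single slope change at x = 4.
x = np.linspace(0.0, 10.0, 101)
y = 1.0 + 2.0 * np.maximum(0.0, x - 4.0)

picked = select_knots(x, y, candidates=[2.0, 4.0, 7.0], max_knots=1)
print(picked)
```

On this noise-free data only the candidate at 4.0 reproduces the response exactly, so it is the one selected.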

jcrudy commented 8 years ago

Closing this for now.