scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
455 stars 121 forks

Fitting MARS with time series data #157

Open Fish-Soup opened 7 years ago

Fish-Soup commented 7 years ago

So the dataset I use gets updated daily to contain new data. I then run MARS on the resulting data. Obviously, with only one extra data point, the chances of the model being very different day on day are small. Is there a way to use the previous model as an initial guess to speed up computation? Cheers

Fish-Soup commented 7 years ago

I should have said "Is there a way to use the previously fitted model as an initial guess for the model with one more data point" in order to speed up computation.

jcrudy commented 7 years ago

@Fish-Soup No, not currently. An option would be to just redo the pruning pass and linear fit with the new data, but not the forward pass. That means your knot locations would not change, but coefficients and pruning might. This would not be expected to give as good a fit as starting from scratch, however.

Fish-Soup commented 7 years ago

How much time would this save? Is this something I can do by accessing some of the private methods of the Earth object?

I suppose I could fit, check whether the R^2 increases by more than some tolerance, and if it does, re-compute with the full forward pass etc.

I am open to any other suggestions for solving the issue. Essentially, since I am predicting the future, I think the sensitivity of gas/power demand may change over time. I wanted to assess how big an impact this is by successively fitting the model with only the data it would have had at hand for a given prediction point. However, the successive fitting is just far too slow. I have a few approaches for predicting the future (two MARS models with different parameters and an analytic formula), and I would really like to compare them. Do you think it might be any faster to successively fit MARS within the module itself? E.g. I provide a mapping of fit data to predict data, and MARS fits and predicts within the code. I would have to do fewer loops in Python, which does slow things down. I suspect, though, that it's the actual fitting of the model which accounts for most of the slowdown.

The other thought I had was to model, over the short term, how the model evolves as more data arrives. I suppose I could fit a MARS model on the difference between two MARS models fitted with successively more data. This would still be time consuming, but in that case I thought I might pickle MARS models as they are generated and read them back in. Not sure if you have any ideas for a better approach?

jcrudy commented 7 years ago

@Fish-Soup The Earth class has methods forward_pass, pruning_pass, and linear_fit, which allow you to separate fitting into its different stages. How much time skipping the forward pass saves is dependent on your data, but I would guess about a 2x speed up.

I don't fully understand your situation, so I could be off here. Both of your ideas--doing partial refits and modeling differences--seem potentially workable in practice. However, if you really want to compare the methodologies, you are putting MARS at a disadvantage by doing these things, since the fit will not necessarily be as good. If I'm correct in understanding that the goal is comparing the methodologies, then, if it's feasible, I would recommend simply using more computer time. You could fit multiple MARS models simultaneously on different machines, for example using EC2, and then compare them, or just wait longer. I don't know, though, how feasible this is in your situation.

> Do you think it might be any faster to successively fit MARS within the module itself? e.g. i provide a mapping of fit to predict data and get MARS to fit and predict within the code. I would have to do less loops in python which does slow it down. I'm thinking though its the actual fitting of the model which is taking up most of the slow down.

I'm not sure I understand this part.

Fish-Soup commented 7 years ago

It appears that in my attempt to be to the point I was just unclear. :) I am trying to simulate gas/electricity demand, so I've set up my code to run several approaches on different sets of data. I currently have two versions of MARS with different X data and one analytic formula that is provided by the grid. My code takes a weighted average of these three attempts that minimizes the imbalance cost (which is proportional to the absolute error). In order to get these weights fairly, I need to back-test in the same way as I will receive data in the future. For the market I'm working in, there is currently a two-week delay between delivering gas/electricity and receiving the allocation (which provides my y value). So in order to get my weights, I need to produce a set of predictions, each of which only has access to data two weeks old and older. As I have to run the forecast daily, re-modelling and calculating all these y values takes far too long. It is true I could save them down and only calculate new y values, but I am hesitant to do this while the model is still a work in progress: if I change something, I don't want to have to go back and re-calculate all the historic y values. I suppose I was hoping that, as with some regularized methods like Elastic Net, you could give the fit an initial guess. That way, each day I would add one more day of data to the model and use the old model as a warm start; I assume the model would only change marginally. It appears this is not really possible with MARS, as it uses stepwise regression rather than a regularization parameter. I thought maybe I could run the forward pass once a week, and otherwise just run the pruning, or maybe even just the linear fit?
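To illustrate what I mean by an initial guess: scikit-learn's Elastic Net can reuse the previous coefficients as the starting point for the next `fit` call via `warm_start=True` (the data below is made up just to show the mechanism):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = ElasticNet(alpha=0.01, warm_start=True, max_iter=10000)
model.fit(X, y)  # initial fit

# One more day of data: with warm_start=True, this fit starts from the
# previous coefficients instead of from zero, so it converges quickly.
X_new = np.vstack([X, rng.normal(size=(1, 10))])
y_new = np.append(y, X_new[-1, 0] - 2 * X_new[-1, 1])
model.fit(X_new, y_new)
```

MARS has no analogous mechanism, as far as I can tell, because the forward pass is a stepwise search rather than an iterative optimization.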

I appreciate that doing any of the above puts MARS at a disadvantage, but currently I'm left with assigning the weights using in-sample testing, which is not ideal either. I've tried using Fast MARS and it's still too slow, as I have to run the model about 500 times per customer (I have about 500 days of data).

The comment you didn't understand was me wondering whether some of what I'm trying to achieve would be better done within the MARS algorithm itself, as it could be written in Cython/C. I imagine time-series forecasting like this is a fairly common need. To facilitate it, I imagined providing a dictionary whose keys are model IDs and whose values are lists of the indexes of the data points each model should be fit on. Then in the predict step you might pass a dict that specifies which model to use to predict each set of data. I suppose you could also use this for other in- and out-of-sample testing.

I suppose I just hoped doing this inside the algorithm might be faster.

Fish-Soup commented 7 years ago

Also, good shout on EC2, will look into that.

jcrudy commented 7 years ago

@Fish-Soup I understand now. Yes, because of the stepwise nature of MARS, there is no existing method to adjust a fitted model to new training data. If you really need to do hundreds or thousands of model fits, it's going to take a lot of computer time. You might think about whether you can avoid all that model fitting. The various approximations you mentioned may be workable, or you may be able to make other assumptions or approximations. Depending on which features of py-earth you need and whether you want to use R, you could also try the R earth package, which is significantly faster (about 5x the speed, last time I checked).

Fish-Soup commented 7 years ago

Thanks for the help, I didn't realize that the R version was faster. I would personally rather stick with Python, as my intention is to try some neural network techniques from scikit-learn at some point. I think what I will probably do is save down old models and their fits so I don't have to regenerate them. But I may come back to this when the models are a little more developed. I think it may be taking MARS longer to fit power than gas because the relationships with the available data are less clear; with gas I may be able to use the approach I wanted to. I'm also hoping to get access to a server at work, so I would have more compute available, which may help get rid of a lot of these problems.

Fish-Soup commented 6 years ago

My current idea is that every time I get new data, I try to predict it using the previously fitted model. If the error is less than the median in-sample error, I will assume the new data point would not alter the fit. If it is poorly predicted, I will refit the model.

Any drawbacks to this approach?
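In code, the rule is roughly this, with `LinearRegression` standing in for `Earth` just to illustrate (the logic only assumes an estimator with fit/predict; all names and data here are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
median_err = np.median(np.abs(y - model.predict(X)))

def update(model, X, y, x_new, y_new, median_err):
    """Refit only if the new point is predicted worse than the median
    in-sample error; always keep the point for the next refit."""
    X = np.vstack([X, x_new])
    y = np.append(y, y_new)
    if abs(y_new - model.predict(x_new.reshape(1, -1))[0]) > median_err:
        model = model.fit(X, y)  # poorly predicted: refit on all data
        median_err = np.median(np.abs(y - model.predict(X)))
    return model, X, y, median_err

x_new = rng.normal(size=3)
y_new = x_new @ np.array([1.0, -2.0, 0.5])
model, X, y, median_err = update(model, X, y, x_new, y_new, median_err)
```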

jcrudy commented 6 years ago

@Fish-Soup The only issue I see is that you need to make sure you accumulate all new data points and use them all when you do the refit. Otherwise, you'll create substantial bias toward these larger residual cases. I'm guessing this is what you mean. You'll still have a little bias from this approach, but if your initial data set is sufficiently large then I doubt it matters. On the other hand, a non-biased batch approach, say just refitting once a month or something, might be just as good in practice. If you have timestamps on your data points, you could run simulations to determine the performance of these strategies.

Fish-Soup commented 6 years ago

Yeah, I will refit with all available data any time I decide to refit. Having now also put MARS-ElasticNet into practice, I think I can fit once to get an overfit MARS model and the Elastic Net hyperparameters, then refit just the Elastic Net part each time; this should be fast. I can then refit the rest periodically, or only when the above limits are reached.

Fish-Soup commented 6 years ago

Hi

Having incorporated MARS-ElasticNet, where I use Elastic Net for the pruning pass, I have substantially sped up the process, as I can store the ideal hyperparameters for Elastic Net, and once that is done it is very fast.

I do still sometimes have an issue with the time taken in the forward pass. I take it there is no possible way of warm-starting the forward pass such that the initial knot positions and basis functions are used as an initial guess?
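Roughly what I'm doing, sketched with a hand-built hinge basis standing in for the basis matrix produced by py-earth's forward pass (everything below is made up for illustration, not py-earth's pruning code): the L1 part of the penalty zeroes out, i.e. "prunes", the basis functions that don't help.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=300)
y = 2 * np.maximum(0, x - 0.5) + rng.normal(scale=0.05, size=300)

# Hinge basis max(0, x - t) at candidate knots, as the forward pass would
# produce; in practice this matrix would come from the overfit MARS model.
knots = np.linspace(0.05, 0.95, 19)
B = np.maximum(0.0, x[:, None] - knots[None, :])

# Cross-validated Elastic Net over the basis acts as the pruning pass.
enet = ElasticNetCV(cv=5, l1_ratio=0.9).fit(B, y)
kept = np.flatnonzero(np.abs(enet.coef_) > 1e-8)  # surviving basis funcs
```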

Cheers

jcrudy commented 6 years ago

@Fish-Soup While it might be possible theoretically, there's currently no algorithm that I know of to do it, much less any implementation. You might consider fitting a new model on the residual rather than refitting your existing model, but you'd have to experiment with that.
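A minimal sketch of the residual idea, again with `LinearRegression` standing in for `Earth` (all data and names here are illustrative): the expensive base fit is done once, and only a cheap residual model is refit as data accumulates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_old = rng.normal(size=(200, 2))
y_old = X_old @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=200)
base = LinearRegression().fit(X_old, y_old)  # expensive fit, done once

# New data arrives with a slightly drifted relationship; instead of
# refitting the base model, fit a cheap model on its residuals.
X_new = rng.normal(size=(50, 2))
y_new = X_new @ np.array([1.0, 2.0]) + 0.3 * X_new[:, 0] ** 2
X_all = np.vstack([X_old, X_new])
y_all = np.concatenate([y_old, y_new])
resid = LinearRegression().fit(X_all, y_all - base.predict(X_all))

def predict(X):
    # Combined prediction: base model plus residual correction.
    return base.predict(X) + resid.predict(X)
```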