heidelbergcement / hcrystalball

A library that unifies the API for most commonly used libraries and modeling techniques for time-series forecasting in the Python ecosystem.
https://hcrystalball.readthedocs.io/
MIT License

[OTHER] Question: AR coefficients, sklearn lags understanding #65

Open lucheroni opened 3 years ago

lucheroni commented 3 years ago

Hello.

A very simple question.

Once I've estimated an autoregression as a wrapped linear model, say as

model_lr = get_sklearn_wrapper(LinearRegression, lags=10)
mlr = model_lr.fit(X[:-10], y[:-10])

how do I get the autoregression coefficients?

I cannot find the equivalent of scikit-learn's .coef_, as in

mlr = LinearRegression().fit(X, y)
mlr.coef_

Thanks.

pavelkrizek commented 3 years ago

Hello,

thanks for the question - it's a good sign that we should extend the documentation, which doesn't cover this yet.

All models have a model attribute which allows accessing the original estimator. In the case of sklearn the situation is a little more complicated, because the actual fit happens only in the predict method (the length of the X passed to predict is not known at the time of fitting). This is an implementation detail, but the consequence is that for get_sklearn_wrapper you can access coef_ only after calling the predict method.

from sklearn.linear_model import LinearRegression
from hcrystalball.wrappers import get_sklearn_wrapper

model_lr = get_sklearn_wrapper(LinearRegression, lags=10)
_ = model_lr.fit(X[:-10], y[:-10]).predict(X[-10:])  # the fit happens inside predict
model_lr.model.coef_

lucheroni commented 3 years ago

Thank you for the swift answer. It works. Very interesting library!

I have another simple question, if you don't mind.

I'm specifically interested in the possibility of using sklearn-API regressors as autoregressive models for time-series prediction, hence I need a good understanding of the hcrystalball data-parsing model for time series.

My question refers to the section 'Autoregressive Modelling in Sklearn' (https://hcrystalball.readthedocs.io/en/stable/examples/tutorial/wrappers/02_ar_modelling_in_sklearn.html) and to the figure there (there must be two errors in it: x2 is skipped everywhere, and an x95 should be an x96 - second green line from the top, far right).

If you have, say, 100 datapoints (y0, ..., y99), horizon 1 and lags 3, and e is the iid error, is it correct to say that after calling get_sklearn_wrapper(LinearRegression) the y series is first organized as y3 = f(y0,y1,y2) + e, y4 = f(y1,y2,y3) + e, and so on until y99 = f(y96,y97,y98) + e, and that this matrix is then passed to sklearn.linear_model.LinearRegression?

Thank you.


MichalChromcak commented 3 years ago

Hi @lucheroni,

You are right - it should be x1, x2, x3, ... and later on x95, x96, x97; x96, x97, x98; etc. Thanks for spotting it.

And yes, in this case y0, y1, y2 are passed as X to the LinearRegression and y3 as y. One could also formulate it relative to y as yt = f(yt-3, yt-2, yt-1) + e.
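For illustration, here is a plain-numpy version of that lag matrix (just a sketch of the idea, not the wrapper's internal code):

import numpy as np

y = np.arange(100, dtype=float)  # y0 ... y99
lags = 3

# each row holds [yt-3, yt-2, yt-1]; the target is yt
X_lag = np.array([y[t - lags:t] for t in range(lags, len(y))])
target = y[lags:]
X_lag[0], target[0]    # (array([0., 1., 2.]), 3.0)   -> y3 = f(y0, y1, y2)
X_lag[-1], target[-1]  # (array([96., 97., 98.]), 99.0) -> y99 = f(y96, y97, y98)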

MichalChromcak commented 3 years ago

@lucheroni, @pavelkrizek I updated the picture, but don't want to close it in case there are still some open questions. Can I close it as is? :)

lucheroni commented 3 years ago

I asked because I'm trying to check the in-sample predictions, and I'm getting strange behavior. In the following snippet I define an AR(2) model with two parameters only (no constant), so I expect the first in-sample forecast to be y2 = f(y0,y1). Instead I get the following, which I don't understand:


import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from hcrystalball.wrappers import get_sklearn_wrapper

# AR order
ar_ord = 2
# horizon
hz = 1
# ten values
train = np.array([0.,1.,0.,1.,0.,1.,0.,1.,0.,1.])
y = train
date_index = pd.date_range(start="2020-01-01", periods=len(y), freq="h")
X = pd.DataFrame(index=date_index)
#
model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False)
model_lry.fit(X[:-hz], y[:-hz])
model_lry.model.coef_     # array([0., 1.]) - two coeffs, ok
#
model_lry.predict(X[-1:]) # 1 - 10th val, correct prediction
model_lry.predict(X[-2:]) # 1,0 - shouldn't it be 0,1?
model_lry.predict(X[-3:]) # 1,0,1
model_lry.predict(X[-4:]) # 1,0,1,0
model_lry.predict(X[-5:]) # 1,0,1,0,1
model_lry.predict(X[-6:]) # 1,0,1,0,1,0 from 5th val on
model_lry.predict(X[-7:]) # ? 0,0,0,0,0,0,0 - getting zeros only
ambader commented 3 years ago

It's all right; the formula to predict an AR(2) variable is x_t = a1*x_{t-1} + a2*x_{t-2}.

You might have gotten confused because the x go back in time, so the array is 'reversed'. Predicting the first value of your last prediction: x3 = a1*x2 + a2*x1 with (a1, a2) = (1., 0.), i.e. 1*0 + 0*1 = 0, and from then on all predictions become 0.

I used this code to show the differences:

for i in range(1,8):
    print("model_lry.predict(X[-"+str(i)+":])")
    print(model_lry.model.coef_)
    print(model_lry.predict(X[-i:]))

There you'll also see that the coefficients change. That is indeed another source of confusion, as they are not labeled with the changing x_{t-1}, x_{t-2}. Hope that was what you wanted to understand.
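To make the hand-check explicit (a quick sketch, using the coefficients the refit produced):

# iterate x_t = a1*x_{t-1} + a2*x_{t-2} with (a1, a2) = (1., 0.)
a1, a2 = 1., 0.
x_prev2, x_prev1 = 1., 0.   # last two observed values of the alternating series
preds = []
for _ in range(5):
    x_next = a1 * x_prev1 + a2 * x_prev2
    preds.append(x_next)
    x_prev2, x_prev1 = x_prev1, x_next
print(preds)  # [0.0, 0.0, 0.0, 0.0, 0.0] - zeros only, as observed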

lucheroni commented 3 years ago

Thank you for the clarification.

My point is related exactly to what you embedded in the code.

It seems that the library re-fits the model each time one calls .predict(), and in a different way each time. That is at least what I see when I run the code fragment above - changing coefficients.

Usually, on the contrary, once you have prepared your design matrix you fit the model once, and the coefficients stay the same whatever you do with the model after fitting. In fact, in statsmodels .predict() is called a post-estimation method.
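For instance, the fit-once workflow in statsmodels looks like this (AutoReg here is just one illustrative choice):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

y = np.tile([0., 1.], 50)
res = AutoReg(y, lags=2, trend="n").fit()  # estimate once, no constant
res.params                                  # coefficients are now fixed
res.predict(start=90, end=99)               # in-sample prediction, no refit
res.predict(start=100, end=104)             # out-of-sample, same coefficients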

It seems to me that the library registers model_lry.model.coef_ in the namespace only after .predict(), and that at that point it refits the model.

Am I wrong on that?

ambader commented 3 years ago

Finally found the problem. The first thing to notice: "Actual model fitting is done in predict method since the way model is fitted depends on prediction horizon which is known only during predict call." (see help(model_lry.fit)). The wrapper therefore fits anew with every prediction, using the length of the prediction X as the horizon. But when I tried

model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False, optimize_for_horizon=False)

the coefficients after predict were still altered. Digging in the code of wrappers/_sklearn.py finally revealed what I identify as the problem: when predict is called, the bool optimize_for_horizon is tested, but at that point the _X for the fitting regression has already been set - always with the length of the prediction X as the horizon. I forked the code and changed it; I hope the change is correct.
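To illustrate what I mean by the horizon entering the fit, here is a sketch of my reading of the code, not the actual wrapper internals (make_ar_design is a made-up helper):

import numpy as np

def make_ar_design(y, lags, horizon):
    # each row holds `lags` past values; the target lies `horizon` steps ahead,
    # so the fitted coefficients change with the horizon
    X, t = [], []
    for i in range(lags, len(y) - horizon + 1):
        X.append(y[i - lags:i])
        t.append(y[i + horizon - 1])
    return np.array(X), np.array(t)

y = np.tile([0., 1.], 10)
make_ar_design(y, lags=2, horizon=1)[1][:3]  # targets y2, y3, y4
make_ar_design(y, lags=2, horizon=3)[1][:3]  # targets y4, y5, y6 - a different regression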

pavelkrizek commented 3 years ago

@ambader Could you please supply an example that reproduces the problem you are facing? Thank you.

ambader commented 3 years ago

It's still the same problem this whole thread is about:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from hcrystalball.wrappers import get_sklearn_wrapper

ar_ord = 2
hz = 10
train_me = np.array([0.,1.,0.,1.,0.,1.,0.,1.,0.,1.])
train_me = np.tile(train_me, 100)
y = train_me
date_index = pd.date_range(start="2020-01-01", periods=len(y), freq="d")
X = pd.DataFrame(index=date_index)

for i in range(1, 8):
    # a fresh wrapper each round; only the length of the X passed to predict changes
    model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False, optimize_for_horizon=False)
    print("model_lry.predict(X[-9:-" + str(i) + "])")
    model_lry.fit(X[:-11], y[:-11])
    print(model_lry.predict(X[-9:-i]))
    print(model_lry.model.coef_)

pavelkrizek commented 3 years ago

Ok, this behavior is expected with the current implementation, as the model is a function of the training data, and the training data are defined by the horizon. There are other ways to implement a regressor as a time-series forecaster, each with its advantages and drawbacks. PRs for new implementations are welcome :)
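For context, one common alternative design is recursive forecasting: fit a one-step-ahead model once and feed predictions back in as lags. A minimal sketch of that idea (not hcrystalball code):

import numpy as np
from sklearn.linear_model import LinearRegression

def recursive_forecast(y, lags, horizon):
    # fit a single one-step-ahead model ...
    X = np.array([y[i - lags:i] for i in range(lags, len(y))])
    model = LinearRegression(fit_intercept=False).fit(X, y[lags:])
    # ... then roll it forward, feeding predictions back as inputs
    history = list(y)
    preds = []
    for _ in range(horizon):
        preds.append(model.predict([history[-lags:]])[0])
        history.append(preds[-1])
    return np.array(preds)

The drawback is that errors compound over the horizon; the advantage is that the coefficients are estimated once and never change.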

Initially we also had the horizon passed explicitly to get_sklearn_wrapper, which made the behavior more understandable and deterministic, but it made usage less flexible and added complexity (e.g. what to do when the X passed to predict doesn't have the same length as the provided horizon...).

lucheroni commented 3 years ago

I hope you'll be able to implement the method in a different way, if you want the scikit-compatibility feature of your library to be adopted by the time-series community.

As it is, the method is not compatible with the way econometricians work: in econometrics, estimation is done on a given interval and must be kept separate from prediction inference on other intervals (which may or may not include the original estimation interval); take statsmodels as the classic example. Imagine an estimation which takes a very long time (days), after which you need to predict your data on different time segments, at will. You save the fitted model and then explore its capabilities. You don't want to re-estimate it at each try.

The advantage I thought I could find in hcrystalball was that it could be a nice interface for doing quick econometrics with regularization (not easy in Python, whereas scikit-learn includes ridge and lasso for static problems) or with static machine-learning methods extended to time series. That could alleviate the need to write such interfaces oneself, as is usually done.
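Concretely, what I had in mind was something like this (a sketch, assuming keyword arguments are forwarded to the sklearn estimator the same way fit_intercept was above):

from sklearn.linear_model import Ridge
from hcrystalball.wrappers import get_sklearn_wrapper

model_ridge = get_sklearn_wrapper(Ridge, lags=10, alpha=1.0)  # regularized autoregression
_ = model_ridge.fit(X[:-10], y[:-10]).predict(X[-10:])
model_ridge.model.coef_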


ambader commented 3 years ago

I understand why the horizon is set by default, but if I explicitly try to change it, it should not just be ignored without any comment. And if that behavior is intended, why does predict check for optimize_for_horizon but _predict does not?