heidelbergcement / hcrystalball

A library that unifies the API for the most commonly used libraries and modeling techniques for time-series forecasting in the Python ecosystem.
https://hcrystalball.readthedocs.io/
MIT License

[OTHER] Feedback on reduction to regression `get_sklearn_wrapper` #46

Open mloning opened 3 years ago

mloning commented 3 years ago

Hi, following our conversation yesterday, I have two questions about `get_sklearn_wrapper`:

1. API: Why do you dynamically create the `__init__` and pass the class, rather than simply using composition and passing an object?

forecaster = get_sklearn_wrapper(DummyRegressor, lags=3)  # you do this
forecaster = get_sklearn_wrapper(DummyRegressor(), lags=3)  # why not this?

2. Algorithm: Why not use standard recursive strategy?

from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.dummy import DummyRegressor
import pandas as pd
import numpy as np

index = pd.date_range("2000", periods=13, freq="Y")
y_train = pd.Series(np.arange(10), index=index[:-3])
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=index[-3:])

model = get_sklearn_wrapper(DummyRegressor, lags=3)
model.fit(X_train, y_train)
model.predict(X_test)
# >>> 2010-12-31    7.0
# >>> 2011-12-31    7.0
# >>> 2012-12-31    7.0

# you use the first 3 values as lagged variables; the DummyRegressor then
# simply computes the mean of the rest, so shouldn't the result be the following?
y_train.iloc[3:].mean()  # >>> 6.0

# the problem seems to be in the way you generate the target series:
# you increase the gap between the lagged variables and the target to match
# the length of the forecasting horizon
X, y = model._transform_data_to_tsmodel_input_format(X_train, y_train, len(X_test))
pd.concat([X, pd.Series(y, index=X.index, name="y")], axis=1).head()
#    lag_0  lag_1  lag_2  y
# 5    2.0    1.0    0.0  5
# 6    3.0    2.0    1.0  6
# 7    4.0    3.0    2.0  7
# 8    5.0    4.0    3.0  8
# 9    6.0    5.0    4.0  9
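For contrast, here is a minimal sketch of the recursive strategy I have in mind (my own illustration, not hcrystalball code): fit the regressor on plain one-step-ahead targets, then feed each forecast back in as a lag for the next step.

lags, horizon = 3, 3
y = pd.Series(np.arange(10), dtype=float)
# one-step-ahead training table: target at t, lags t-1..t-3, no extra gap
X_tab = pd.concat({f"lag_{i}": y.shift(i) for i in range(1, lags + 1)}, axis=1).dropna()
y_tab = y.loc[X_tab.index]  # targets 3..9

reg = DummyRegressor().fit(X_tab, y_tab)

history = list(y)
preds = []
for _ in range(horizon):
    x_new = pd.DataFrame([history[-lags:][::-1]], columns=X_tab.columns)
    y_hat = reg.predict(x_new)[0]
    preds.append(y_hat)
    history.append(y_hat)  # reuse the forecast as a lag for the next step
# DummyRegressor ignores the features, so every step predicts
# y_tab.mean() == 6.0, matching the expectation above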

Hope this helps!

MichalChromcak commented 3 years ago

@mloning regarding the result of the DummyRegressor output: the default strategy of DummyRegressor is "mean", but it is the mean of the whole training dataset the regressor is fit on.

Relevant parts from the DummyRegressor docstring:

Parameters
----------
strategy : str
    Strategy to use to generate predictions.

    * "mean": always predicts the mean of the training set

Examples
--------
>>> import numpy as np
>>> from sklearn.dummy import DummyRegressor
>>> X = np.array([1.0, 2.0, 3.0, 4.0])
>>> y = np.array([2.0, 3.0, 5.0, 10.0])
>>> dummy_regr = DummyRegressor(strategy="mean")
>>> dummy_regr.fit(X, y)
DummyRegressor()
>>> dummy_regr.predict(X)
array([5., 5., 5., 5.])
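A quick check reconciling the two numbers (my own arithmetic, based on the table in the first comment): after the wrapper's transformation, the regressor is fit on targets [5, 6, 7, 8, 9], so the mean of "the whole training dataset" it actually sees is 7.0, while the expected 6.0 corresponds to the targets without the extra gap.

import numpy as np

np.mean([5, 6, 7, 8, 9])        # >>> 7.0, targets after the gapped transformation
np.mean([3, 4, 5, 6, 7, 8, 9])  # >>> 6.0, targets without the extra gap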
MichalChromcak commented 3 years ago

@therhaag any thoughts on the class versus object mentioned by Markus?

therhaag commented 3 years ago

Hi @mloning - sorry for the delayed response. The reason for passing the class rather than using composition is to avoid a custom get_params/set_params implementation that distinguishes between parameters of the wrapper and parameters of the class being wrapped. The sklearn implementation of these functions in base.py relies on discovering all possible parameters of the estimator from the signature of the constructor, so we need to expose those through the constructor of the wrapper. There may be a simpler way to do this, though, closer to how composite models behave - in fact, I remember looking into this option, but I can't recall what the difficulty was, so I'd be happy to consider alternatives.
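A small illustration of the mechanism described above (my sketch; _get_param_names is sklearn's internal helper in base.py): BaseEstimator discovers parameter names from the __init__ signature, which is why the factory merges the wrapped class's constructor parameters into a dynamically generated __init__.

import inspect
from sklearn.dummy import DummyRegressor

# sklearn's BaseEstimator._get_param_names inspects the __init__ signature,
# so only parameters named there become visible to get_params/set_params
list(inspect.signature(DummyRegressor.__init__).parameters)
# >>> ['self', 'strategy', 'constant', 'quantile']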

mloning commented 3 years ago

@therhaag but sklearn supports composition with nested parameters through its double underscore convention; I think that would be the cleaner solution in the end.
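A minimal composition-based sketch of what I mean (a hypothetical wrapper, not hcrystalball's API): with plain attribute assignment in __init__, BaseEstimator.get_params(deep=True) already exposes the wrapped estimator's parameters under the double underscore convention, and set_params accepts them too.

from sklearn.base import BaseEstimator
from sklearn.dummy import DummyRegressor

class ComposedWrapper(BaseEstimator):  # hypothetical, for illustration only
    def __init__(self, estimator=None, lags=3):
        self.estimator = estimator
        self.lags = lags

wrapper = ComposedWrapper(estimator=DummyRegressor(), lags=3)
wrapper.get_params(deep=True)
# >>> {'estimator': DummyRegressor(), 'estimator__constant': None,
# >>>  'estimator__quantile': None, 'estimator__strategy': 'mean', 'lags': 3}
wrapper.set_params(estimator__strategy="median")  # nested params just work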

therhaag commented 3 years ago

Hi @mloning - I think you're right. We added the sklearn wrapper after the other wrappers, which are designed for models that don't follow the sklearn convention themselves, and simply applied the same mechanism. For the sklearn models, one could indeed use composition.