heidelbergcement / hcrystalball

A library that unifies the API for most commonly used libraries and modeling techniques for time-series forecasting in the Python ecosystem.
https://hcrystalball.readthedocs.io/
MIT License

Fix short data CV #70

Open MichalChromcak opened 2 years ago

MichalChromcak commented 2 years ago

In certain cases, the scorer currently stores cv_data incorrectly (see the split numbers in the example output). Plotting and evaluation functionality that relies on cv_data is partially affected.

This MR aims to fix this.
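For orientation, here is a minimal sketch of the split layout one might expect for seven monthly observations with n_splits=4 and horizon=1. It uses scikit-learn's TimeSeriesSplit as a stand-in splitter, which is an assumption: hcrystalball's own splitter may behave differently. The names expected and rows are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Seven month-end dates, written out explicitly to avoid frequency aliases.
index = pd.to_datetime([
    "2021-03-31", "2021-04-30", "2021-05-31", "2021-06-30",
    "2021-07-31", "2021-08-31", "2021-09-30",
])
df = pd.DataFrame({"target": list(range(7))}, index=index)

# Stand-in splitter (assumption): 4 expanding-window splits, 1-step horizon.
splitter = TimeSeriesSplit(n_splits=4, test_size=1)

rows = []
for split, (train_idx, test_idx) in enumerate(splitter.split(df)):
    rows.append({
        "split": split,
        "n_train": len(train_idx),
        "test_date": df.index[test_idx[0]].strftime("%Y-%m-%d"),
    })

expected = pd.DataFrame(rows)
print(expected)
```

Under this assumption each of the four test dates (2021-06-30 to 2021-09-30) gets its own split number, 0 to 3. The first split trains on only three rows, which leaves no complete row for a model with lags=3; that is plausibly related to the NaN prediction in the example, though the thread does not state this explicitly.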

Example:

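import pandas as pd
from sklearn.linear_model import LinearRegression
from hcrystalball.model_selection import ModelSelector
from hcrystalball.wrappers import get_sklearn_wrapper

# Seven month-end observations: deliberately short data
df = pd.DataFrame(
    {"target": list(range(7))},
    index=pd.date_range(start="2021-03-31", end="2021-09-30", freq="M"),
)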
            target
2021-03-31       0
2021-04-30       1
2021-05-31       2
2021-06-30       3
2021-07-31       4
2021-08-31       5
2021-09-30       6

ms = ModelSelector(
    horizon=1,
    frequency="M",
)
ms.create_gridsearch(
    sklearn_models=False,
    n_splits=4,
    between_split_lag=None,
    sklearn_models_optimize_for_horizon=False,
    autosarimax_models=False,
    prophet_models=False,
    tbats_models=False,
    exp_smooth_models=False,
    average_ensembles=False,
    stacking_ensembles=False,
)
ms.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, name="linreg_3", lags=3))
ms.add_model_to_gridsearch(get_sklearn_wrapper(LinearRegression, name="linreg_1", lags=1))

ms.select_model(
    df=df,
    target_col_name="target",
)

print(ms.results[0].cv_data)
            split  y_true  b80cee186b053880a84ec8d7c4692365  e474b1ddba8a0a6f849b49abf903a4e3
2021-07-31      0     4.0                               3.0                               3.0
2021-08-31      1     5.0                               5.0                               4.0
2021-09-30      2     6.0                               6.0                               5.0
2021-06-30      0     3.0                               NaN                               3.0
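
The symptoms in the cv_data printed above can be checked mechanically. The following is a rough sketch using only pandas; the frame is rebuilt by hand from the printed values, and model_a / model_b are hypothetical stand-ins for the two hashed model columns.

```python
import pandas as pd

# Values copied from the printed cv_data; "model_a" and "model_b" are
# hypothetical names replacing the two hashed model columns.
cv_data = pd.DataFrame(
    {
        "split": [0, 1, 2, 0],
        "y_true": [4.0, 5.0, 6.0, 3.0],
        "model_a": [3.0, 5.0, 6.0, float("nan")],
        "model_b": [3.0, 4.0, 5.0, 3.0],
    },
    index=pd.to_datetime(
        ["2021-07-31", "2021-08-31", "2021-09-30", "2021-06-30"]
    ),
)

horizon = 1  # same horizon as in the example

# With horizon=1, every split should contain exactly one row.
rows_per_split = cv_data.groupby("split").size()
oversized_splits = rows_per_split[rows_per_split > horizon].index.tolist()

# Rows should appear in chronological order.
is_chronological = cv_data.index.is_monotonic_increasing

# Four splits were requested, so four distinct labels would be expected.
n_distinct_splits = cv_data["split"].nunique()

print(oversized_splits, is_chronological, n_distinct_splits)
```

On these values, split 0 holds two rows, the index is out of order, and only three distinct split labels appear although n_splits=4 was requested.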

MichalChromcak commented 2 years ago

@pavelkrizek FYI

codecov-commenter commented 2 years ago

Codecov Report

Merging #70 (24bbfb3) into master (11166bd) will decrease coverage by 0.06%. The diff coverage is 90.24%.

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
- Coverage   93.79%   93.73%   -0.07%     
==========================================
  Files          56       56              
  Lines        2853     2888      +35     
==========================================
+ Hits         2676     2707      +31     
- Misses        177      181       +4     

Impacted Files                          Coverage Δ
src/hcrystalball/wrappers/_sklearn.py   94.80% <ø> (ø)
src/hcrystalball/metrics/_scorer.py     90.66% <78.94%> (-4.50%)
tests/unit/metrics/test_scorer.py       93.93% <100.00%> (+1.73%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update eeb32b9...24bbfb3.

pavelkrizek commented 2 years ago

@MichalChromcak Thanks for catching and fixing the bug! Everything looks good from my side. The only remark: the function results_to_cv_data is quite complex and it is hard to see exactly what happens there, so a more descriptive docstring would be helpful.