mikekeith52 / scalecast

The practitioner's forecasting library
MIT License
332 stars 40 forks

Generating new dates / Frequency not understood #66

Closed · raedbsili1991 closed this 1 year ago

raedbsili1991 commented 1 year ago

When creating the Forecaster object and inputting the dataframe, it seems that the object doesn't detect the "Freq", as it shows Freq = None when displaying the Forecaster object:

```python
f = Forecaster(
    y = data['Original Date'],            # required
    current_dates = data['Month Date'],   # required
    future_dates = 18,
    cis = False,    # choose whether or not to evaluate confidence intervals for all models
    metrics = ['mae','r2','rmse','mape'], # the metrics to evaluate when testing/tuning models
)
```

df_.xlsx

Whether I use the "Month Date" or the "Original Date", it always displays None as the frequency.

Actually, the idea was to transform the Quantity, which has irregular original dates, into a monthly Quantity by summing over each month, and then forecast 18 months into the future; that's why I set future_dates = 18.
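For reference, a minimal sketch of that monthly aggregation in pandas (the 'Quantity' column name is an assumption; adjust it to the attached file):

```python
import pandas as pd

data = pd.read_excel('df_.xlsx')  # the dataset attached to this thread

# sum the irregularly dated quantities into month-start buckets
monthly = (
    data.set_index(pd.to_datetime(data['Original Date']))['Quantity']  # 'Quantity' is an assumed column name
        .resample('MS')  # 'MS' = month-start frequency
        .sum()
        .reset_index()
)
```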

The full code:

```python
def forecaster_0(f):
    for m in models:
        f.set_estimator(m)
        f.auto_Xvar_select(estimator=m)
        f.determine_best_series_length(estimator=m)
        f.tune()  # by default, pulls the grid with the same name as the estimator (mlr pulls the mlr grid, etc.)
        f.cross_validate(k=5, verbose=True)
        f.auto_forecast()
        f.drop_all_Xvars()

def forecaster_1(f):
    f.add_metric(custom_metric_3)
    f.set_validation_metric('custom_metric_3')

    #f.generate_future_dates(length_forecast_horizon_deployement)
    #f.set_last_future_date(FORECAST_DATE)
    #f.generate_future_dates(18)
    lengths_to_train_in_past = 20
    f.set_validation_length(lengths_to_train_in_past)
    f.set_test_length(.25)
    #f.eval_cis() # tell the object to build confidence intervals for all models
    f.add_ar_terms(2)
    f.add_AR_terms((2,8))
    f.add_time_trend()

    f.add_seasonal_regressors('month','quarter','week','dayofyear',raw=False,sincos=True)
    f.add_seasonal_regressors('dayofweek','is_leap_year','week',raw=False,dummy=True,drop_first=True)
    f.add_seasonal_regressors('year')

    f.add_sklearn_estimator(StackingRegressor,called='stacking')
    f.add_sklearn_estimator(AdaBoostRegressor,called='adaboost')
    models = ('lasso','gbt','ridge','adaboost','xgboost')
    f.tune_test_forecast(
        models,
        dynamic_testing=18,
        cross_validate=True,
        summary_stats=True,
        dynamic_tuning=True,
        verbose=True,
    )

transformer, reverter = find_optimal_transformation(f)
display(transformer)

pipeline = Pipeline(
    steps = [
        #('Transform',transformer),
        ('Forecast',forecaster_1),
        #('Revert',reverter),
    ]
)

f = pipeline.fit_predict(f)

# for m in models:
#     # f.set_estimator(m)
#     # f.auto_Xvar_select(estimator=m)
#     # f.determine_best_series_length(estimator=m)
#     # f.tune()
#     # f.cross_validate(k=5, verbose=False)
#     # f.auto_forecast()
#     f.drop_all_Xvars()
#     f.set_estimator(m)
#     f.auto_Xvar_select(estimator=m)
#     f.determine_best_series_length(estimator=m)
#     f.tune()
#     f.cross_validate()
#     f.auto_forecast()
#     f.restore_series_length()

f.plot_fitted(order_by='TestSetMAE')  # plot fitted values of all models, ordered by test-set MAE
plt.title(f'{f.estimator} fitted results', size=16)
plt.show()
df_models = plot_test_export_summaries(f)

f.plot(order_by='TestSetMAE')
plt.title(f'{f.estimator} forecasting results', size=16)
plt.show()

data_forecast = f.export(to_excel=True, excel_name=forecasting_file, cis=False)
display(df_models)

df_forecast = pd.read_excel(forecasting_file, sheet_name="lvl_fcsts")
df_forecast['DATE'] = pd.to_datetime(df_forecast['DATE'])
df_forecast = df_forecast[df_forecast['DATE'].dt.weekday == 0]  # keep only Mondays
display(df_forecast)
print("Plotting AutoCorrelation")
f.plot_acf()
```

OUTPUTS/RESULTS:

Although the results on the test set don't seem terrible, the forecasts into the future aren't generated well. I suspect the undetected frequency again; is there maybe another reason I am missing?


| ModelNickname | HyperParams | InSampleMAE | TestSetMAE | InSampleR2 | InSampleRMSE | TestSetR2 | TestSetRMSE |
| -- | -- | -- | -- | -- | -- | -- | -- |
| ridge | {} | 53.592252 | 130.605114 | 0.869073 | 66.893779 | 0.299019 | 165.251654 |
| adaboost | {} | 19.599311 | 151.450000 | 0.972589 | 30.607988 | 0.065381 | 190.813967 |
| lasso | {} | 29.156306 | 170.330127 | 0.968061 | 33.039497 | -0.204901 | 216.654821 |
| gbt | {} | 0.115958 | 196.614226 | 0.999999 | 0.132265 | -0.162170 | 212.778432 |
| xgboost | {} | 0.000439 | 197.913553 | 1.000000 | 0.000614 | -0.532267 | 244.320503 |

FORECASTING VALUES (supposed to be 18 months into the future):

| DATE | lasso | gbt | ridge | adaboost | xgboost |
| -- | -- | -- | -- | -- | -- |
| 2023-05-08 | 385.178400 | 660.264735 | 436.787649 | 640.0 | 473.736572 |
| 2023-05-15 | 431.244985 | 637.779493 | 463.456754 | 640.0 | 473.736572 |


And as a consequence, f.seasonal_decompose() as well as the pipeline_backtest don't work.

Also, find_optimal_transformation wasn't useful; it significantly degraded the results.

mikekeith52 commented 1 year ago

So scalecast ports the pandas frequency logic for dates, and that's been pretty reliable for monthly frequencies in my experience. Can you share the array of dates you tried throwing into the Forecaster object? Sometimes if there is a duplicate or inconsistency, it can confuse the auto-frequency operations.
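For a quick check, a minimal sketch of what pandas' own frequency inference sees in the dates (assuming the attached file and its column names):

```python
import pandas as pd

data = pd.read_excel('df_.xlsx')  # the dataset attached to this thread
# pd.infer_freq() returns None when the spacing between dates is irregular
# or when some periods are missing -- the same None the Forecaster reports
print(pd.infer_freq(pd.DatetimeIndex(data['Month Date']).sort_values()))
```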

Thanks,

raedbsili1991 commented 1 year ago

> So scalecast ports the pandas frequency logic for dates, and that's been pretty reliable for monthly frequencies in my experience. Can you share the array of dates you tried throwing into the Forecaster object? Sometimes if there is a duplicate or inconsistency, it can confuse the auto-frequency operations.
>
> Thanks,

Thanks for the rapid response. Sure, it is in the thread above; I attached the dataframe.

mikekeith52 commented 1 year ago

I see the issue. The dataset you provided is missing several months -- April, May, and November of 2020; February of 2022; and February of 2023. I would suggest adding the missing dates and filling them in with some logical value (are they missing because they are 0, for example?). Or, if you feel you should not do that, feed the data into the Forecaster object with a numerical index in lieu of a date (one that counts from 0 through the length of the time series). This would cause the monthly seasonality to go undetected by the object, but you can force it to detect 12 as a seasonal cycle by using f.add_cycle(12) or f.auto_Xvar_select(irr_cycles=[12]). However, because of the missing dates, the seasonal cycle is not really 12, so that might make the accuracy of the model degrade. Filling in the missing dates would really be the best option, in my opinion, if you can.
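A minimal sketch of the first option, filling the gaps with pandas before building the Forecaster object (column names taken from elsewhere in this thread):

```python
import pandas as pd

data = pd.read_excel('df_.xlsx')  # the dataset attached to this thread
y = data.set_index('Month Date')['Monthly Quantity']

# reindex onto a complete month-start range; the absent months become 0
full_index = pd.date_range(y.index.min(), y.index.max(), freq='MS')
y = y.reindex(full_index, fill_value=0)
```

The filled series and its index can then be passed to the Forecaster object as y= and current_dates=.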

raedbsili1991 commented 1 year ago

Thank you. I think adding a function that adds a "0" (or an optional input, such as another value or a NaN) for each missing month could be useful, as this occurs often in time series forecasting.

I tried feeding the data under a numerical index; however, that significantly degrades the performance.

mikekeith52 commented 1 year ago

Yes, assistance with missing-value imputation has been on my list of to-dos for a while. I will start working on it. If you need anything else related to this issue, please feel free to respond. Otherwise, let me know if I can close it.

raedbsili1991 commented 1 year ago

Okay, thank you. One last question about the seasonality here: is there any "trick" to manually setting the right f.add_ar_terms() and f.add_AR_terms() after visually inspecting the time series plot (or an ACF/PACF plot)?

mikekeith52 commented 1 year ago

I'm not sure there is a consensus "best way" to do that. You might try running an auto SARIMA model and seeing what lag order comes from that. But generally, after looking at ACF and PACF plots, you are trying to find places where the graphs "spike". If there are noticeable seasonal spikes, that could be justification for adding seasonal lags using f.add_AR_terms(). Otherwise, add as many lags using f.add_ar_terms() as there are spikes in the plots.
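For example, a sketch of that workflow (assuming the Forecaster object exposes plot_pacf alongside the plot_acf used earlier in this thread, and that the spike locations below are illustrative):

```python
import matplotlib.pyplot as plt

# look for lags where the correlograms "spike" outside the confidence band
f.plot_acf(lags=26)
plt.show()
f.plot_pacf(lags=26)
plt.show()

# e.g., spikes at the first two lags -> add two short autoregressive lags
f.add_ar_terms(2)
# e.g., a spike recurring every 12 lags -> add 2 seasonal lags spaced 12 apart
f.add_AR_terms((2, 12))
```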

mikekeith52 commented 1 year ago

The function Forecaster_with_missing_vals() has been added to the library. Here's one way you could use it:

```python
from scalecast.util import Forecaster_with_missing_vals
import pandas as pd
import numpy as np

data = pd.read_excel('df_.xlsx') # the dataset attached to this thread

f = Forecaster_with_missing_vals(
    y = data['Monthly Quantity'],
    current_dates = data['Month Date'],
    desired_frequency = 'MS',
    fill_strategy = 0.0, # fills with 0s, but other options are available
    test_length = .25,
    future_dates = 18,
).round()
```
raedbsili1991 commented 1 year ago

> The function Forecaster_with_missing_vals() has been added to the library. Here's one way you could use it:
>
> ```python
> from scalecast.util import Forecaster_with_missing_vals
> import pandas as pd
> import numpy as np
>
> data = pd.read_excel('df_.xlsx') # the dataset attached to this thread
>
> f = Forecaster_with_missing_vals(
>     y = data['Monthly Quantity'],
>     current_dates = data['Month Date'],
>     desired_frequency = 'MS',
>     fill_strategy = 0.0, # fills with 0s, but other options are available
>     test_length = .25,
>     future_dates = 18,
> ).round()
> ```

That's great. I had already done that with a simple function and was going to post it. We can close this thread. Thank you again.