linkedin / greykite

A flexible, intuitive and fast forecasting library
BSD 2-Clause "Simplified" License

Input Data Has Many Null Values #100

Closed Pushkaran-P closed 1 year ago

Pushkaran-P commented 1 year ago

Hello all,

I am trying to fit my time series data with Greykite using the Silverkite algorithm. The model works, but there are some changes I'd like to avoid. The dataset I am using is df.csv. `

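import numpy as np
from greykite.framework.templates.autogen.forecast_config import (
    ComputationParam, EvaluationPeriodParam, ForecastConfig,
    MetadataParam, ModelComponentsParam)
from greykite.framework.templates.forecaster import Forecaster
from greykite.framework.templates.model_templates import ModelTemplateEnum

def getchangepoint(tempdf):
    # Longer series keep a larger share of their end free of changepoints.
    n = len(tempdf['time'])
    if n >= 36:
        ncp = 0.2
    elif n >= 24:
        ncp = 0.15
    elif n >= 12:
        ncp = 0.1
    else:
        ncp = 0.05

    changepoints = {
        "changepoints_dict": dict(
            method="auto",
            regularization_strength=None,
            no_changepoint_proportion_from_end=ncp)
    }
    return changepoints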

def getmodelcomponentsparam(df_merged):
    changepoints = getchangepoint(df_merged)
    seasonality = dict(
        yearly_seasonality="auto",
        quarterly_seasonality=False,
        monthly_seasonality=False,
        weekly_seasonality=False,
        daily_seasonality=False)
    growth = dict(growth_term=["linear", "quadratic"])
    events = dict(holiday_lookup_countries=None)
    uncertainty = {"uncertainty_dict": ["auto", None]}
    custom = dict(
        fit_algorithm_dict=[dict(fit_algorithm="ridge"), dict(fit_algorithm="linear")],
        feature_sets_enabled=None,
        max_daily_seas_interaction_order=None,
        max_weekly_seas_interaction_order=None,
        extra_pred_cols=None)
    hyperparameter_override = dict(
        degenerate__drop_degenerate=True,
        input__response__null__impute_algorithm=None,
        input__response__null__impute_all=False,
        input__regressors_numeric__null__impute_algorithm=None,
        input__regressors_numeric__null__impute_all=False)

    model_components = ModelComponentsParam(
        seasonality=seasonality,
        growth=growth,
        events=events,
        changepoints=changepoints,
        autoregression=None,
        uncertainty=uncertainty,
        custom=custom,
        hyperparameter_override=hyperparameter_override)
    return model_components

def silverkitemodel(df_merged, firstnonzeropos):
    model_components = getmodelcomponentsparam(df_merged)
    metadata = MetadataParam(time_col="time", value_col="val", freq="MS")
    forecast_horizon = 12 if len(df_merged['time']) > 36 else 6
    test_horizon = 3
    forecaster = Forecaster()
    result = forecaster.run_forecast_config(
        df=df_merged,
        config=ForecastConfig(
            model_template=ModelTemplateEnum.SILVERKITE.name,
            forecast_horizon=forecast_horizon,
            coverage=0.95,
            metadata_param=metadata,
            model_components_param=model_components,
            computation_param=ComputationParam(hyperparameter_budget=4, n_jobs=-1, verbose=1),
            evaluation_period_param=EvaluationPeriodParam(cv_max_splits=2, test_horizon=test_horizon)
        )
    )

    forecast = result.forecast
    # NOTE: last_col_index is not defined in this snippet; it is presumably set elsewhere in the script.
    forecast_values = forecast.df[last_col_index-firstnonzeropos:]['forecast'].values
    actual_val_train = forecast.df[:last_col_index-test_horizon]['actual'].values
    train_val = forecast.df[:last_col_index-test_horizon]['forecast'].values
    # Accuracy is computed as 1 - (sum of absolute errors / sum of actuals).
    train_acc = 1 - np.sum(np.absolute(actual_val_train - train_val))/np.sum(actual_val_train)

    actual_val_test = forecast.df[last_col_index-test_horizon:last_col_index]['actual'].values
    test_val = forecast.df[last_col_index-test_horizon:last_col_index]['forecast'].values
    test_acc = 1 - np.sum(np.absolute(actual_val_test - test_val))/np.sum(actual_val_test)
    #print(result.model[-1].summary(max_colwidth=30).info_dict)

    return forecast_values, train_acc, test_acc

` Consider firstnonzeropos to be 0. This is the error message I'm facing:

error

Best Regards, Pushkar

sayanpatra commented 1 year ago

Your dates are in a nonstandard format: %d/%m/%y instead of %m/%d/%y. As a result, the default numpy datetime format inference was failing. Pass date_format="%d/%m/%y" in MetadataParam to fix this; see the attached image.

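import pandas as pd
from greykite.framework.input.univariate_time_series import UnivariateTimeSeries

df = pd.read_csv("/Users/sapatra/Downloads/df.csv")
ts = UnivariateTimeSeries()
# Parse the time column explicitly as day/month/year.
ts.load_data(df=df, time_col="time", value_col="val", date_format="%d/%m/%y")
# ts.df.head(25)
ts.plot()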

newplot
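To illustrate why the explicit date_format matters, here is a minimal sketch using only the Python standard library. The sample strings are hypothetical (they are not taken from df.csv), and parses_as is a helper introduced purely for illustration. It shows that a day-first string is either rejected or silently misread when parsed as month-first:

```python
from datetime import datetime

def parses_as(date_string, date_format):
    """Return the parsed datetime, or None if the string does not match the format."""
    try:
        return datetime.strptime(date_string, date_format)
    except ValueError:
        return None

# Hypothetical day-first sample values (assumed, not read from df.csv).
samples = ["01/02/20", "13/02/20", "31/12/20"]

for s in samples:
    day_first = parses_as(s, "%d/%m/%y")    # intended reading
    month_first = parses_as(s, "%m/%d/%y")  # what a month-first guess would produce
    print(s, "| day-first:", day_first, "| month-first:", month_first)
```

Strings whose first field is greater than 12 cannot be parsed as month-first at all, while the others parse without error but land on the wrong date, which is presumably why automatic inference cannot be relied on for this file.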

A few suggestions for model training:

  1. You have only 56 data points, so start with the simplest model.
  2. In my experience, ridge works best for prediction when there are many features.
  3. Try the monthly template by passing model_template=ModelTemplateEnum.SILVERKITE_MONTHLY.name.
  4. Check the monthly_tutorial for an in-depth explanation: https://github.com/linkedin/greykite/blob/master/docs/nbpages/tutorials/0200_monthly_data.py
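
Put together, the suggestions above might look roughly like the sketch below. It is an assumption-laden starting point, not a tuned configuration: build_monthly_config is a hypothetical helper name, the horizon and coverage values are simply copied from the question, and the greykite imports sit inside the function so the snippet can be loaded even where greykite is not installed.

```python
DATE_FORMAT = "%d/%m/%y"  # day-first format, as identified in this thread

def build_monthly_config(forecast_horizon=6):
    """Build a ForecastConfig for the monthly Silverkite template (requires greykite)."""
    from greykite.framework.templates.autogen.forecast_config import (
        ForecastConfig, MetadataParam, ModelComponentsParam)
    from greykite.framework.templates.model_templates import ModelTemplateEnum

    # Column names and frequency follow the question; date_format is the fix.
    metadata = MetadataParam(
        time_col="time", value_col="val", freq="MS", date_format=DATE_FORMAT)
    # Keep the model simple: override only the fit algorithm and leave
    # everything else at the template defaults.
    model_components = ModelComponentsParam(
        custom=dict(fit_algorithm_dict=dict(fit_algorithm="ridge")))
    return ForecastConfig(
        model_template=ModelTemplateEnum.SILVERKITE_MONTHLY.name,
        forecast_horizon=forecast_horizon,
        coverage=0.95,
        metadata_param=metadata,
        model_components_param=model_components)
```

The returned object would be passed as config= to Forecaster().run_forecast_config together with the data frame, in the same way as in the question.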