Closed chandrevdw31 closed 1 year ago
Based on the discussion on the Slack channel, I believe the problem comes from the way group_by is handled when no group_by argument is specified.
In particular, the code block below shows that one of two different functions is called depending on the hierarchy keyword.
# line 102 /mindsdb/integrations/handlers/statsforecast_handler/statsforecast_handler.py
if 'group_by' not in time_settings:
    # add group column
    group_col = '__groupy_by'
    time_settings["group_by"] = [group_col]

model_args["group_by"] = time_settings["group_by"]
model_args["frequency"] = (
    using_args["frequency"] if "frequency" in using_args else infer_frequency(df, time_settings["order_by"])
)
model_args["hierarchy"] = using_args["hierarchy"] if "hierarchy" in using_args else False
if model_args["hierarchy"]:
    training_df, hier_df, hier_dict = get_hierarchy_from_df(df, model_args)
    self.model_storage.file_set("hier_dict", dill.dumps(hier_dict))
    self.model_storage.file_set("hier_df", dill.dumps(hier_df))
else:
    training_df = transform_to_nixtla_df(df, model_args)
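To make the branching easier to follow, here is a minimal standalone sketch of the same decision. It is not the handler code: `resolve_training_path` and its `placeholder` parameter are hypothetical names used only for illustration, and the function just reports which of the two functions would be called.

```python
def resolve_training_path(time_settings, using_args, placeholder="__group_by"):
    """Hypothetical stand-in for the handler's branching (illustration only).

    Returns the effective group_by list and the name of the function
    that the handler would call for these settings.
    """
    settings = dict(time_settings)  # copy so the caller's dict is not mutated
    if "group_by" not in settings:
        # No group_by given: fall back to a single placeholder column name
        settings["group_by"] = [placeholder]
    hierarchy = using_args.get("hierarchy", False)
    path = "get_hierarchy_from_df" if hierarchy else "transform_to_nixtla_df"
    return settings["group_by"], path


# No group_by, but a hierarchy is requested: the hierarchy path receives
# a placeholder column name that does not exist in the user's data.
group_by, path = resolve_training_path({"order_by": "ds"}, {"hierarchy": ["region"]})
```

In this sketch the combination "no group_by, hierarchy set" is the case where the placeholder name reaches get_hierarchy_from_df.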
I checked the transform_to_nixtla_df function, and it seems to handle a dummy group_by keyword just fine.
# line 34 in mindsdb/integrations/utilities/time_series_utils.py
if group_col not in df.columns:
    # add to dataframe
    nixtla_df[group_col] = '1'
However, get_hierarchy_from_df() does not check whether the column specified by the group_by keyword exists, so it might trigger the creation of a new column filled with NaN:
# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])
    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict
I suggest changing the aforementioned function to this:
# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])
    # BEGIN MODIFICATION
    # force creation of a column that can be used as unique_id by statsforecast
    # (group_by is a list of column names, so each name is checked individually,
    # and only after nixtla_df has been created)
    for col in model_args["group_by"]:
        if col not in nixtla_df.columns:
            # add to dataframe
            nixtla_df[col] = '1'
    # END MODIFICATION
    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict
So that the column specified by the group_by key is always created.
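The guard can be tried in isolation with plain pandas. This is a minimal sketch under stated assumptions, not the MindsDB implementation: `ensure_group_columns` is a hypothetical helper name, and it assumes group_by is a list of column names, as in the handler snippet quoted earlier.

```python
import pandas as pd


def ensure_group_columns(df, group_cols, fill_value="1"):
    """Return a copy of df in which every column in group_cols exists as strings.

    Hypothetical helper for illustration; not part of MindsDB.
    """
    out = df.copy()
    for col in group_cols:
        if col not in out.columns:
            # Missing grouping column: fill with a constant so that
            # all rows are treated as one single series
            out[col] = fill_value
        # Grouping columns are expected in string format
        out[col] = out[col].astype(str)
    return out


raw = pd.DataFrame(
    {"ds": pd.date_range("2023-01-01", periods=3, freq="D"), "y": [1.0, 2.0, 3.0]}
)
prepared = ensure_group_columns(raw, ["__group_by"])
```

Working on a copy keeps the caller's dataframe untouched, which makes the helper safe to call on data that is reused later.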
Ok, a couple of things:
group by
was implemented in MindsDB#7082, so he should be unblocked.
staging
, so I will close this and reopen if the user runs into this error again.
Short description of current behavior
A user tried creating a stats forecast model but forgot to create the engine for stats forecast, and got the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
https://mindsdbcommunity.slack.com/archives/C01S2T35H18/p1689695859513139
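When chasing an error like this, it can help to check the training data for non-finite values before creating the model. The following is a minimal sketch, assuming a numeric target column; `count_non_finite` is a hypothetical helper and not part of MindsDB.

```python
import numpy as np
import pandas as pd


def count_non_finite(df, column):
    """Count NaN and +/-infinity entries in a numeric column.

    Hypothetical helper for illustration; not part of MindsDB.
    """
    # Coerce to float so that non-numeric entries also show up as NaN
    values = pd.to_numeric(df[column], errors="coerce").to_numpy(dtype="float64")
    return int((~np.isfinite(values)).sum())


sample = pd.DataFrame({"y": [1.0, float("nan"), float("inf"), 4.0]})
bad = count_non_finite(sample, "y")
```

A non-zero count only says that the input itself contains such values. A count of zero suggests they are introduced later in the pipeline.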