chandrevdw31 commented 1 year ago

Short description of current behavior

A user tried to create a StatsForecast model but forgot to create the StatsForecast engine first, and got the error ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

https://mindsdbcommunity.slack.com/archives/C01S2T35H18/p1689695859513139

pedrofluxa commented 1 year ago

Description

Based on the discussion on the Slack channel, I believe the problem comes from the way group_by is handled when no group_by argument is specified.

First pass analysis

In particular, the code block below indicates that two different functions are triggered depending on the hierarchy keyword.

# line 102 /mindsdb/integrations/handlers/statsforecast_handler/statsforecast_handler.py
        if 'group_by' not in time_settings:
            # add group column
            group_col = '__groupy_by'
            time_settings["group_by"] = [group_col]

        model_args["group_by"] = time_settings["group_by"]
        model_args["frequency"] = (
            using_args["frequency"] if "frequency" in using_args else infer_frequency(df, time_settings["order_by"])
        )
        model_args["hierarchy"] = using_args["hierarchy"] if "hierarchy" in using_args else False
        if model_args["hierarchy"]:
            training_df, hier_df, hier_dict = get_hierarchy_from_df(df, model_args)
            self.model_storage.file_set("hier_dict", dill.dumps(hier_dict))
            self.model_storage.file_set("hier_df", dill.dumps(hier_df))
        else:
            training_df = transform_to_nixtla_df(df, model_args)

I checked the transform_to_nixtla_df function and it seems to me it handles a dummy group_by keyword just fine.

# line 34 in mindsdb/integrations/utilities/time_series_utils.py
    if group_col not in df.columns:
        # add to dataframe
        nixtla_df[group_col] = '1'

However, get_hierarchy_from_df() does not check whether the column specified by the group_by keyword exists, so it might trigger the creation of a new column filled with NaN.

# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])
    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict
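
One way to test this hypothesis outside MindsDB is a small pandas-only sketch of the casting loop. The toy frame and the helper name cast_group_columns are made up for illustration, and '__group_by' merely stands in for the placeholder name the handler adds. In plain pandas, reading a column that does not exist raises a KeyError instead of creating a NaN-filled column, so the loop would be expected to fail at that point when the placeholder column is absent.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=3, freq="D"),
    "y": [1.0, 2.0, 3.0],
})

group_by = ["__group_by"]  # placeholder name, not present in df


def cast_group_columns(frame, columns):
    """Mimic the casting loop: convert every grouping column to str."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype(str)  # raises KeyError if col is missing
    return out


try:
    cast_group_columns(df, group_by)
    missing_column_error = None
except KeyError as exc:
    missing_column_error = exc

print(type(missing_column_error).__name__)
```

If this holds in the handler as well, the hierarchy path would stop with a KeyError on the placeholder column rather than pass NaN values downstream.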

Suggestion

I suggest changing the aforementioned function to this:

# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])

    # BEGIN MODIFICATION
    # force creation of the column(s) that statsforecast can use as unique_id;
    # group_by is a list, and this has to run after nixtla_df has been created
    for col in model_args["group_by"]:
        if col not in nixtla_df.columns:
            # add to dataframe
            nixtla_df[col] = '1'
    # END MODIFICATION

    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict

This way, the column specified by the group_by key is always created.
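
The guard can also be tried in isolation. The following is a minimal sketch, assuming group_by is a list of column names as in the handler code quoted earlier; ensure_group_columns is a hypothetical helper written for this example and is not part of MindsDB.

```python
import pandas as pd


def ensure_group_columns(frame, group_by, fill_value="1"):
    """Return a copy of frame in which every name in group_by is a str column.

    Hypothetical helper for illustration only; not part of MindsDB.
    """
    out = frame.copy()
    for col in group_by:
        if col not in out.columns:
            out[col] = fill_value  # constant placeholder: all rows form one series
        out[col] = out[col].astype(str)  # grouping columns need to be strings
    return out


df = pd.DataFrame({
    "ds": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "y": [1.0, 2.0],
    "store": [10, 20],
})

result = ensure_group_columns(df, ["store", "__group_by"])
print(result[["store", "__group_by"]])
```

Existing grouping columns are only cast to strings, while a missing one is filled with a constant so that all rows belong to a single series. Working on a copy keeps the caller's dataframe unchanged.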

paxcema commented 1 year ago

Ok, a couple of things:

I will close MindsDB#7177: it does error out, but it is not the root cause here, because hierarchical reconciliation is not being used in this case. A fix for group by was implemented in MindsDB#7082, so the user should be unblocked.
The original error in the Slack message actually comes from within Lightwood and is easy to replicate with the attached file. The model trains correctly on staging, so I will close this and reopen it if the user runs into the error again.
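
When replicating an error like this from a data file, it can help to first see which numeric columns hold NaN or infinite values. The sketch below is a generic pandas/NumPy check, not what Lightwood does internally; the function name non_finite_counts and the sample frame are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def non_finite_counts(frame):
    """Count NaN and +/-inf values in each numeric column of frame."""
    numeric = frame.select_dtypes(include="number")
    mask = ~np.isfinite(numeric.to_numpy(dtype="float64"))
    return pd.Series(mask.sum(axis=0), index=numeric.columns)


# Hypothetical sample data with one NaN and one inf in column "y".
df = pd.DataFrame({
    "y": [1.0, np.nan, np.inf],
    "x": [1, 2, 3],
    "name": ["a", "b", "c"],
})

counts = non_finite_counts(df)
print(counts[counts > 0])
```

Non-numeric columns are skipped, so the result only covers columns that could trigger a float64 validation error.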

mindsdb / lightwood

[StatsForecast] ValueError: Input contains NaN, infinity or a value too large for dtype('float64') #1180
