mindsdb / lightwood

Lightwood is Legos for Machine Learning.
GNU General Public License v3.0

[StatsForecast] ValueError: Input contains NaN, infinity or a value too large for dtype('float64') #1180

Closed by chandrevdw31 10 months ago

chandrevdw31 commented 11 months ago

Short description of current behavior

A user tried to create a StatsForecast model but forgot to create the StatsForecast engine first, and got the error "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')".

https://mindsdbcommunity.slack.com/archives/C01S2T35H18/p1689695859513139

pedrofluxa commented 10 months ago

Description

Based on the discussion in the Slack channel, I believe the problem comes from how group_by is handled when no group_by argument is specified.

First pass analysis

In particular, the code block below indicates that two different functions are triggered depending on the hierarchy keyword.

# line 102 in mindsdb/integrations/handlers/statsforecast_handler/statsforecast_handler.py
        if 'group_by' not in time_settings:
            # add group column
            group_col = '__groupy_by'
            time_settings["group_by"] = [group_col]

        model_args["group_by"] = time_settings["group_by"]
        model_args["frequency"] = (
            using_args["frequency"] if "frequency" in using_args else infer_frequency(df, time_settings["order_by"])
        )
        model_args["hierarchy"] = using_args["hierarchy"] if "hierarchy" in using_args else False
        if model_args["hierarchy"]:
            training_df, hier_df, hier_dict = get_hierarchy_from_df(df, model_args)
            self.model_storage.file_set("hier_dict", dill.dumps(hier_dict))
            self.model_storage.file_set("hier_df", dill.dumps(hier_df))
        else:
            training_df = transform_to_nixtla_df(df, model_args)

I checked the transform_to_nixtla_df function and it seems to me it handles a dummy group_by keyword just fine.

# line 34 in mindsdb/integrations/utilities/time_series_utils.py
    if group_col not in df.columns:
        # add to dataframe
        nixtla_df[group_col] = '1'

However, get_hierarchy_from_df() does not check whether the columns named by the group_by keyword exist, so it might trigger the creation of a new column filled with NaN.

# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])
    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict

Suggestion

I suggest changing the aforementioned function to this:

# line 106 in mindsdb/integrations/utilities/time_series_utils.py
def get_hierarchy_from_df(df, model_args):
    """Extracts hierarchy from the raw df, using the provided spec and args.

    The "hierarchy" model arg is a list of format
    [<level 1>, <level 2>, ..., <level n>]
    where each element is a level in the hierarchy.

    We return a tuple (nixtla_df, hier_df, hier_dict) where:
    nixtla_df is a dataframe in the format nixtla packages uses for training
    hier_df is a matrix of 0s and 1s showing the hierarchical structure
    hier_dict is a dictionary with the hierarchical structure. See the unit test
    in tests/unit/ml_handlers/test_time_series_utils.py for an example.
    """
    spec = spec_hierarchy_from_list(model_args["hierarchy"])

    nixtla_df = df.rename({model_args["order_by"]: "ds", model_args["target"]: "y"}, axis=1)
    nixtla_df["ds"] = pd.to_datetime(nixtla_df["ds"])

    # BEGIN MODIFICATION
    # force creation of a column that can be used as unique_id by statsforecast
    # (group_by is a list of column names, so check each one after nixtla_df exists)
    for col in model_args["group_by"]:
        if col not in nixtla_df.columns:
            # add to dataframe
            nixtla_df[col] = '1'
    # END MODIFICATION

    for col in model_args["group_by"]:
        nixtla_df[col] = nixtla_df[col].astype(str)  # grouping columns need to be string format
    nixtla_df.insert(0, "Total", "total")

    nixtla_df, hier_df, hier_dict = aggregate(nixtla_df, spec)  # returns (nixtla_df, hierarchy_df, hierarchy_dict)
    return nixtla_df, hier_df, hier_dict

This way, the column specified by the group_by key is always created.
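To make the idea easy to try in isolation, here is a minimal sketch using only pandas. It assumes group_by is a list of column names, as in the handler snippet above. The helper name ensure_group_columns and the sample data are hypothetical and not part of MindsDB.

```python
import pandas as pd


def ensure_group_columns(df, group_cols, fill_value="1"):
    """Return a copy of df where every column in group_cols exists as strings.

    Hypothetical helper for illustration only; not a MindsDB function.
    """
    out = df.copy()
    for col in group_cols:
        if col not in out.columns:
            # Missing grouping column: add a constant dummy group.
            out[col] = fill_value
        # Grouping columns need to be in string format.
        out[col] = out[col].astype(str)
    return out


# Sample frame with no grouping column at all.
df = pd.DataFrame(
    {
        "ds": pd.date_range("2023-01-01", periods=3, freq="D"),
        "y": [1.0, 2.0, 3.0],
    }
)

# "__groupy_by" is the dummy name set by the handler snippet above.
fixed = ensure_group_columns(df, ["__groupy_by"])
```

The helper works on a copy, so the caller's dataframe is left untouched, and the dummy column is a constant string rather than NaN.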

paxcema commented 10 months ago

Ok, a couple of things: